LocalData / localdata-tiles

A tileserver for LocalData

We should consider MongoHQ Elastic for our production DB #98

Closed · prashtx closed this issue 10 years ago

prashtx commented 10 years ago

MongoHQ's Elastic tier seems to provide much better performance (SSD, dedicated RAM that scales with our dataset size) than the MongoLab Sandbox and Shared tiers. Sandbox vs. HQ Elastic is dramatic (to be expected), but MongoLab Shared vs. MongoHQ Elastic is significant.

My dev database, which fits in the Sandbox tier of MongoLab (< 0.5 GB), registers as 2 GB of total usage in MongoHQ's Elastic tier. It's running a replica set, so the members probably count additively, and it likely counts the MongoDB data file size, which grows in discrete steps, not incrementally. So we probably have ~1.5 GB of "usage," which gets rounded up to 2 GB => $36/month. Looking at data file sizes, I think our production database will start at $36/month and will jump to $54/month at the next step. If my understanding is correct, our disk usage will effectively grow in 1.5 GB steps. Costs are prorated to the day, so we can double-check before committing to this for our production DB.
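
The tier arithmetic is easy to sanity-check. Here's a minimal sketch, assuming the figures above work out to $18 per GB per month with billed usage rounded up to a whole GB (my inference from the $18/$36/$54 price points, not MongoHQ's official formula):

# Estimated MongoHQ Elastic monthly cost. The $18/GB rate and the
# round-up-to-whole-GB behavior are assumptions inferred from the
# $18/$36/$54 price points above, not from MongoHQ's docs.
elastic_cost = function(usage_gb, rate_per_gb = 18) {
    ceiling(usage_gb) * rate_per_gb
}

elastic_cost(c(1.5, 3, 4.5))  # usage growing in 1.5 GB steps -> $36 $54 $90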

We probably want something we can run performance tests against, too. We can probably create a small database just for that purpose at $18/month. We might be able to fold it into the production deployment and gain a little cost efficiency, but then we'd risk impacting production performance with testing.

Performance report to follow.

prashtx commented 10 years ago

R code/output for evaluating performance

Setup

library(knitr)     # opts_knit, opts_chunk, and imgur_upload live here
library(reshape)   # melt() for wide-to-long reshaping
library(ggplot2)

# Upload figures to imgur so the rendered report is self-contained
opts_knit$set(self.contained = TRUE, upload.fun = imgur_upload, base.dir = "/tmp")
opts_chunk$set(fig.width = 10, fig.height = 10)

Get baseline data from a couple of runs
Get data with this change incorporated

# Two timing runs each against MongoLab Sandbox, MongoLab Shared, and MongoHQ
# Elastic. Each times.csv is a single column of per-request response times (ms).
raw = data.frame(labsandbox1 = read.csv("http://s3.amazonaws.com/localdata-private/perf-data/localdata-tiles/simulated-flow/2014-03-11T07:38:14.704Z-44138c07df6b732bdd9b0b0f21097e51e4c3b6ae/times.csv", 
    header = FALSE)$V1)
raw$labsandbox2 = read.csv("http://s3.amazonaws.com/localdata-private/perf-data/localdata-tiles/simulated-flow/2014-03-11T07:53:44.829Z-44138c07df6b732bdd9b0b0f21097e51e4c3b6ae/times.csv", 
    header = FALSE)$V1

raw$labshared1 = read.csv("http://s3.amazonaws.com/localdata-private/perf-data/localdata-tiles/simulated-flow/2014-03-12T14:18:42.610Z-a9ceb589d9fee1a50466c6a35e02588ae11067ea/times.csv", 
    header = FALSE)$V1
raw$labshared2 = read.csv("http://s3.amazonaws.com/localdata-private/perf-data/localdata-tiles/simulated-flow/2014-03-12T14:22:35.239Z-a9ceb589d9fee1a50466c6a35e02588ae11067ea/times.csv", 
    header = FALSE)$V1

raw$hq1 = read.csv("http://s3.amazonaws.com/localdata-private/perf-data/localdata-tiles/simulated-flow/2014-03-12T14:27:37.420Z-a9ceb589d9fee1a50466c6a35e02588ae11067ea/times.csv", 
    header = FALSE)$V1
raw$hq2 = read.csv("http://s3.amazonaws.com/localdata-private/perf-data/localdata-tiles/simulated-flow/2014-03-12T14:29:44.713Z-a9ceb589d9fee1a50466c6a35e02588ae11067ea/times.csv", 
    header = FALSE)$V1

# Reshape to long format: one (run, response time) row per request
rawm = melt(raw, id = 0)

# Empirical quantiles, 0th through 100th percentile, for each run
d = data.frame(p = seq(0, 1, 0.01))

d$labsandbox1 = quantile(raw$labsandbox1, d$p)
d$labsandbox2 = quantile(raw$labsandbox2, d$p)
d$labshared1 = quantile(raw$labshared1, d$p)
d$labshared2 = quantile(raw$labshared2, d$p)
d$hq1 = quantile(raw$hq1, d$p)
d$hq2 = quantile(raw$hq2, d$p)

Plot means

means = aggregate(value ~ variable, FUN = mean, data = rawm)
ggplot(means) + aes(x = variable, y = value, fill = variable, label = round(value)) + 
    geom_bar(stat = "identity") + labs(x = "run", y = "mean (ms)", title = "mean response times") + 
    geom_text(vjust = 0)

[Figure: bar chart of mean response times per run]

Compute quantiles

dm = melt(d, id = "p")

Plot perc99

ggplot(dm[dm$p == 0.99, ]) + aes(x = variable, y = value, fill = variable, 
    label = round(value)) + geom_bar(stat = "identity") + labs(x = "run", y = "ms", 
    title = "perc99 response times") + geom_text(vjust = 0)

[Figure: bar chart of 99th-percentile response times per run]
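
For the numbers behind that chart, the same quantile frame can be printed directly (this snippet is an addition on my part; it only reuses the dm frame built above):

# Numeric view of the 99th-percentile response times for each run
dm[dm$p == 0.99, c("variable", "value")]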

Plot quantiles, lower is better.

ggplot(dm) + aes(x = p, y = value, color = variable) + geom_line() + labs(y = "response time (ms)", 
    title = "Response time percentiles")

[Figure: response-time percentile curves for all runs]

prashtx commented 10 years ago

Thoughts @hampelm? The test databases aren't terribly expensive, but there's no sense keeping them provisioned longer than we need.

hampelm commented 10 years ago

What's your preference?

I'd vote for staying on MongoLab Shared for now. The improvements are significant, but there's other stuff we can spend time on, and we haven't had tile-rendering complaints recently.

prashtx commented 10 years ago

I'm leaning pretty heavily toward MongoHQ Elastic. 7 seconds is still a pretty long time to wait for a tile, so I'd like to bring that down before folks start complaining.

We're also about to support a lot more computation with the stats-in-a-polygon code. That should similarly benefit from the MongoHQ setup. We can do some performance tests on that code, though, before switching. Ideally, we can show dashboard users stats for the map view as they pan and zoom. If the MongoHQ setup makes the difference between a fluid experience and a frustrating one, I think we should switch.

hampelm commented 10 years ago

Then let's go for it.
