gazetteerhk / census_explorer

Explore Hong Kong's neighborhoods through visualizations of census data
http://gazetteer.hk
MIT License

Backend Infrastructure #11

Open hxu opened 10 years ago

hxu commented 10 years ago

Should we stay with Google App Engine? We've encountered some performance issues with the DB backend (though maybe we are just using it wrong). Do these issues warrant looking into alternatives?

If so, what should we look at? How do we manage the deploy process?

I've heard good things about Salt and Docker but have not used either. These would let us use any compute instance provider.

hupili commented 10 years ago

I've also heard of the two but have not tried them.

I haven't done thorough research on NDB, but it does not seem designed for relational workloads. There are at least three headaches, all of which would be trivial in a relational DB:

My feeling is that we are using it the wrong way. With a new data schema, some of the above operations can be eliminated.

We would keep only the Datapoint model:

The key is to move anything presentation-related out of Datapoint. That way we can easily pivot the datapoint table. The FE can load a set of dicts that translate identifiers into human-readable strings in different languages.

district and region are subject to further discussion, depending on whether we want to pivot frequently at that granularity. Apps can still work without the two by filtering on a set of CAs. It's a trade-off between index cost and query cost.
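A minimal sketch of the flattened-Datapoint idea above, using plain dicts rather than NDB entities (all field names, CA codes, and label strings here are illustrative, not the project's actual schema):

```python
# Each datapoint carries only identifiers and a value; no display strings.
datapoints = [
    {"ca": "A01", "column": "median_income", "value": 20000},
    {"ca": "A02", "column": "median_income", "value": 24000},
]

# Presentation lives outside the datapoint table: the frontend loads
# small translation dicts mapping identifiers to human-readable strings.
labels = {
    "en": {"median_income": "Median income"},
    "zh": {"median_income": "入息中位數"},
}

# Without a district column on Datapoint, a district-level query filters
# by that district's set of CAs (hypothetical codes here).
district_cas = {"A01", "A02"}
in_district = [d for d in datapoints if d["ca"] in district_cas]
```

This is the query-cost side of the trade-off: each district query pays for the `IN`-style membership filter instead of a stored, indexed `district` property.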

hupili commented 10 years ago

The above is just for census data set.

I'm not sure whether NDB still works well if we add more data sets.

hxu commented 10 years ago

For the CSV download of the data, I think it is better to include district and region, so that people who want the full dataset don't have to join it manually against the district and region maps. If we already have them there, then the only reason to leave them out of the database would be to save space. Our data is not that big, so why not include them? It would also make the front-end code a bit simpler. Maybe set indexed=False on those properties.

hupili commented 10 years ago

:+1: indexed=False

hupili commented 10 years ago

Maybe we still need to index them, to answer questions like: what is the average income of this district?

hxu commented 10 years ago

Probably just pick one and try it out; we can change it later if it doesn't work.


hupili commented 10 years ago

If we stick with GAE hosting, there's a lightweight solution for the backend API: just deploy with the combined CSV. On each API query, load the CSV into pandas directly. Then many operations become trivial.

Loading the CSV takes about 0.3s on my machine, and table manipulation is blazing fast. The only problem is the memory footprint. During the hackathon, we hit this error several times:

Exceeded soft private memory limit with 153.535 MB after servicing 26 requests total.

After loading the combined CSV, the process consumes VSZ=150MB, RSS=54MB. If GAE counts RSS, we are pretty safe.
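A sketch of this lightweight backend, assuming a hypothetical combined CSV with `ca,column,value` rows (here fed from an in-memory string for illustration). Loading at module import time, rather than per request, amortizes the ~0.3s parse across all queries served by one instance:

```python
import io
import pandas as pd

# Stand-in for the combined CSV shipped with the deploy; the real file
# and its column names may differ.
CSV = io.StringIO(
    "ca,column,value\n"
    "A01,median_income,20000\n"
    "A02,median_income,24000\n"
)

# Loaded once when the module is imported; every request reuses this
# in-memory table, which is also what drives the resident memory cost.
_DF = pd.read_csv(CSV)

def query(column):
    """Return {ca: value} for one census column."""
    rows = _DF[_DF["column"] == column]
    return dict(zip(rows["ca"], rows["value"]))
```

On GAE this module-level cache survives for the lifetime of the instance, so the memory limit in the error above is paid once per instance, not per request.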

hxu commented 10 years ago

Interesting idea, but I don't think it's extensible. We already know there are several data sets we want to add to the one we have, so I think we should lay the groundwork with a backend that we know will work.

I'm also open to just using a dedicated SQL server. Or we can deploy on a generic compute instance in Google Cloud or AWS. There's a bit of work involved, but we should be able to get it up and running in a day.


hupili commented 10 years ago

This idea has been implemented in Django: https://github.com/wq/django-rest-pandas

debuggingfuture commented 10 years ago

+1 for SaltStack, very easy to use and extend, especially for Python geeks like you guys.

hxu commented 10 years ago

@hupili once you have time this weekend, I think we should focus first on the performance of the pandas backend. While working on the frontend, I noticed it would sometimes take several seconds for the data to load. It seems like multiple concurrent users could easily break it. I don't understand why it's so slow, since the entire data structure is in memory.

hupili commented 10 years ago

I haven't checked the internals of pandas, but I don't think it has DB-like indexes. Filtering is done by scanning all data points. The per-table optimization I mentioned should work.
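One way to sketch this (interpreting "per-table" as splitting lookups by the `column` field, which is an assumption; the actual optimization referenced above is not spelled out in the thread): a boolean mask scans every row on each query, whereas setting a pandas index lets `.loc` look rows up without a full scan.

```python
import pandas as pd

df = pd.DataFrame({
    "ca": ["A01", "A02", "A01"],
    "column": ["median_income", "median_income", "population"],
    "value": [20000, 24000, 1500],
})

# Boolean filtering: evaluates the predicate against all rows, every time.
scanned = df[df["column"] == "median_income"]

# Indexed lookup: build the index once, then .loc selects by label
# without re-scanning the whole table on each query.
by_column = df.set_index("column")
looked_up = by_column.loc["median_income"]
```

For repeated queries on the same key, paying the one-time `set_index` cost and reusing the indexed frame is the pandas analogue of a DB index.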