RDCEP / EDE

MIT License

Intermediate operations and cache #16

Closed njmattes closed 8 years ago

njmattes commented 8 years ago

@ricardobarroslourenco @legendOfZelda Imagine a user selects two grids, A and B, and wants to view the result of A / B (let's call it AB*) aggregated to GADM0. Easy enough to send a JSON blob of operation parameters to the API, and send a JSON representation of a subset of the result back to the GUI. But then imagine they want to see AB* aggregated to different regions, GADM1 or FPU or whatever.
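For concreteness, the request might look something like the following. This is only a sketch; the field names are illustrative, not the actual EDE API schema.

```python
import json

# Hypothetical operation-request blob for "A / B aggregated to GADM0".
# All field names are illustrative assumptions, not the real EDE schema.
request_blob = {
    "operands": ["grid_A", "grid_B"],  # dataset identifiers
    "operation": "divide",             # element-wise A / B
    "aggregate_to": "GADM0",           # target region level
}

payload = json.dumps(request_blob)
print(payload)
```

The GUI would POST this blob, and a follow-up request for GADM1 would differ only in the `aggregate_to` field, which is exactly why caching the intermediate AB* matters.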

If AB* isn't cached / stored anywhere, the calculation needs to be done from scratch for each request. Because we've decided that the API should be (relatively) RESTful, it should also strive to be stateless, so AB* really shouldn't live in a user session. So, do we store these intermediate results in flask's cache? Cache them in the database? Recalculate them on each request?

njmattes commented 8 years ago

Hmm. If all the ops are happening in the db, and happening on whatever (presumably?) binary representations of the gridded data and polygons are stored in the db, then caching AB* won't do us any good anyway, unless it's a cache in the db itself. By the time flask has any data to cache, it'll be either a python object or a JSON blob—neither of which will be useful to the db ops. Is that right?

ghost commented 8 years ago

so we have 2 options: cache within the db or outside of it, i.e. within flask. let's look at the first option.

say the db computes A / B aggregated to the GADM0 level. db-internally, this amounts to (at least) two steps: computing A / B and then aggregating to GADM0. the db might not necessarily materialize A / B (and thus cache it), but we can probably tell it to do so.
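the two-step shape of this query can be sketched as below. sqlite3 is used here only so the example runs; in Postgres the intermediate would be a temp table or a materialized view, and the table/column names are made up for illustration.

```python
import sqlite3

# Illustrative only: stand-in schemas for two gridded datasets, where
# each row is (cell id, containing region, value).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE grid_a (cell INTEGER, region TEXT, value REAL)")
cur.execute("CREATE TABLE grid_b (cell INTEGER, region TEXT, value REAL)")
cur.executemany("INSERT INTO grid_a VALUES (?, ?, ?)",
                [(1, "r1", 10.0), (2, "r1", 20.0), (3, "r2", 30.0)])
cur.executemany("INSERT INTO grid_b VALUES (?, ?, ?)",
                [(1, "r1", 2.0), (2, "r1", 4.0), (3, "r2", 5.0)])

# step 1: materialize the intermediate result AB* = A / B.
# in Postgres this would be CREATE TEMP TABLE / materialized view.
cur.execute("""
    CREATE TEMP TABLE ab AS
    SELECT a.cell, a.region, a.value / b.value AS value
    FROM grid_a a JOIN grid_b b ON a.cell = b.cell
""")

# step 2: aggregate the materialized AB* to the region level.
# a later request for a different region level can reuse the ab table.
rows = cur.execute(
    "SELECT region, AVG(value) FROM ab GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('r1', 5.0), ('r2', 6.0)]
```

because step 2 reads from the materialized `ab` table, a subsequent aggregation to another region level skips the division entirely.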

if we want to cache outside of the db, yes, it's going to be trickier, certainly for this kind of subsequent query. first, it's only going to work if the user requests an aggregation to GADM1 first and then to GADM0, not the other way round (unless of course the db returns not just the aggregation of A / B but also A / B itself, which is going to be much larger, so not what we want to do). second, in the GADM1-then-GADM0 case we would have to take the GADM1 aggregates and aggregate those again to GADM0, but this would have to be done within python. that's going to be ugly, because we would have to know which GADM1 regions are contained in which GADM0 regions, which is something the db certainly knows, and we would be reimplementing it from scratch within python.
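the python-side re-aggregation would look roughly like this. note the caveats: it only works for decomposable statistics (sums, counts, weighted means), and it needs a GADM1-to-GADM0 containment map that otherwise only the db knows. region names and the mapping here are hypothetical.

```python
# Hypothetical GADM1-level sums, keyed by region id.
gadm1_sums = {"USA.1": 4.0, "USA.2": 6.0, "CAN.1": 3.0}

# Hypothetical containment mapping (GADM1 region -> parent GADM0 region).
# In reality this knowledge lives in the db, which is the objection above.
gadm1_to_gadm0 = {"USA.1": "USA", "USA.2": "USA", "CAN.1": "CAN"}

# Roll the finer aggregates up to the coarser level.
gadm0_sums = {}
for region, value in gadm1_sums.items():
    parent = gadm1_to_gadm0[region]
    gadm0_sums[parent] = gadm0_sums.get(parent, 0.0) + value

print(gadm0_sums)  # {'USA': 10.0, 'CAN': 3.0}
```

and this fails outright for non-decomposable statistics like medians, which is another reason to keep the aggregation in the db.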

however, what we can do of course is after both GADM0 and GADM1 aggregations have completed, have both of them cached on some kind of tile server (which actually does not just store tiles but also aggregation results).

coming back to the caching-within-the-db solution: the db should have a cache per user and a global cache. these can be configured of course, but the per-user cache is usually small and the global cache is shared by the queries of all other users. so i think we're quite limited in how much we can cache within the db. that brings us back to the tile server solution.

ricardobarroslourenco commented 8 years ago

I agree with @legendOfZelda on this. Usually this kind of A / B operation aggregated to GADM0 would be done as an atomic operation, without storing any intermediate results. I actually don't know whether Postgres has capabilities for this, because I saw this in their documentation:

Temporary data files used in larger SQL queries for sorts, materializations and intermediate results are not currently checksummed, nor will WAL records be written for changes to those files

Which indicates to me that this kind of caching would have to happen at the EDE (python) level. It could be done via object caching, but that backend would be more CPU intensive.
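Object caching at the python level could be as simple as memoizing the aggregation call on its parameters, e.g. with `functools.lru_cache`. A minimal sketch, where the function body is a stand-in for the real db query:

```python
from functools import lru_cache

# Track how often the (stand-in) db query actually runs.
calls = []

@lru_cache(maxsize=128)
def aggregate(operand_a, operand_b, region_level):
    """Stand-in for the expensive db-side computation; memoized on its
    (hashable) parameters."""
    calls.append((operand_a, operand_b, region_level))
    return f"result of {operand_a}/{operand_b} at {region_level}"

aggregate("A", "B", "GADM0")
aggregate("A", "B", "GADM0")  # served from the cache, no second db hit
print(len(calls))  # 1
```

The trade-off mentioned above applies: the cached python objects cost memory in the flask process, and large aggregation results would push toward an external cache instead.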

njmattes commented 8 years ago

So is not caching any of these operations an option? I.e., are the ops fast enough in the db that there's no need for caching? If we have datasets that undergo continuous updating, I can imagine we'll eventually run into some lag from table locks. But perhaps we should worry about that only when we have to.

ricardobarroslourenco commented 8 years ago

Well, I actually don't know if that's possible. I need to look into this more, because I'm not used to working with Postgres. I agree on the point about the future: once we're loading more data and handling more user requests, issues such as caching and distributing processing will emerge.

njmattes commented 8 years ago

I think that's reasonable. I'm closing this issue for now, so we can focus on other items.