PMA-2020 / pma-api

The best way to programmatically access PMA2020 data.
http://api.pma2020.org
MIT License

Cache responses whenever a query is made #33

Open joeflack4 opened 6 years ago

joeflack4 commented 6 years ago

Description

We should cache API responses so that we can increase performance.

There are some options: (1) cache every possible query, (2) cache only popular queries, (3) cache query responses one-by-one as queries are actually made.

For this issue, we will go with option (3). The reason is that it is much easier to implement than option (2), which we can always add later. Option (1) is not viable because it would require an unreasonable amount of storage space and take too long to run whenever the data is updated.

Current behavior

Only a few specific queries are cached. These are limited to simple endpoint calls without any query parameters, such as the datalab/init endpoint.

Desired behavior

Whenever a query is made, we will cache the result so that the next time the query is made, the response will be faster (this is shown as option 3 above).
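The cache-on-read flow described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the dictionary stands in for the PostgreSQL cache table, and the function names are hypothetical.

```python
import hashlib

# Hypothetical in-memory stand-in for the project's cache table.
_cache = {}

def cache_key(url):
    # Normalize the request URL into a fixed-length key.
    return hashlib.md5(url.encode("utf-8")).hexdigest()

def get_response(url, compute):
    """Return a cached response if present; otherwise compute, cache
    it (option (3) above), and return it. `compute` stands in for the
    expensive query against the database."""
    key = cache_key(url)
    if key in _cache:
        return _cache[key], True    # cache hit
    value = compute(url)
    _cache[key] = value             # cache as queries are actually made
    return value, False
```

The second identical request is then served from the cache without touching the database.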

Tasks

1. Add functionality to cache query responses one-by-one as queries are actually made.

Approach A: A built-in PostgreSQL solution.

See if the database we use, i.e. PostgreSQL, already has a feature that would let us implement this easily. If it does, it will probably be better than something we could implement ourselves.

Approach B: Implement a custom solution.

We already have a cache table. However, it only caches datalab/init, and it does so once, when the database is first created (see the code in manage.py).

If you look in the database, you should see one row: the cached datalab/init response.

Here's an example of the kind of row that should be added to the cache table any time a user makes a query (i.e. visits a URL / does a GET request to a URL).

Example: User goes to http://api.pma2020.org/v1/datalab/combos?survey=PMA2014_BFR1,PMA2015_BFR2&indicator=mcp_mar&characteristicGroup=age_5yr_int&
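One detail worth noting for URLs like the example above: the same logical query can arrive with its parameters in a different order (or with a trailing `&`). A sketch of a cache key that treats those as the same request, assuming we key on the path plus a canonicalized query string (the function name is illustrative):

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit

def normalized_cache_key(url):
    """Build a cache key that maps equivalent URLs (same parameters in
    any order) to the same cache row."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query))  # canonical parameter order
    canonical = parts.path + "?" + urlencode(params)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

Without this normalization, reordered but equivalent queries would each get their own cache row.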

Other concerns to be addressed

If implementing approach "B", we would need to make sure that requests that have not yet been cached are not slowed down either. If we cache a response before returning it, we should confirm that this does not add significant latency. If it does, we will need to return the response first and cache afterwards. There is no native way in Flask to do this, so some options I can think of are: (i) a multithreaded approach, or (ii) triggering a job via a task queue, e.g. Celery (I've never done this before, and it may be unnecessary or no different in practice from a multithreaded approach). There are also some relevant topics on Stack Overflow (example: https://stackoverflow.com/questions/18082683/need-to-execute-a-function-after-returning-the-response-in-flask).
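Option (i), returning the response first and writing the cache entry in the background, could look roughly like this. A minimal sketch only: `save_to_cache` is a hypothetical stand-in for an INSERT into the cache table, and a real Flask view would need care around app context and error handling in the worker thread.

```python
import threading

def respond_and_cache(response_json, save_to_cache):
    """Return the response immediately and persist the cache entry in
    a background thread, so the uncached request is not slowed down."""
    worker = threading.Thread(target=save_to_cache, args=(response_json,))
    worker.daemon = True  # don't block process shutdown on a cache write
    worker.start()
    return response_json
```

A task queue (option ii) would replace the thread with enqueueing a job, which trades simplicity for durability if the process dies mid-write.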

2. Ensure that when new data updates are made to the DB, all previously cached queries are re-cached

This would be a modification to the manage.py db --overwrite command. That command recreates the database completely, but we would somehow need to maintain a history of every URL that has been queried before and re-cache each one. Right now, the command erases ALL tables. A good solution might be to instead erase every table except the cache table. Then, after the script has run, a follow-up process can go through every row in the cache table and, if the md5 of the current API data file differs from the md5 stored in that row, re-cache that row. This will effectively re-cache every row whenever the data has changed at all, which is OK.
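The re-cache pass described above can be sketched as a pure function over the cache rows. The (url, md5, value) row shape and the names here are illustrative, not the project's actual schema; `fetch` stands in for re-running the query against the rebuilt database.

```python
def refresh_cache(rows, current_md5, fetch):
    """Re-cache stale rows after a data update.

    rows: iterable of (url, cached_md5, cached_value) tuples.
    current_md5: md5 of the current API data file.
    fetch: callable that re-runs the query for a URL.
    """
    refreshed = []
    for url, cached_md5, cached_value in rows:
        if cached_md5 != current_md5:
            # Data file changed since this row was cached: re-query.
            refreshed.append((url, current_md5, fetch(url)))
        else:
            # Row is up to date: keep it as-is.
            refreshed.append((url, cached_md5, cached_value))
    return refreshed
```

Because every stale row keeps its URL, the history of previously queried URLs survives the rebuild, which is the key requirement above.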
