Description
We should cache API responses to improve performance.
There are some options: (1) cache every possible query, (2) cache only popular queries, or (3) cache each query's response one by one as queries are actually made.
For this issue, we will go with option (3), since it is much easier to implement than option (2), which we can always add later. Option (1) is not a good option because it would require an unreasonable amount of storage space and take too long to re-run whenever the data is updated.
Current behavior
Only a few specific queries are cached. These are limited to simple endpoint calls without any query parameters, such as the `datalab/init` endpoint.
Desired behavior
Whenever a query is made, we will cache the result so that the next time the same query is made, the response is faster (option (3) above).
Tasks
1. Add functionality to cache each query's response as queries are actually made.
Approach A: A built-in PostgreSQL solution.
See if the database we use (PostgreSQL) already has a feature that would let us implement this easily. If it does, it will probably be better than anything we would implement ourselves.
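For what it's worth, PostgreSQL does not ship a per-query response cache; the closest built-in feature is probably a materialized view, which stores a query's result until it is explicitly refreshed. A minimal sketch of the idea (the connection string, view name, and wrapped query here are all hypothetical; a real version would wrap whatever expensive query backs an endpoint):

```python
from sqlalchemy import create_engine, text

# Hypothetical connection string; adjust to the project's actual config.
engine = create_engine("postgresql://localhost/pma_api")

with engine.begin() as conn:
    # A materialized view stores the result of the wrapped query on disk,
    # so reads hit the precomputed rows instead of re-running the query.
    conn.execute(text("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS api_metadata_cache AS
        SELECT * FROM metadata WHERE type = 'api'
    """))
    # After a data update, recompute the stored result in one statement.
    conn.execute(text("REFRESH MATERIALIZED VIEW api_metadata_cache"))
```

The catch is that materialized views cache at the SQL level rather than at the level of serialized JSON responses keyed by URL, so they may not map cleanly onto the cache-table design in approach B below.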
Approach B: Implement a custom solution.
We already have a `cache` table. However, it currently only caches `datalab/init`, and it does so when the database is first created (see the code in `manage.py`).
If you look in the database, you should see 1 row with the following values:

- `key`: `v1/datalab/init`
- `value`: `{"characteristicGroupCategories":[{"characteristicGroups":[{"definition.id":"3-hQWbkf",... }`
- `mimetype`: `application/json`
- `source_data_md5`: `339ce036bdee399d449f95a1d4b3bb8f`

Here's an example of the kind of row you should add to the `cache` table any time a user makes a query (by visiting a URL / doing a GET request to a URL).

Example: the user goes to http://api.pma2020.org/v1/datalab/combos?survey=PMA2014_BFR1,PMA2015_BFR2&indicator=mcp_mar&characteristicGroup=age_5yr_int&

- `key`: `v1/datalab/combos?survey=PMA2014_BFR1,PMA2015_BFR2&indicator=mcp_mar&characteristicGroup=age_5yr_int&`
- `value`: [lots of JSON data would go here]
- `mimetype`: `application/json`
- `source_data_md5`: [You can get the md5 value of the current dataset from the `metadata` table: find the row with type `api` and read its `md5_checksum` field.]
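Here is a minimal sketch of how approach B could look, assuming a SQLAlchemy model over the existing `cache` table (the column names are taken from the rows above; the connection string, model, and helper names are hypothetical):

```python
from flask import Flask, Response, g, request
from flask_sqlalchemy import SQLAlchemy
from sqlalchemy import text

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql://localhost/pma_api"  # hypothetical
db = SQLAlchemy(app)

class Cache(db.Model):
    """Assumed shape of the existing cache table (columns from the row above)."""
    __tablename__ = "cache"
    key = db.Column(db.String, primary_key=True)
    value = db.Column(db.Text)
    mimetype = db.Column(db.String)
    source_data_md5 = db.Column(db.String)

def current_source_md5():
    """Read md5_checksum from the metadata row of type 'api'."""
    row = db.session.execute(
        text("SELECT md5_checksum FROM metadata WHERE type = 'api'")).first()
    return row[0] if row else None

def cache_key():
    """Build keys like 'v1/datalab/combos?survey=...' from the request."""
    key = request.path.lstrip("/")
    if request.query_string:
        key += "?" + request.query_string.decode()
    return key

@app.before_request
def serve_from_cache():
    if request.method != "GET":
        return None
    hit = db.session.get(Cache, cache_key())
    # Only serve the row if it was built against the current dataset;
    # a stale md5 lets the request fall through and be re-cached.
    if hit is not None and hit.source_data_md5 == current_source_md5():
        g.cache_hit = True
        return Response(hit.value, mimetype=hit.mimetype)

@app.after_request
def store_in_cache(response):
    if request.method == "GET" and response.status_code == 200 \
            and not g.get("cache_hit"):
        db.session.merge(Cache(key=cache_key(),
                               value=response.get_data(as_text=True),
                               mimetype=response.mimetype,
                               source_data_md5=current_source_md5()))
        db.session.commit()
    return response
```

Checking `source_data_md5` on read doubles as the invalidation hook for task 2 below: a row written against an older dataset is simply treated as a cache miss.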
Other concerns to be addressed
If we implement approach B, we also need to make sure that requests that have not yet been cached are not slowed down. If we cache a response before returning it, we should verify that doing so does not add significant latency. If it does, we will need to return the response first and cache it afterwards. There is no native way in Flask to do this, so the options I can think of are: (i) a multithreaded approach, or (ii) triggering a job through a task management queue such as Celery (I've never done this before, and it may be unnecessary or no different in practice from the multithreaded approach). There are also some relevant Stack Overflow threads (example: https://stackoverflow.com/questions/18082683/need-to-execute-a-function-after-returning-the-response-in-flask). A sketch of option (i) follows below.
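A minimal sketch of the threaded variant, reusing the `Cache` model and helpers from the sketch above and replacing its synchronous `after_request` hook (a task queue like Celery would be the sturdier choice if cache writes ever get heavy; this is just the simplest shape):

```python
import threading

from flask import g, request

def cache_in_background(flask_app, key, value, mimetype, md5):
    """Write the cache row on a separate thread so the client is not kept waiting."""
    def worker():
        # Background threads need their own application context for db access.
        with flask_app.app_context():
            db.session.merge(Cache(key=key, value=value, mimetype=mimetype,
                                   source_data_md5=md5))
            db.session.commit()
    threading.Thread(target=worker, daemon=True).start()

@app.after_request
def store_in_cache_async(response):
    if request.method == "GET" and response.status_code == 200 \
            and not g.get("cache_hit"):
        # Capture everything on the request thread, then hand off the write.
        cache_in_background(app, cache_key(),
                            response.get_data(as_text=True),
                            response.mimetype, current_source_md5())
    return response
```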
2. Ensure that when new data updates are made to the DB, all previously cached queries are re-cached
This would be a modification to the `manage.py db --overwrite` command. The database would still be rebuilt completely, but we would need to maintain a history of every URL that has been queried before so that everything can be re-cached. Right now, this command erases ALL tables. A good solution might be to instead erase every table except the `cache` table. Then, after the script has run, a follow-up process can go through every row in the cache table and, wherever the `md5` of the current `api` data file differs from the one stored in that row, re-cache it. If the data has changed at all, this will effectively re-cache every row, which is fine. A sketch of this pass follows below.
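A sketch of that re-cache pass, again reusing the `Cache` model and hooks from the sketches above. It replays each stale URL through Flask's test client, so the normal caching hooks overwrite the old row (the function name is made up):

```python
def refresh_stale_cache_rows():
    """Run after `manage.py db --overwrite` rebuilds everything except the cache table."""
    with app.app_context():
        current_md5 = current_source_md5()
        stale_keys = [row.key for row in Cache.query.all()
                      if row.source_data_md5 != current_md5]
    client = app.test_client()  # replays requests in-process, no network needed
    for key in stale_keys:
        # serve_from_cache treats the stale md5 as a miss, so the view re-runs
        # and the after_request hook overwrites the row with a fresh response.
        client.get("/" + key)
```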
Task list
[ ] 1. Add functionality to cache each query's response as queries are actually made.
[ ] 2. Ensure that when new data updates are made to the DB, all previously cached queries are re-cached.