@allthatilk: I think the `join` there just skips datasets with zero resources… You could drop it and instead do an `if not dataset.resources: continue` or equivalent. The resources get lazy loaded, so I don’t think the `join` speeds things up.
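(A minimal sketch of the lazy-load behaviour — assuming `db` and the models import as they do elsewhere in the codebase:)

```python
# Sketch only: each access to dataset.resources below triggers its own
# SELECT, so a join in the outer query doesn't avoid the per-dataset work.
from iatilib import db
from iatilib.model import Dataset

for dataset in db.session.query(Dataset):
    if not dataset.resources:  # lazy load: one extra query per dataset
        continue
    # ... build the per-dataset dict here ...
```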
Related to that, I wonder if this needs a `db.subqueryload` to eager load resources, and avoid the query taking a reeeeally long time. I.e.:

```python
dataset_resources = db.session.query(Dataset).options(db.subqueryload(Dataset.resources))
```
This doesn’t solve the activity count, though, which is still mega slow. I guess it’s probably possible to eager load that too? (Otherwise, consider paginating the results I guess.)
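(On eager loading the counts — a hedged sketch of fetching all the activity counts in one grouped query; I'm guessing at an `Activity.resource_url` foreign key, so adjust to the real schema:)

```python
# Sketch: one grouped query for all per-resource activity counts, instead
# of calling dataset.resources[0].activities.count() once per dataset.
# Assumes Activity rows reference their resource via a resource_url column.
from sqlalchemy import func
from iatilib import db
from iatilib.model import Activity

activity_counts = dict(
    db.session.query(Activity.resource_url, func.count(Activity.resource_url))
    .group_by(Activity.resource_url)
)

# Then inside the loop, look the count up instead of querying:
#   count = activity_counts.get(dataset.resources[0].url, 0)
```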
@andylolz Thanks for the suggestion. I'm working on it now.
As for speed, this is more of a hacky solution to Travis timing out before completing the website test we made for `last_successful_fetch`, as we need a way to monitor the datastore's update process (and catch problems on the publisher end). Basically, as long as it's only doing one slow query, that should (hopefully) be good enough, rather than the only other current option of making a loop of thousands of queries.
So I don’t know if you know this already (apologies if you do!) but you can run a shell with:
```
$ cd IATI-Datastore
$ source env/bin/activate
$ iati shell
```
Then in the shell, you can set this very magical and useful setting:

```python
app.config['SQLALCHEMY_RECORD_QUERIES'] = True
```

…to store all the queries that SQLAlchemy is running. Super helpful for debugging! (NB `app` is already set in the context, so you don’t have to import anything extra to do this.)
Then if you do one necessary import:

```python
from iatilib.model import Dataset
```
…and then run your code: https://github.com/IATI/IATI-Datastore/blob/52c8b2a51a7542006c650557ae7771fd2c80c278/iati_datastore/iatilib/frontend/api1.py#L74-L86
…and then:

```python
from flask_sqlalchemy import get_debug_queries
len(get_debug_queries())
```
…you’ll see that even though it might look like a single query, behind the scenes SQLAlchemy is doing loooots of queries (approx 10,000). So if you don’t explicitly tell SQLAlchemy to load relationships ahead of time, it will likely be just as slow, and Travis will still time out 😭
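(Relatedly, the recorded queries carry `statement`, `parameters` and `duration` attributes, so you can also see which ones are slow:)

```python
# Print the five slowest recorded queries with their timings.
from flask_sqlalchemy import get_debug_queries

for q in sorted(get_debug_queries(), key=lambda q: q.duration, reverse=True)[:5]:
    print("%.4fs  %s" % (q.duration, q.statement[:100]))
```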
If you have a working alternative, a PR would be appreciated.
So here’s the patch I’m suggesting, but note that while it improves matters, it only halves the number of queries. That’s because `dataset.resources[0].activities.count()` still runs lots of queries.
```diff
diff --git a/iati_datastore/iatilib/frontend/api1.py b/iati_datastore/iatilib/frontend/api1.py
index 8e1dc79..1f5319d 100644
--- a/iati_datastore/iatilib/frontend/api1.py
+++ b/iati_datastore/iatilib/frontend/api1.py
@@ -71,10 +71,12 @@ def about_dataset(dataset):
 @api.route('/about/datasets/nest')
 def nest_about_dataset():
     """Output a JSON formatted list of dataset dictionaries containing their resource details."""
-    dataset_resources = db.session.query(Dataset).join(Dataset.resources)
+    dataset_resources = db.session.query(Dataset).options(db.subqueryload(Dataset.resources))
     datasets = dict()
     for dataset in dataset_resources:
+        if len(dataset.resources) == 0:
+            continue
         resources = {
             'url': dataset.resources[0].url,
             'last_fetch': dataset.resources[0].last_fetch.isoformat() if dataset.resources[0].last_fetch else None,
```
I’m sure it’s possible to do better, but I’m also not very familiar with SQLAlchemy.
Ooo! Thanks for the debugging tips! Much appreciated 😄 Your solution works great! We'll see how Travis likes it now; if it's still too slow, we might have to change the approach for the call and make it a basic fetch-status call instead, so we don't have to load the number of datasets.
@allthatilk Is this complete and ready to review? (the Datastore doesn't yet appear to have a `complete` label)
Created a new API call to provide a list of dataset dictionaries with their resources, in order to test `last_successful_fetch` without worrying about timeouts, whilst maintaining backwards compatibility for other Datastore API calls.
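(For reference, a hedged usage sketch of the new endpoint — the `/api/1` mount point, local port, and response shape are assumptions based on the snippet above, not confirmed against a deployment:)

```python
# Query the new nested datasets endpoint and report how many entries came back.
import requests

resp = requests.get("http://localhost:5000/api/1/about/datasets/nest")
resp.raise_for_status()
datasets = resp.json()
print(len(datasets), "datasets with resources")
```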