@allthatilk: I think the `join` there just skips datasets with zero resources… You could drop it and instead do an `if not dataset.resources: continue` or equivalent. The resources get lazy loaded, so I don’t think the `join` speeds things up.
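(A minimal sketch of the lazy-load behaviour — assuming `db` and the models import as they do elsewhere in the codebase:)

```python
# Sketch only: each access to dataset.resources below triggers its own
# SELECT, so a join in the outer query doesn't avoid the per-dataset work.
from iatilib import db
from iatilib.model import Dataset

for dataset in db.session.query(Dataset):
    if not dataset.resources:  # lazy load: one extra query per dataset
        continue
    # ... build the per-dataset dict here ...
```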
Related to that, I wonder if this needs a `db.subqueryload` to eager load resources, and avoid the query taking a reeeeally long time. I.e.:

```python
dataset_resources = db.session.query(Dataset).options(db.subqueryload(Dataset.resources))
```
This doesn’t solve the activity count, though, which is still mega slow. I guess it’s probably possible to eager load that too? (Otherwise, consider paginating the results I guess.)
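(On eager loading the counts — a hedged sketch of fetching all the activity counts in one grouped query; I'm guessing at an `Activity.resource_url` foreign key, so adjust to the real schema:)

```python
# Sketch: one grouped query for all per-resource activity counts, instead
# of calling dataset.resources[0].activities.count() once per dataset.
# Assumes Activity rows reference their resource via a resource_url column.
from sqlalchemy import func
from iatilib import db
from iatilib.model import Activity

activity_counts = dict(
    db.session.query(Activity.resource_url, func.count(Activity.resource_url))
    .group_by(Activity.resource_url)
)

# Then inside the loop, look the count up instead of querying:
#   count = activity_counts.get(dataset.resources[0].url, 0)
```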
@andylolz Thanks for the suggestion. I'm working on it now.
As for speed, this is more of a hacky solution to Travis timing out before completing the website test we made for `last_successful_fetch`, as we need a way to monitor the datastore's update process (and catch problems on the publisher end). Basically, as long as it's only doing one slow query, that should (hopefully) be good enough, rather than the only other current option of making a loop of thousands of queries.
So I don’t know if you know this already (apologies if you do!) but you can run a shell with:
```
$ cd IATI-Datastore
$ source env/bin/activate
$ iati shell
```
Then in the shell, you can set this very magical and useful setting:

```python
app.config['SQLALCHEMY_RECORD_QUERIES'] = True
```

…to store all the queries that SQLAlchemy is running. Super helpful for debugging! (NB `app` is already set in the context, so you don’t have to import anything extra to do this.)
Then if you do one necessary import:

```python
from iatilib.model import Dataset
```
…and then run your code: https://github.com/IATI/IATI-Datastore/blob/52c8b2a51a7542006c650557ae7771fd2c80c278/iati_datastore/iatilib/frontend/api1.py#L74-L86
…and then:

```python
from flask_sqlalchemy import get_debug_queries
len(get_debug_queries())
```
…you’ll see that even though it might look like a single query, behind the scenes SQLAlchemy is doing loooots of queries (approx 10,000). So if you don’t explicitly tell SQLAlchemy to load relationships ahead of time, it will likely be just as slow, and Travis will still time out 😭
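(Relatedly, the recorded queries carry `statement`, `parameters` and `duration` attributes, so you can also see which ones are slow:)

```python
# Print the five slowest recorded queries with their timings.
from flask_sqlalchemy import get_debug_queries

for q in sorted(get_debug_queries(), key=lambda q: q.duration, reverse=True)[:5]:
    print("%.4fs  %s" % (q.duration, q.statement[:100]))
```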
If you have a working alternative, a PR would be appreciated.
So here’s the patch I’m suggesting, but note that while it improves matters, it only halves the number of queries. That’s because `dataset.resources[0].activities.count()` still runs lots of queries.
```diff
diff --git a/iati_datastore/iatilib/frontend/api1.py b/iati_datastore/iatilib/frontend/api1.py
index 8e1dc79..1f5319d 100644
--- a/iati_datastore/iatilib/frontend/api1.py
+++ b/iati_datastore/iatilib/frontend/api1.py
@@ -71,10 +71,12 @@ def about_dataset(dataset):
 @api.route('/about/datasets/nest')
 def nest_about_dataset():
     """Output a JSON formatted list of dataset dictionaries containing their resource details."""
-    dataset_resources = db.session.query(Dataset).join(Dataset.resources)
+    dataset_resources = db.session.query(Dataset).options(db.subqueryload(Dataset.resources))
     datasets = dict()
     for dataset in dataset_resources:
+        if len(dataset.resources) == 0:
+            continue
         resources = {
             'url': dataset.resources[0].url,
             'last_fetch': dataset.resources[0].last_fetch.isoformat() if dataset.resources[0].last_fetch else None,
```
I’m sure it’s possible to do better, but I’m also not very familiar with SQLAlchemy.
Ooo! Thanks for the debugging tips! Much appreciated 😄 Your solution works great! We'll see how Travis likes it now; if it's still too slow, we might have to change the approach for the call and make it a basic fetch-status call instead, so we don't have to load the number of datasets.
@allthatilk Is this complete and ready to review? (the Datastore doesn't yet appear to have a `complete` label)
Created a new API call to provide a list of dataset dictionaries with their resources, in order to test `last_successful_fetch` without worrying about timeouts, whilst maintaining backwards compatibility for other Datastore API calls.
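(For reference, a hedged usage sketch of the new endpoint — the `/api/1` mount point, local port, and response shape are assumptions based on the snippet above, not confirmed against a deployment:)

```python
# Query the new nested datasets endpoint and report how many entries came back.
import requests

resp = requests.get("http://localhost:5000/api/1/about/datasets/nest")
resp.raise_for_status()
datasets = resp.json()
print(len(datasets), "datasets with resources")
```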