codeforIATI / iati-datastore

An open-source datastore for IATI data with RESTful web API providing XML, JSON, CSV plus ETL tools
https://datastore.codeforiati.org
Other
1 stars 1 forks source link

Transactions CSV (and Excel) is slow: Avoid repeated queries #346

Open odscjames opened 2 years ago

odscjames commented 2 years ago

Describe the bug

Transactions CSV (and Excel) is slow

Issue

For every hit on the transaction, we get lots of individual DB queries:

2022-02-22 11:26:12.415 UTC [6307] app@app LOG:  duration: 0.176 ms  statement: SELECT country_percentage.id AS country_percentage_id, country_percentage.percentage AS country_percentage_percentage, country_percentage.name AS country_percentage_name, country_percentage.country AS country_percentage_country, country_percentage.activity_id AS country_percentage_activity_id, country_percentage.transaction_id AS country_percentage_transaction_id 
    FROM country_percentage 
    WHERE 8005 = country_percentage.transaction_id
2022-02-22 11:26:12.416 UTC [6307] app@app LOG:  duration: 0.160 ms  statement: SELECT region_percentage.id AS region_percentage_id, region_percentage.percentage AS region_percentage_percentage, region_percentage.name AS region_percentage_name, region_percentage.region AS region_percentage_region, region_percentage.activity_id AS region_percentage_activity_id, region_percentage.transaction_id AS region_percentage_transaction_id 
    FROM region_percentage 
    WHERE 8005 = region_percentage.transaction_id
2022-02-22 11:26:12.418 UTC [6307] app@app LOG:  duration: 0.324 ms  statement: SELECT sector_percentage.id AS sector_percentage_id, sector_percentage.text AS sector_percentage_text, sector_percentage.activity_id AS sector_percentage_activity_id, sector_percentage.transaction_id AS sector_percentage_transaction_id, sector_percentage.sector AS sector_percentage_sector, sector_percentage.vocabulary AS sector_percentage_vocabulary, sector_percentage.percentage AS sector_percentage_percentage 
    FROM sector_percentage 
    WHERE 8005 = sector_percentage.transaction_id

Those 3 queries are repeated heavily.

These python functions get the extra data:

There are also some queries to get extra info on organisation, participation and activity tables that did not repeat in my test data set (I think because SQLLachemy caches?), but should repeat on larger data sets.

2022-02-22 11:26:12.057 UTC [6307] app@app LOG:  duration: 2.174 ms  statement: SELECT activity.iati_identifier AS activity_iati_identifier, activity.hierarchy AS activity_hierarchy, activity.default_language AS activity_default_language, activity.last_updated_datetime AS activity_last_updated_datetime, activity.last_change_datetime AS activity_last_change_datetime, activity.resource_url AS activity_resource_url, activity.reporting_org_id AS activity_reporting_org_id, activity.start_planned AS activity_start_planned, activity.start_actual AS activity_start_actual, activity.end_planned AS activity_end_planned, activity.end_actual AS activity_end_actual, activity.title AS activity_title, activity.description AS activity_description, activity.default_currency AS activity_default_currency, activity.raw_xml AS activity_raw_xml, activity.version AS activity_version, activity.major_version AS activity_major_version, activity.created AS activity_created, activity.activity_status AS activity_activity_status, activity.collaboration_type AS activity_collaboration_type, activity.default_finance_type AS activity_default_finance_type, activity.default_flow_type AS activity_default_flow_type, activity.default_aid_type AS activity_default_aid_type, activity.default_tied_status AS activity_default_tied_status 
    FROM activity 
    WHERE activity.iati_identifier = '1301-4.0000'
2022-02-22 11:26:12.081 UTC [6307] app@app LOG:  duration: 0.211 ms  statement: SELECT organisation.id AS organisation_id, organisation.ref AS organisation_ref, organisation.name AS organisation_name, organisation.type AS organisation_type 
    FROM organisation 
    WHERE organisation.id = 145
2022-02-22 11:26:12.084 UTC [6307] app@app LOG:  duration: 0.479 ms  statement: SELECT participation.activity_identifier AS participation_activity_identifier, participation.organisation_id AS participation_organisation_id, participation.role AS participation_role 
    FROM participation 
    WHERE '1301-4.0000' = participation.activity_identifier

Possible solutions

These actual DB queries are really fast as you can see above, but they do add up and there are a lot of them.

If we are lucky some joining on the original query to avoid later queries will really boost this.

odscjames commented 2 years ago

First pull request done. However there is more we might be able to do here, there are still secondary SQLs happening.

In particular, activity is joined to the main transaction SQL just for sorting then not selected, so secondary SQLs are made to load the activity later! Adding orm.joinedload(Transaction.activity), would probably be good here but then the sort by starts to complain, so that needs more attention. Note that many transactions may have one activity here, so it's less certain there will be good gains here but it seems worth trying.

odscjames commented 2 years ago

https://github.com/codeforIATI/iati-datastore/pull/350 speeds up by_sector & by_country.

When doing the by_country one, there are still a lot of SQL fetches for individual organisations. Maybe there is another quick speed up possible here.