cmu-delphi / delphi-epidata

An open API for epidemiological data.
https://cmu-delphi.github.io/delphi-epidata/
MIT License
101 stars 68 forks source link

Convert covidcast signal OR clauses to UNION #763

Open krivard opened 2 years ago

krivard commented 2 years ago

The following query returns 248 rows and takes 30 minutes to compute:

SELECT t.geo_type, t.geo_value, t.source, t.signal, t.time_value, t.value 
FROM covidcast t 
WHERE ((t.source = 'jhu-csse' 
AND (t.signal = 'confirmed_7dav_incidence_prop' OR t.signal = 'deaths_7dav_incidence_prop'))) 
AND ((t.geo_type = 'nation' AND (t.geo_value = 'us'))) 
AND ((t.time_type = 'day' AND (t.time_value BETWEEN 20210706 AND 20211106))) 
AND (t.is_latest_issue IS TRUE) 
ORDER BY t.geo_type ASC, t.geo_value ASC, t.source ASC, t.signal ASC, t.time_value ASC 
LIMIT 10000001;

Either one of the signals on its own runs in less than 1s each, as does the UNION of the two:

SELECT t.geo_type, t.geo_value, t.source, t.signal, t.time_value, t.value 
FROM covidcast t 
WHERE ((t.source = 'jhu-csse' AND (t.signal = 'confirmed_7dav_incidence_prop'))) 
AND ((t.geo_type = 'nation' AND (t.geo_value = 'us'))) 
AND ((t.time_type = 'day' AND (t.time_value BETWEEN 20210706 AND 20211106))) 
AND (t.is_latest_issue IS TRUE) 
UNION ALL 
SELECT t.geo_type, t.geo_value, t.source, t.signal, t.time_value, t.value 
FROM covidcast t 
WHERE ((t.source = 'jhu-csse' AND (t.signal = 'deaths_7dav_incidence_prop'))) 
AND ((t.geo_type = 'nation' AND (t.geo_value = 'us'))) 
AND ((t.time_type = 'day' AND (t.time_value BETWEEN 20210706 AND 20211106))) 
AND (t.is_latest_issue IS TRUE) 
ORDER BY geo_type ASC, geo_value ASC, source ASC, `signal` ASC, time_value ASC 
LIMIT 10000001;

If possible, we should switch how we build multiple-signal queries to use UNION instead of OR.

krivard commented 2 years ago

Here is where we parse source-signal pairs out of the API request: https://github.com/cmu-delphi/delphi-epidata/blob/b8aaab83e43ae7b13a8879f09c5d7faa175efc80/src/server/endpoints/covidcast.py#L46-L57

Here is where we set the source-signal filters in the WHERE clause of the SQL query we're building: https://github.com/cmu-delphi/delphi-epidata/blob/b8aaab83e43ae7b13a8879f09c5d7faa175efc80/src/server/endpoints/covidcast.py#L156

We're currently just doing an append, which treats multiple source-signal pairs like an OR: https://github.com/cmu-delphi/delphi-epidata/blob/b8aaab83e43ae7b13a8879f09c5d7faa175efc80/src/server/_query.py#L434-L452

Here is where a list of conditions get turned into an AND of ORs: https://github.com/cmu-delphi/delphi-epidata/blob/b8aaab83e43ae7b13a8879f09c5d7faa175efc80/src/server/_query.py#L360-L366

...where we want signals in particular to turn into UNIONs

krivard commented 2 years ago

Tests of the server code should live in here if they do not access a database:

https://github.com/cmu-delphi/delphi-epidata/tree/dev/tests/server

and in here if they do access a database:

https://github.com/cmu-delphi/delphi-epidata/tree/dev/integrations/server

melange396 commented 2 months ago

we now have a different schema with better indexes compared to when this issue was created. its not clear that this change will still be effective.