cmu-delphi / delphi-epidata

An open API for epidemiological data.
https://cmu-delphi.github.io/delphi-epidata/
MIT License
100 stars 62 forks source link

insert_or_update can have problems with multiple "issue"s #1427

Open melange396 opened 2 months ago

melange396 commented 2 months ago

insert_or_update() (which is really both insert_or_update_batch() and insert_or_update_bulk() right now; they should be resolved into one in #989) can have problems when given a set of rows that includes more than one issue. It is not that multiple issue values themselves are problematic, but the desired result may not be achieved when there are multiple rows for the same source/signal/time/geo that have different issues.

The problem stems from the fact that the SQL code in fix_is_latest_issue_sql kinda naively inherently assumes only one issue will be present for each (source, signal, geo_type, geo_value, time_type, time_value). It should be noted that it sort of presumes most rows inserted will be a latest issue, which is why the is_latest_issue column defaults to 1 in insert_into_loader_sql. It should also be noted that "overwrites" will happen when load.issue>latest.issue, but ALSO when they are ==, which lets us update/patch previously-written values that may have been incorrect (then the SQL that moves the load table data to latest and history tables has an ON DUPLICATE clause that preserves the unique key epimetric_id for the same unique composite key tuple).

In practice, the problematic behavior does not happen because the only time insert_or_update() is called (in csv_to_database.py:upload_archive()), it is handed rows returned by CsvImporter.load_csv(), which only uses a single issue value at a time. However, it is possible that we might write something else in the future that does not account for this expected invariance, so we should take steps to resolve it now.

Non-compliant calls to insert_or_update() can still happen in test code via calls to _insert_rows(). This github search provides a list of them.

Possible ways to address this:

Doing unit/integration testing for these conditions should be ~easy, and some work on this appears to have already been done in #925