cmu-delphi / delphi-epidata

An open API for epidemiological data.
https://cmu-delphi.github.io/delphi-epidata/

previously unseen `geocoded_state` values broke covid_hosp_state_timeseries acquisition #1420

Closed: melange396 closed this 1 month ago

melange396 commented 2 months ago

The "Update COVID Hosp: state timeseries dataset" acquisition job has been failing since apr 4 2024; first failure in cronicle: https://cronicle-prod-01.delphi.cmu.edu/#JobDetails?id=jluljl1fnlb

Error message from cronicle:

{"url":"https://healthdata.gov/api/views/qqte-vkut/rows.csv","event":"fetching metadata","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:02.164180Z"}
{"newer_than":"Timestamp('2024-04-02 00:00:00')","older_than":"datetime.date(2024, 4, 5)","event":"looking up issues in date range","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:12.430783Z"}
{"newer_than":"2024-04-02 00:00:00","older_than":"2024-04-05","count":2,"event":"issues selected","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:12.434811Z"}
{"event":"already collected revision: https://us-dhhs-aa.s3.us-east-2.amazonaws.com/g62h-syeh_2024-04-03T12-06-56.csv","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:12.443171Z"}
{"event":"including dataset revision: https://us-dhhs-aa.s3.us-east-2.amazonaws.com/g62h-syeh_2024-04-04T07-01-48.csv","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:12.451599Z"}
{"url":"https://us-dhhs-aa.s3.us-east-2.amazonaws.com/g62h-syeh_2024-04-04T07-01-48.csv","event":"fetching dataset","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:12.451892Z"}
{"event":"2 issues checked w/ 2 revisions, resulting in 1 datasets.","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:16.100503Z"}
{"count":80201,"event":"updating values","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:16.193757Z"}
{"publ_date":20240404,"in_lines":[0,5000],"index":4999,"values":["PR",20210430,0.138728323699422,70,692,96,0.66804979253112,71,723,483,0,71,0,0,71,0,7,70,"POINT (-66.468953431 18.222606498)",2,70,1,70,0.052334784329244,70,10108,529,10285,71,6310,71,529,70,0.613514827418571,71,10285,6310,null,null,null,0.085144052792532,70,6213,529,25,0,71,1,71,3,71,6,71,6,71,4,71,2,71,3,71,71,0,71,18,0,70,0,70,0,70,3,70,3,70,2,70,4,70,3,70,70,3,70,4,70,0,70,12,70,0,70,0,70,null,null,null,483,71,96,70,67,71,479,70,451,71,25,70,50,70,25,70,723,71],"exception":"DataError(1406, \"1406 (22001): Data too long for column 'geocoded_state' at row 1\", '22001')","event":"error on insert","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"error","pid":92518,"timestamp":"2024-04-04T18:00:18.773557Z"}
{"scope":"sys","event":"Top-level exception occurred","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"error","pid":92518,"timestamp":"2024-04-04T18:00:18.773974Z","exception":"Traceback (most recent call last):\n  File \"/home/automation/.pyenv/versions/3.8.2/lib/python3.8/site-packages/mysql/connector/connection_cext.py\", line 487, in cmd_query\n    self._cmysql.query(query,\n_mysql_connector.MySQLInterfaceError: Data too long for column 'geocoded_state' at row 1\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/home/automation/.pyenv/versions/3.8.2/lib/python3.8/runpy.py\", line 193, in _run_module_as_main\n    return _run_code(code, main_globals, None,\n  File \"/home/automation/.pyenv/versions/3.8.2/lib/python3.8/runpy.py\", line 86, in _run_code\n    exec(code, run_globals)\n  File \"/home/automation/driver/delphi/epidata/acquisition/covid_hosp/state_timeseries/update.py\", line 42, in <module>\n    Utils.launch_if_main(Update.run, __name__)\n  File \"/home/automation/driver/delphi/epidata/acquisition/covid_hosp/common/utils.py\", line 38, in launch_if_main\n    entrypoint()\n  File \"/home/automation/driver/delphi/epidata/acquisition/covid_hosp/state_timeseries/update.py\", line 38, in run\n    return Utils.update_dataset(Database, network)\n  File \"/home/automation/driver/delphi/epidata/acquisition/covid_hosp/common/utils.py\", line 262, in update_dataset\n    db.insert_dataset(issue_int, dataset, logger=logger)\n  File \"/home/automation/driver/delphi/epidata/acquisition/covid_hosp/common/database.py\", line 231, in insert_dataset\n    raise e\n  File \"/home/automation/driver/delphi/epidata/acquisition/covid_hosp/common/database.py\", line 225, in insert_dataset\n    cursor.executemany(sql, many_values)\n  File \"/home/automation/.pyenv/versions/3.8.2/lib/python3.8/site-packages/mysql/connector/cursor_cext.py\", line 357, in executemany\n    self.execute(operation, params)\n  File \"/home/automation/.pyenv/versions/3.8.2/lib/python3.8/site-packages/mysql/connector/cursor_cext.py\", line 264, in execute\n    result = self._cnx.cmd_query(stmt, raw=self._raw,\n  File \"/home/automation/.pyenv/versions/3.8.2/lib/python3.8/site-packages/mysql/connector/connection_cext.py\", line 491, in cmd_query\n    raise errors.get_mysql_exception(exc.errno, msg=exc.msg,\nmysql.connector.errors.DataError: 1406 (22001): Data too long for column 'geocoded_state' at row 1"}

The offending csv file is https://us-dhhs-aa.s3.us-east-2.amazonaws.com/g62h-syeh_2024-04-04T07-01-48.csv

In our TABLE covid_hosp_state_timeseries, COLUMN geocoded_state is of type VARCHAR(32).
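For confirmation: "POINT (-66.468953431 18.222606498)" is 34 characters, two more than VARCHAR(32) allows. A throwaway script of mine (not repo code; it assumes the CSV header includes columns named state, date, and geocoded_state) that finds all such over-long values in a downloaded file:

import csv

LIMIT = 32  # geocoded_state is VARCHAR(32) in covid_hosp_state_timeseries

with open("g62h-syeh_2024-04-04T07-01-48.csv", newline="") as f:
    # assumed column names: state, date, geocoded_state
    for row in csv.DictReader(f):
        value = row.get("geocoded_state") or ""
        if len(value) > LIMIT:
            # e.g. "POINT (-66.468953431 18.222606498)" -> 34 chars
            print(row["state"], row["date"], len(value), repr(value))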

The most recently successfully imported CSV has nulls/empties in its geocoded_state column, as do the CSV issued immediately after the offending file and the most recently issued CSV. Only the one CSV appears to contain these long "POINT (..." values. I downloaded the four CSVs mentioned and ran the following commands to demonstrate:

$ ls -laF g62h-syeh_2024-04-*
-rw-r--r--  1 haff  staff  37787646 Apr 22 13:52 g62h-syeh_2024-04-03T12-06-56.csv
-rw-r--r--  1 haff  staff  40484359 Apr 22 13:53 g62h-syeh_2024-04-04T07-01-48.csv
-rw-r--r--  1 haff  staff  37961499 Apr 22 14:07 g62h-syeh_2024-04-11T15-31-38.csv
-rw-r--r--  1 haff  staff  38136294 Apr 22 14:06 g62h-syeh_2024-04-19T12-06-57.csv

$ grep -l POINT g62h-syeh_2024-04-*
g62h-syeh_2024-04-04T07-01-48.csv

$ grep "PR,2021/04/30" g62h-syeh_2024-04-03T12-06-56.csv 
PR,2021/04/30,0,0,71,0,0,71,2,70,10285,71,6310,71,529,70,25,71,18,70,0,70,12,70,483,71,96,70,67,71,479,70,451,71,50,70,25,70,723,71,0.613514827418571,71,6310,10285,0.085144052792532,70,529,6213,0.052334784329244,70,529,10108,0.138728323699422,70,96,692,0.66804979253112,71,483,723,,0,71,1,71,3,71,6,71,6,71,4,71,2,71,3,71,0,71,0,70,0,70,0,70,3,70,3,70,2,70,4,70,3,70,3,70,7,70,,,,,,,1,70,4,70,0,70,0,70,25,13,70,70,62,8,196,8,,0,,0,,0,,0,,0,12,8,29,8

$ grep "PR,2021/04/30" g62h-syeh_2024-04-04T07-01-48.csv 
PR,2021/04/30,0,0,71,0,0,71,2,70,10285,71,6310,71,529,70,25,71,18,70,0,70,12,70,483,71,96,70,67,71,479,70,451,71,50,70,25,70,723,71,0.613514827418571,71,6310,10285,0.085144052792532,70,529,6213,0.052334784329244,70,529,10108,0.138728323699422,70,96,692,0.66804979253112,71,483,723,POINT (-66.468953431 18.222606498),0,71,1,71,3,71,6,71,6,71,4,71,2,71,3,71,0,71,0,70,0,70,0,70,3,70,3,70,2,70,4,70,3,70,3,70,7,70,,,,,,,1,70,4,70,0,70,0,70,25,13,70,70,62,8,196,8,,0,,0,,0,,0,,0,12,8,29,8
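As an aside, a more durable fix than skipping whole files would be to guard the insert path against over-long values, so a single malformed revision cannot abort the entire acquisition run. A minimal sketch, assuming a hook before the executemany() call in common/database.py; the constant and helper below are hypothetical, not existing repo code:

GEOCODED_STATE_MAX = 32  # matches VARCHAR(32) in covid_hosp_state_timeseries

def sanitize_geocoded_state(value):
    # Hypothetical guard: null out any value that cannot fit in the
    # geocoded_state column rather than failing the whole insert batch.
    if value is not None and len(value) > GEOCODED_STATE_MAX:
        return None
    return value

# usage sketch: applied to each row's geocoded_state field before insert,
# e.g. row[geocoded_state_index] = sanitize_geocoded_state(row[geocoded_state_index])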

Attempt to skip the offending file, similar to the solution in https://github.com/cmu-delphi/delphi-epidata/issues/1369 :

INSERT INTO covid_hosp_meta (dataset_name, publication_date, revision_timestamp, hhs_dataset_id, acquisition_datetime, metadata_json) 
VALUES ('covid_hosp_state_timeseries', 20240404, 'https://us-dhhs-aa.s3.us-east-2.amazonaws.com/g62h-syeh_2024-04-04T07-01-48.csv', 'g62h-syeh', 
        '2024-04-21 00:00:00', '{"error":"semi-malformed data file, includes geo point encodings too long for our data column: geocoded_state"}'); 

After adding that row to the covid_hosp_meta table, I re-ran the Cronicle job and am now awaiting the results.
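This works because, as the "already collected revision" log lines above show, the acquisition job consults covid_hosp_meta and skips any revision whose URL it has already recorded in revision_timestamp. A rough sketch of that check (the real query lives in delphi/epidata/acquisition/covid_hosp/common/ and may differ):

def is_already_collected(cursor, dataset_name, revision_url):
    # A revision is skipped if its URL already appears as a
    # revision_timestamp for this dataset in covid_hosp_meta.
    cursor.execute(
        "SELECT 1 FROM covid_hosp_meta"
        " WHERE dataset_name = %s AND revision_timestamp = %s"
        " LIMIT 1",
        (dataset_name, revision_url))
    return cursor.fetchone() is not None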

melange396 commented 2 months ago

Some discussion of this took place on Slack: https://delphi-org.slack.com/archives/C013AH5N01E/p1713806839980759

melange396 commented 2 months ago

The Cronicle job succeeded after a ~7.5 minute runtime: https://cronicle-prod-01.delphi.cmu.edu/#JobDetails?id=jlvbb7xtax6

Now waiting for the subsequently triggered "HHS" indicator job (https://cronicle-prod-01.delphi.cmu.edu/#JobDetails?id=jlvbbhj50x7), which should itself trigger a regular acquisition job.

melange396 commented 2 months ago

All done. Here is our dashboard showing HHS data for Apr 12: https://delphi.cmu.edu/covidcast/indicator/?sensor=hhs-confirmed_admissions_influenza_1d_prop_7dav&date=20240412