cmu-delphi / delphi-epidata

An open API for epidemiological data.
https://cmu-delphi.github.io/delphi-epidata/
MIT License
101 stars 63 forks source link

Clean covid_hosp_facility hospital_subtype data #1007

Open krivard opened 1 year ago

krivard commented 1 year ago

I was looking for a complete list of all hospital subtypes for the documentation update, and found some alarming miscoded data:

mysql>  select distinct hospital_subtype from covid_hosp_facility;
[...]
| 34011F                                                           |
| 457736                                                           |
| 05T782                                                           |
| 405025                                                           |
| 676622                                                           |
| 8dcb4c99-fb3e-4c99-95d7-bdcf1253268d                             |
| 230118                                                           |
| 9f7c7842-12dc-4d6e-b4f6-a864ed471d99                             |
| 02fcc4218077f123aea96b1ccee7688249ed732178bd06ef73d8b227533bc19a |
| 140250                                                           |
| 17A020                                                           |
| 320016                                                           |
| 362035                                                           |
| 36T059                                                           |
| 555143                                                           |
| 677297                                                           |
| dead8933dab2783b14e9cb678ac9c55a70cb9ff5d69cf3dc4cce866ab52e7d8a |
| 040155                                                           |
| 140307                                                           |
| 1be1e598889c91e1d8798795da6e7883cf2b2e8acb4311563c474c54c7322bb0 |
| 333562                                                           |
| 390130                                                           |
| 39T142                                                           |
| 443459                                                           |
| 490001                                                           |
| 8148f640150c54b1ee6c0d275e2f907d61782176d6ee21fbce88fd076477d211 |
| e9947863f76de4aa0145ae11b279480dae1a31a303177d19b6d7bcd2c0ebac18 |
| b2937a1082df4731d7489c512c9155927bb390a498502997939ac4dd8308ffc0 |
| 110252                                                           |
| 370243                                                           |
| e1a5f1ded22a21b1cf6d9d5ff579edbecd8492f810dfbb6108a545f59e758c6c |
+------------------------------------------------------------------+
5109 rows in set (5 min 20.12 sec)

This should have been something like Childrens Hospitals, Critical Access Hospitals, Long Term, Psychiatric, Rehabilitation, Short Term; the actual values look more like hospital_pks (CCNs where the hospital has one, a giant hash if they don't).

We should

krivard commented 1 year ago

Starting id: 12004206 Ending id: 19561068 Affected rows: 146867 Currently active: No

Additional details (affected publication dates, range of collection weeks, etc) below:

mysql> select id, publication_date as pd, collection_week as cw, address, ccn, city, fips_code as fips, geocoded_hospital_address as gha, hhs_ids as hhs, hospital_name, hospital_pk, hospital_subtype as subtype, is_metro_micro as m, state as st, zip from covid_hosp_facility where id>11309140 limit 10;
+----------+----------+----------+-----------------------------------------------+----------+-----------+-------+------+------+----------------------------------------+--------------------+------------+------+----+-------+
| id       | pd       | cw       | address                                       | ccn      | city      | fips  | gha  | hhs  | hospital_name                          | hospital_pk        | subtype    | m    | st | zip   |
+----------+----------+----------+-----------------------------------------------+----------+-----------+-------+------+------+----------------------------------------+--------------------+------------+------+----+-------+
| 11309141 | 20211018 | 20200731 | CALLE ISAAC GONZALEZ ESQUINA LEDESMA          | 400130   | BDA NUEVA | 72141 | NULL | NULL | HOSPITAL METROPOLITANO DE LA MONTANA   | 400130             | Short Term |    1 | PR | 00641 |
| 11309142 | 20211018 | 20200731 | CARR. 2 KM 39.5    ROAD NUMBER 2 BO ALGARROBO | 400115   | VEGA BAJA | 72145 | NULL | NULL | CENTRO MEDICO WILMA N VAZQUEZ          | 400115             | Short Term |    1 | PR | 00693 |
| 11309143 | 20211018 | 20200731 | ROAD 128 KM 1.0                               | 400110   | YAUCO     | 72153 | NULL | NULL | HOSPITAL METROPOLITANO DR TITO MATTEI  | 400110             | Short Term |    1 | PR | 00698 |
| 11309144 | 20211018 | 20200731 | #4007 EST DIAMOND RUBY, CHRISTIANSTED         | 480002   | ST CROIX  | 78010 | NULL | NULL | GOV JUAN F LUIS HOSPITAL & MEDICAL CTR | 480002             | Short Term |    0 | VI | 00820 |
| 11309145 | 20211018 | 20200731 | 9048 SUGAR ESTATE                             | 480001   | ST THOMAS | 78020 | NULL | NULL | ROY LESTER SCHNEIDER HOSPITAL,THE      | 480001             | Short Term |    0 | VI | 00801 |
| 12004206 | 20210503 |   670300 | POINT (-96.979761 32.475684)                  | 20210423 | NULL      | Short | 0    | 0    | 48139                                  | 1201 E HIGHWAY 287 | 670300     |    1 | MI | METHO |
| 12044473 | 20210503 |   451396 | POINT (-98.677996 30.743345)                  | 20210226 | NULL      | Criti | 0    | 0    | 48299                                  | 200 W OLLIE        | 451396     |    0 | LL | BAYLO |
| 12063510 | 20210503 |   362015 | POINT (-81.534829 41.395457)                  | 20210129 | NULL      | Long  | 0    | 0    | 39035                                  | 44 BLAINE AVENUE   | 362015     |    1 | BE | GRACE |
| 12065112 | 20210503 |   520213 | POINT (-88.109045 42.958212)                  | 20210129 | NULL      | Short | 0    | 0    | 55133                                  | 4805 S MOORLAND RD | 520213     |    1 | NE | FROED |
| 12070857 | 20210503 |    62014 | POINT (-104.979305 39.849815)                 | 20210115 | NULL      | Long  | 0    | 0    | 08001                                  | 8451 PEARL ST      | 062014     |    1 | TH | VIBRA |
+----------+----------+----------+-----------------------------------------------+----------+-----------+-------+------+------+----------------------------------------+--------------------+------------+------+----+-------+
10 rows in set (0.00 sec)

mysql> select id, publication_date as pd, collection_week as cw, address, ccn, city, fips_code as fips, geocoded_hospital_address as gha, hhs_ids as hhs, hospital_name, hospital_pk, hospital_subtype as subtype, is_metro_micro as m, state as st, zip from covid_hosp_facility where id>19561060 limit 10;
+----------+----------+----------+---------------------+----------+-------------+-------+------------------------------+-------------+----------------------+-----------------------------------------------+---------------------+------+----+-------+
| id       | pd       | cw       | address             | ccn      | city        | fips  | gha                          | hhs         | hospital_name        | hospital_pk                                   | subtype             | m    | st | zip   |
+----------+----------+----------+---------------------+----------+-------------+-------+------------------------------+-------------+----------------------+-----------------------------------------------+---------------------+------+----+-------+
| 19561061 | 20211018 |   400133 | NULL                | 20200731 | [C400133-A] | Short | NULL                         | NULL        | 72127                | CARR 844 KM 5 CUPEY BAJO                      | 400133              |    1 | SA | HIMA  |
| 19561062 | 20211018 |   400106 | NULL                | 20200731 | [C400106-A] | Short | NULL                         | NULL        | 72127                | CARRETERA 21 1785 URB LAS LOMAS               | 400106              |    1 | SA | METRO |
| 19561063 | 20211018 |   400130 | NULL                | 20200731 | [C400130-A] | Short | NULL                         | NULL        | 72141                | CALLE ISAAC GONZALEZ ESQUINA LEDESMA          | 400130              |    1 | BD | HOSPI |
| 19561064 | 20211018 |   400115 | NULL                | 20200731 | [C400115-A] | Short | NULL                         | NULL        | 72145                | CARR. 2 KM 39.5    ROAD NUMBER 2 BO ALGARROBO | 400115              |    1 | VE | CENTR |
| 19561065 | 20211018 |   400110 | NULL                | 20200731 | [C400110-A] | Short | NULL                         | NULL        | 72153                | ROAD 128 KM 1.0                               | 400110              |    1 | YA | HOSPI |
| 19561066 | 20211018 |   480002 | NULL                | 20200731 | [C480002-A] | Short | NULL                         | NULL        | 78010                | #4007 EST DIAMOND RUBY, CHRISTIANSTED         | 480002              |    0 | ST | GOV J |
| 19561067 | 20211018 |   480001 | NULL                | 20200731 | [C480001-A] | Short | NULL                         | NULL        | 78020                | 9048 SUGAR ESTATE                             | 480001              |    0 | ST | ROY L |
| 19561068 | 20211025 | 20211008 | 45 WEST 10TH STREET | 240063   | SAINT PAUL  | 27123 | POINT (-93.100127 44.949062) | [C240063-A] | ST JOSEPH'S HOSPITAL | 240063                                        | Short Term          |    1 | MN | 55102 |
| 19561069 | 20211025 | 20211008 | 2501 W 26TH ST      | 433300   | SIOUX FALLS | 46099 | POINT (-96.756208 43.529458) | [C433300-A] | LIFESCAPE            | 433300                                        | Childrens Hospitals |    1 | SD | 57105 |
| 19561070 | 20211025 | 20211008 | 3134 BRIARCRESRT DR | 670259   | BRYAN       | 48041 | POINT (-96.323992 30.666329) | [C670259-A] | CAPROCK HOSPITAL     | 670259                                        | Short Term          |    1 | TX | 77802 |
+----------+----------+----------+---------------------+----------+-------------+-------+------------------------------+-------------+----------------------+-----------------------------------------------+---------------------+------+----+-------+
10 rows in set (0.00 sec)

mysql> select count(1) from covid_hosp_facility where id between 12004206 and 19561060;
+----------+
| count(1) |
+----------+
|   146867 |
+----------+
1 row in set (0.10 sec)

mysql> select id, publication_date as pd, collection_week as cw, address, ccn, city, fips_code as fips, geocoded_hospital_address as gha, hhs_ids as hhs, hospital_name, hospital_pk, hospital_subtype as subtype, is_metro_micro as m, state as st, zip from covid_hosp_facility where id>44000000 order by -id limit 10;
+----------+----------+----------+----------------------------------+--------+----------------+-------+-------------------------------+-------------+------------------------------------------------------+------------------------------------------------------------------+------------+------+----+-------+
| id       | pd       | cw       | address                          | ccn    | city           | fips  | gha                           | hhs         | hospital_name                                        | hospital_pk                                                      | subtype    | m    | st | zip   |
+----------+----------+----------+----------------------------------+--------+----------------+-------+-------------------------------+-------------+------------------------------------------------------+------------------------------------------------------------------+------------+------+----+-------+
| 44581864 | 20221024 | 20200724 | NULL                             | NULL   | NULL           | NULL  | POINT (-149.798948 61.245888) | NULL        | Elmendorf                                            | e1a5f1ded22a21b1cf6d9d5ff579edbecd8492f810dfbb6108a545f59e758c6c | Short Term |    0 | AK | 99506 |
| 44581863 | 20221024 | 20200403 | NULL                             | NULL   | NULL           | NULL  | POINT (-91.763693 30.024779)  | NULL        | Iberia Medical Center- North                         | b2937a1082df4731d7489c512c9155927bb390a498502997939ac4dd8308ffc0 | Short Term |    0 | LA | 70563 |
| 44581862 | 20221024 | 20200515 | 37358 Market Place Dr            | NULL   | Prairieville   | 22005 | POINT (-90.978579 30.317488)  | NULL        | Prairieville Family Hospital                         | 764e4176df9f3b110d69f34074f855a151462207109d55c7115326ac3ad0dc5a | Short Term |    1 | LA | 70769 |
| 44581861 | 20221024 | 20200703 | 2600 EAST  PFLUGERVILLE  PARKWAY | 670128 | PFLUGERVILLE   | 48453 | POINT (-97.587253 30.448194)  | [C670128-A] | BAYLOR SCOTT & WHITE MEDICAL CENTER – PFLUGERVILLE   | 670128                                                           | Short Term |    1 | TX | 78660 |
| 44581860 | 20221024 | 20200522 | 2600 EAST  PFLUGERVILLE  PARKWAY | 670128 | PFLUGERVILLE   | 48453 | POINT (-97.587253 30.448194)  | [C670128-A] | BAYLOR SCOTT & WHITE MEDICAL CENTER – PFLUGERVILLE   | 670128                                                           | Short Term |    1 | TX | 78660 |
| 44581859 | 20221024 | 20200522 | 5252 WEST UNIVERSITY DRIVE       | 670082 | MC KINNEY      | 48085 | POINT (-96.681888 33.218394)  | [C670082-A] | BAYLOR SCOTT AND WHITE  MEDICAL CENTER  MCKINNEY     | 670082                                                           | Short Term |    1 | TX | 75071 |
| 44581858 | 20221024 | 20200619 | 850 W CENTRAL TEXAS EXPRESSWAY   | 670080 | HARKER HEIGHTS | 48027 | POINT (-97.683707 31.074292)  | [C670080-A] | SETON MEDICAL CENTER HARKER HEIGHTS                  | 670080                                                           | Short Term |    1 | TX | 76548 |
| 44581857 | 20221024 | 20200605 | 231 SOUTH COLLINS ROAD           | 670060 | SUNNYVALE      | 48113 | POINT (-96.567178 32.77834)   | [C670060-A] | BAYLOR SCOTT AND WHITE MEDICAL CENTER SUNNYVALE      | 670060                                                           | Short Term |    1 | TX | 75182 |
| 44581856 | 20221024 | 20200612 | 1401 MEDICAL PARKWAY             | 670043 | CEDAR PARK     | 48491 | POINT (-97.81231 30.53078)    | [C670043-A] | CEDAR PARK REGIONAL MEDICAL CENTER                   | 670043                                                           | Short Term |    1 | TX | 78613 |
| 44581855 | 20221024 | 20200522 | 201 SETON PARKWAY                | 670041 | ROUND ROCK     | 48491 | POINT (-97.65277 30.566626)   | [C670041-A] | ASCENSION SETON WILLIAMSON                           | 670041                                                           | Short Term |    1 | TX | 78664 |
+----------+----------+----------+----------------------------------+--------+----------------+-------+-------------------------------+-------------+------------------------------------------------------+------------------------------------------------------------------+------------+------+----+-------+
10 rows in set (0.87 sec)

mysql> select min(publication_date), max(publication_date), min(ccn), max(ccn) from covid_hosp_facility where id between 12004206 and 19561060;
+-----------------------+-----------------------+----------+----------+
| min(publication_date) | max(publication_date) | min(ccn) | max(ccn) |
+-----------------------+-----------------------+----------+----------+
|              20210503 |              20211018 | 20191227 | 20210924 |
+-----------------------+-----------------------+----------+----------+
1 row in set (0.19 sec)

mysql> select distinct publication_date from covid_hosp_facility where id between 12004206 and 19561060;
+------------------+
| publication_date |
+------------------+
|         20210503 |
|         20210508 |
|         20210510 |
|         20210513 |
|         20210517 |
|         20210524 |
|         20210601 |
|         20210607 |
|         20210615 |
|         20210621 |
|         20210628 |
|         20210705 |
|         20210712 |
|         20210719 |
|         20210726 |
|         20210802 |
|         20210809 |
|         20210816 |
|         20210823 |
|         20210830 |
|         20210902 |
|         20210907 |
|         20210913 |
|         20210920 |
|         20210921 |
|         20210927 |
|         20211004 |
|         20211011 |
|         20211018 |
+------------------+
29 rows in set (0.12 sec)
krivard commented 1 year ago

Test run of re-ingesting the affected issues completed in 10 hours

# time PYTHONPATH=. python repos/delphi/delphi-epidata/fill_covid_hosp_issues.py
fetching metadata at https://healthdata.gov/api/views/j4ip-wfsv/rows.csv
fetching dataset at https://healthdata.gov/api/views/j4ip-wfsv/rows.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-05-03T10-17-30.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-05-03T20-31-27.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-05-08T20-16-23.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-05-10T15-13-48.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-05-13T17-07-05.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-05-13T17-12-38.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-05-13T17-28-33.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-05-17T08-26-41.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-05-24T10-27-59.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-06-01T14-45-07.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-06-07T12-00-47.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-06-15T07-19-17.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-06-21T10-26-37.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-06-21T10-31-45.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-06-28T11-08-34.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-07-05T12-07-34.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-07-12T11-28-53.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-07-19T09-20-31.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-07-26T09-37-22.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-07-26T09-46-00.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-08-02T10-08-20.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-08-02T10-14-52.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-08-09T10-13-43.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-08-16T10-29-45.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-08-23T10-16-37.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-08-30T11-31-55.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-09-02T09-49-25.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-09-07T10-22-56.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-09-07T10-33-17.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-09-13T09-44-53.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-09-13T10-06-05.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-09-20T13-48-07.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-09-21T10-06-07.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-09-27T12-19-38.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-10-04T12-22-25.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-10-04T12-34-18.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-10-11T10-06-55.csv
fetching dataset at https://us-dhhs-aa.s3.us-east-2.amazonaws.com/anag-cw7u_2021-10-18T12-13-56.csv
successfully acquired 194414 rows
successfully acquired 203767 rows
successfully acquired 203767 rows
successfully acquired 203770 rows
successfully acquired 204423 rows
successfully acquired 209463 rows
successfully acquired 214484 rows
successfully acquired 219492 rows
successfully acquired 219492 rows
successfully acquired 229514 rows
successfully acquired 229514 rows
successfully acquired 239477 rows
successfully acquired 244473 rows
successfully acquired 244473 rows
successfully acquired 254499 rows
successfully acquired 259505 rows
successfully acquired 264501 rows
successfully acquired 264501 rows
successfully acquired 269519 rows
successfully acquired 274525 rows
successfully acquired 279549 rows
successfully acquired 284555 rows
successfully acquired 289561 rows
successfully acquired 378598 rows
successfully acquired 378598 rows
successfully acquired 378598 rows
successfully acquired 304651 rows
successfully acquired 304651 rows
successfully acquired 314733 rows

real    622m55.644s
user    511m52.562s
sys     4m11.595s
krivard commented 11 months ago

This will get auto-fixed when we do the final covid_hosp migration as part of the upcoming refactor.

krivard commented 11 months ago

Blocked on