Ingest and archive CDC hospitalization data

capnrefsmmat commented 4 years ago

From @ryantibs:

We should start ingesting the CDC NHSN hospitalization data and the CDC COVID-NET hospitalization data. These appear to be two alternative CDC-run surveillance systems for state-level COVID hospitalizations. We should compare them against each other, in whatever ways are possible, and also against the state-level COVID hospitalization data that COVID Tracking collects from the state departments of health.

From @RoniRos:

It's important to start this ASAP, because there is backfill (aka data revisioning), and no guarantee that anyone is storing the historical versions, which are critical for real-time forecasting.

Until we download and compare these, it won't be clear what exactly should go into the API, so we should:

[ ] download the data regularly, archiving historical versions so we understand the backfill
[ ] compare the data from the different sources
[ ] determine which sources we want to use
[ ] determine if these should be in the API
[ ] put them in the API

This issue will just be for the first three steps; once we know what should go into the API and how, the Indicators team can plan what release this should go into.
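For the first step, a minimal sketch of versioned archiving could look like the following. It is only an illustration: the function name `archive_snapshot`, the `.bin` suffix, and the directory layout are assumptions made here, not an existing Delphi tool. Each download is stored under a UTC timestamp plus a content hash, and a download identical to the most recent stored version is skipped, so every file on disk corresponds to an actual revision.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def archive_snapshot(payload: bytes, source: str, archive_dir: Path) -> Optional[Path]:
    """Store one downloaded payload as a new version (hypothetical helper).

    Returns the path written, or None when the payload is byte-identical
    to the most recently archived snapshot for this source.
    """
    target_dir = archive_dir / source
    target_dir.mkdir(parents=True, exist_ok=True)

    digest = hashlib.sha256(payload).hexdigest()[:16]

    # File names start with a sortable UTC timestamp, so the last name
    # in sorted order is the newest snapshot.
    existing = sorted(target_dir.glob("*.bin"))
    if existing and existing[-1].stem.split("_")[-1] == digest:
        return None

    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    path = target_dir / f"{stamp}_{digest}.bin"
    path.write_bytes(payload)
    return path
```

Calling this from a daily job with the raw response body (for example, `archive_snapshot(resp.content, "nhsn", Path("archive"))`) would keep the history needed to study backfill without storing unchanged files repeatedly.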

brookslogan commented 4 years ago

Regarding COVID-NET:

The most similar existing source is FluSurv-NET, which David implemented. It was a more involved source: it used the Flash AMF format and may have needed some browser dev console monitoring to get running. Tailoring it for COVID-NET would involve editing the locations and determining whether their IDs in the system have changed (which might end up needing web monitoring), as well as mapping whatever date representation their endpoint provides to actual dates.

Downloading this manually will take a while, since it requires selecting each location. I'm doing that now for the current data.

eujing commented 4 years ago

I was checking how COVID-NET calls its API to get the data. As @brookslogan said, it relies on a networkid and catchmentid for each location, and it POSTs to https://gis.cdc.gov/grasp/covid19_3_api/PostPhase03DownloadData.

However, they also seem to have an endpoint that gives the mappings for all sorts of IDs, like networkid and catchmentid, the age group IDs, the MMWR date ranges, etc.

import requests

# Endpoint that returns the ID mappings used by the COVID-NET app
url = "https://gis.cdc.gov/grasp/covid19_3_api/GetPhase03InitApp"
resp = requests.get(url, params={"appVersion": "Public"})
data = resp.json()

data["catchments"] gives us something like the following:

[{'networkid': 1,
  'name': 'COVID-NET',
  'area': 'Entire Network',
  'catchmentid': '22',
  'beginseasonid': 49,
  'endseasonid': 51},
 {'networkid': 2,
  'name': 'EIP',
  'area': 'California',
  'catchmentid': '1',
  'beginseasonid': 43,
  'endseasonid': 51},
 {'networkid': 2,
  'name': 'EIP',
  'area': 'Colorado',
  'catchmentid': '2',
  'beginseasonid': 43,
  'endseasonid': 51},
  ...

data["ages"] gives us something like the following:

[{'label': '40-49 yr',
  'ageid': 12,
  'parentid': 3,
  'color_hexvalue': '#70BE3B'},
 {'label': '30-39 yr',
  'ageid': 11,
  'parentid': 3,
  'color_hexvalue': '#70BE3B'},
  ...

data["mmwr"] gives us something like:

[{'mmwrid': 3036,
  'weekend': '2020-03-07',
  'weeknumber': 10,
  'weekstart': '2020-03-01',
  'year': 2020,
  'yearweek': 202010,
  'seasonid': 59,
  'label': 'Mar-07-2020',
  'weekendlabel': 'Mar 07, 2020',
  'weekendlabel2': 'Mar-07-2020'},
  ...

Just sharing what I hope might help with the downloading!
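As a rough sketch of how these mappings could be used, the helpers below turn the `catchments` and `mmwr` lists into lookup tables. The function names are made up for illustration, and the sample records are copied from the output above; this does not show the request body for the download endpoint, which is not documented in this thread.

```python
from datetime import date


def index_catchments(catchments):
    """Map (network name, area) to (networkid, catchmentid)."""
    return {
        (c["name"], c["area"]): (c["networkid"], c["catchmentid"])
        for c in catchments
    }


def index_mmwr(mmwr):
    """Map mmwrid to (week start, week end) as datetime.date objects."""
    return {
        m["mmwrid"]: (
            date.fromisoformat(m["weekstart"]),
            date.fromisoformat(m["weekend"]),
        )
        for m in mmwr
    }


# Sample records taken from the responses shown above (extra fields omitted).
sample_catchments = [
    {"networkid": 1, "name": "COVID-NET", "area": "Entire Network", "catchmentid": "22"},
    {"networkid": 2, "name": "EIP", "area": "California", "catchmentid": "1"},
    {"networkid": 2, "name": "EIP", "area": "Colorado", "catchmentid": "2"},
]
sample_mmwr = [
    {"mmwrid": 3036, "weekstart": "2020-03-01", "weekend": "2020-03-07"},
]

catchment_ids = index_catchments(sample_catchments)
mmwr_weeks = index_mmwr(sample_mmwr)
```

With the real response, `index_catchments(data["catchments"])` would give the ID pair needed for each location, and `index_mmwr(data["mmwr"])` would translate the `mmwrid` values in downloaded rows into calendar dates.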

eujing commented 4 years ago

I have been downloading the two datasets to my computer since yesterday (27 May):

1) COVID-NET: the initial JSON with all the ID mappings, plus the DataDownload JSON for each network and each state, which contains the hospitalization rates for all age groups.
2) NHSN: the aggregate NHSN COVID-19 Module data CSV.

I also compiled an initial summary comparing the three datasets, including COVID Tracking, in this document.

In general, the estimated hospitalization rate differs quite a bit between COVID-NET and COVID Tracking for states like New York and Connecticut, but is quite similar for states like Ohio and Oregon. In the meantime I am double-checking that the comparison is done correctly (at least for Maryland, since unlike other states it has 100% participation from all its counties in COVID-NET).

So far I have found no obvious way to compare the NHSN data to the other datasets, because it mainly deals with hospital / ICU utilization, which the others do not seem to have data about. This is elaborated in more detail in the document.

RoniRos commented 4 years ago

Regarding NHSN Hospitalization data:

At a call with CDC today, they told us a bit more about it:

What is being reported is prevalence, not incidence. They are planning to convert it to estimated incidence based on the Length of Stay (LoS) distribution. LoS data will come from COVID-NET, which has it across ages and jurisdictions, but not for each state separately. They can share with us both the process and the end result.
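To make the prevalence/incidence relationship concrete, here is one simple way such a conversion could work. This is not CDC's method, which was not described to us; it is a sketch under the assumptions that `survival[k]` is the probability a patient is still hospitalized k days after admission (with `survival[0]` = 1), that the LoS distribution is the same for everyone, and that nobody was hospitalized before day 0. Under those assumptions prevalence is a convolution of incidence with the survival curve, which can be inverted day by day.

```python
def prevalence_from_incidence(incidence, survival):
    """Forward model: patients in hospital on day t are earlier admissions
    weighted by the chance they have not yet been discharged."""
    max_lag = len(survival) - 1
    return [
        sum(incidence[t - k] * survival[k] for k in range(min(t, max_lag) + 1))
        for t in range(len(incidence))
    ]


def incidence_from_prevalence(prevalence, survival):
    """Invert the forward model recursively: subtract the patients carried
    over from earlier admissions; what remains is today's admissions."""
    max_lag = len(survival) - 1
    incidence = []
    for t, current in enumerate(prevalence):
        carried_over = sum(
            incidence[t - k] * survival[k] for k in range(1, min(t, max_lag) + 1)
        )
        incidence.append((current - carried_over) / survival[0])
    return incidence


# Hypothetical numbers, for illustration only.
example_survival = [1.0, 0.6, 0.3, 0.1]
example_incidence = [5.0, 3.0, 0.0, 2.0, 4.0]
example_prevalence = prevalence_from_incidence(example_incidence, example_survival)
recovered = incidence_from_prevalence(example_prevalence, example_survival)
```

On noise-free input the recursion recovers the incidence exactly (up to floating-point error), but with real reported counts this kind of direct inversion tends to amplify noise, so a practical estimate would likely need smoothing or regularization.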

The NHSN website shows the % of all hospitals that report into this system, per state.
This system is normally used by all hospitals that get Medicare reimbursement and that are not rural or critical care only (=no ICU). So basically cohortive reporting, but reporters change over time, a bit like ILINet. They are still trying to figure out how to weight the hospitals over time so that they are representative.

There is actually a federal mandate for all hospitals to report, but each hospital can choose to report into NHSN or into some commercial org, like Teletracker. CDC also has access to Teletracker data, but it has run into data difficulties trying to integrate it. ;-(

krivard commented 4 years ago

[ ] Put a wip_ signal in the API for map testing
[ ] Write API docs

dfarrow0 commented 3 years ago

Any status changes since the May 28 update?

Is the data still being manually downloaded to someone's computer?

@eujing @krivard

eujing commented 3 years ago

@dfarrow0 I think the CDC COVID-NET data is being scraped and run as a (non-public?) indicator already as cdc_covidnet.

The NHSN hospitalization data is still being downloaded manually to my computer. I think it was not turned into an indicator at the time because it seems to report only prevalence.

dfarrow0 commented 3 years ago

thanks @eujing !

@krivard you might want to close this issue out given https://github.com/cmu-delphi/covidcast-indicators/pull/79

krivard commented 3 years ago

No, COVID-NET is not being run. It's been merged to main, but has no deploy-* branch and has not been automated (or if it has, nobody told me...)

iirc the coverage was not high enough to deploy it for states, and we don't yet support HHS regions.

cmu-delphi / covidcast-indicators

Ingest and archive CDC hospitalization data #45