cmu-delphi / covidcast-indicators

Back end for producing indicators and loading them into the COVIDcast API.
https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html
MIT License

CDC case and death data by state #1392

Open nickreich opened 2 years ago

nickreich commented 2 years ago

It would be great if you all could incorporate this CDC case and death data source, so that its versioned history could be recorded. It has been brought to our attention at the COVID-19 Forecast Hub that this dataset differs in important ways from the JHU CSSE dataset that we use as "ground truth".

Data details

https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36

The data source is made available as a CSV file. The important fields here are the ones that report the numbers of new cases and new deaths.

krivard commented 2 years ago

I've set up a script to download the CSV daily while we put this together, just to retain a record.

Do you think Forecast Hub might be moving to using this source as a target in the future, or is this primarily for posterity?

It says the file gets updated twice daily; do you know if that's automated (so we can pick a fixed schedule to check) or manual (so if we pick a fixed schedule, we're more likely to miss some updates)?
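For reference, a daily snapshot script of the kind mentioned above might look roughly like the following. This is a minimal sketch, not the script actually in use: the export URL follows the usual Socrata pattern for dataset `9mfq-cb36` and should be verified, and the file naming scheme and the `snapshot_path` / `save_snapshot` helpers are made up for illustration.

```python
import datetime as dt
import pathlib
import urllib.request

# Assumed Socrata CSV export URL for dataset 9mfq-cb36; verify before relying on it.
CSV_URL = "https://data.cdc.gov/api/views/9mfq-cb36/rows.csv?accessType=DOWNLOAD"


def snapshot_path(out_dir, when):
    """Return a timestamped path so every download is kept as its own version."""
    return pathlib.Path(out_dir) / f"cdc_cases_deaths_{when:%Y%m%dT%H%M}.csv"


def save_snapshot(out_dir, when=None, fetch=None):
    """Download the CSV once and write it under a timestamped file name."""
    when = when or dt.datetime.now(dt.timezone.utc)
    # `fetch` is injectable so the function can be exercised without network access.
    fetch = fetch or (lambda url: urllib.request.urlopen(url, timeout=60).read())
    path = snapshot_path(out_dir, when)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(fetch(CSV_URL))
    return path
```

If the upstream updates turn out to be manual rather than automated, running `save_snapshot` more than once a day (for example from cron) would reduce the chance of missing an update, at the cost of storing some identical files.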

nickreich commented 2 years ago

I don't know anything about the schedule of this. @stevemcconnell has been monitoring this dataset closely and might have a better idea.

I don't have a clear sense of the Forecast Hub use at this point. We might try to make a switch at some point, but that would likely be months down the road.

stevemcconnell commented 2 years ago

I have not monitored the update frequency or timing closely. My sense has been that, as a government data source, the updates have been during government work hours, but that's just an impression. I believe that capturing a snapshot at midnight Pacific time would be a reasonable approach for this data set.

Versioning for this data seems at least as important as versioning the JHU CSSE data set. A high percentage of the data updates to this data set are updates that would be characterized as "backfills" on the JHU CSSE data set, i.e., many updates are for days that precede the current date.
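To illustrate what such "backfill" updates look like in practice, here is a rough sketch that compares two saved snapshots and lists the rows for already-reported dates whose values changed. The column names (`submission_date`, `state`, `new_case`, `new_death`) are assumptions about the CSV layout, and the sample values are made up.

```python
import csv
import io


def load_snapshot(csv_text, key_cols=("submission_date", "state")):
    """Index a snapshot's rows by (date, state). Column names are assumptions."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {tuple(r[c] for c in key_cols): r for r in rows}


def find_revisions(old_text, new_text, value_cols=("new_case", "new_death")):
    """Return keys present in both snapshots whose values differ in the newer one."""
    old, new = load_snapshot(old_text), load_snapshot(new_text)
    revised = {}
    for key, old_row in old.items():
        new_row = new.get(key)
        if new_row is None:
            continue  # row disappeared; not treated as a revision in this sketch
        changes = {c: (old_row[c], new_row[c])
                   for c in value_cols if old_row[c] != new_row[c]}
        if changes:
            revised[key] = changes
    return revised


# Made-up example data: the 2021-11-01 value is revised in the later snapshot.
old = ("submission_date,state,new_case,new_death\n"
       "2021-11-01,WA,100,2\n2021-11-02,WA,120,1\n")
new = ("submission_date,state,new_case,new_death\n"
       "2021-11-01,WA,105,2\n2021-11-02,WA,120,1\n2021-11-03,WA,90,0\n")
print(find_revisions(old, new))
```

Rows that only appear in the newer snapshot (here 2021-11-03) are ordinary new data rather than backfill, so they are not reported.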

nmdefries commented 2 years ago

This source reports states/territories, New York state without New York City, and New York City by itself (see "Number of Jurisdictions Reporting"). Other sources report NYC separately also, so we've had to deal with this before, but we don't have a standard approach.

For flu indicators in the past, we've reported NYC as a "state" and NY state minus NYC as a separate "state".

On the other hand, for USAFacts, we report NY state as the sum of NYC and the rest of the state, and then try to report reasonable NYC county-, HRR-, and MSA-level values by breaking down the NYC value in different ways.

This CDC source doesn't report any regions smaller than state (no zip codes, counties, or other cities), so the NYC-derived values would be the only ones available at finer resolutions, and thus wouldn't be particularly useful. They would, however, be more intuitive to use/access.

Preferences for which approach to use?

stevemcconnell commented 2 years ago

@nmdefries As you pointed out, other data sources treat NYC and NY state differently, and the teams are used to handling that variation. The important thing in my mind would be to clearly state which approach you are using so that teams don't accidentally exclude the NYC data. Personally, I don't see much value in special-casing NYC as the only exception to state-level reporting, so my preference would be to combine them.
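For concreteness, the "combine them" option could be sketched as below: NYC rows are folded into NY before summing per date. This is only an illustration under assumptions, not the indicator's implementation. It assumes the source uses the codes `NYC` and `NY` in a `state` column and that the count columns are named `new_case` and `new_death`; `combine_jurisdictions` is a hypothetical helper.

```python
from collections import defaultdict

# Assumed jurisdiction codes: "NYC" for New York City, "NY" for the rest of the state.
MERGE_INTO = {"NYC": "NY"}


def combine_jurisdictions(rows, value_cols=("new_case", "new_death")):
    """Sum counts per (date, state) after folding NYC into NY."""
    totals = defaultdict(lambda: dict.fromkeys(value_cols, 0))
    for row in rows:
        state = MERGE_INTO.get(row["state"], row["state"])
        for col in value_cols:
            totals[(row["submission_date"], state)][col] += int(row[col])
    return dict(totals)


# Made-up example rows.
rows = [
    {"submission_date": "2021-11-01", "state": "NY", "new_case": "10", "new_death": "1"},
    {"submission_date": "2021-11-01", "state": "NYC", "new_case": "20", "new_death": "2"},
    {"submission_date": "2021-11-01", "state": "WA", "new_case": "5", "new_death": "0"},
]
print(combine_jurisdictions(rows))
```

Whichever approach is chosen, documenting it alongside the signal would address the concern about teams accidentally excluding the NYC data.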