cmu-delphi / covidcast

R and Python packages supporting Delphi's COVIDcast effort.
https://delphi.cmu.edu/covidcast/

Missouri ("mo") data from 'jhu-csse' missing for early dates #96

Closed nickreich closed 3 years ago

nickreich commented 3 years ago

The following line of code returns no data:

covidcast::covidcast_signal("jhu-csse", "deaths_cumulative_num", geo_type = "state", geo_values = "mo", start_day = "2020-03-10", end_day = "2020-06-14", as_of="2020-05-06")

It also gives warning messages like:

50: Fetching deaths_cumulative_num from jhu-csse for 20200428 in geography 'mo': no results

However data for "mo" was certainly available at this time, as can be seen from this file: https://github.com/CSSEGISandData/COVID-19/blob/476c78eb96eb2d34483daea4c2fc44f3b38bf847/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv#L1604

Can these early data for "mo" be added (other locations may be missing too; this was just the one we stumbled across), or could a warning be returned saying that the data provided may not be complete?

capnrefsmmat commented 3 years ago

It looks like the earliest issue date available for this signal is 2020-05-07, so you won't be able to get the signal with as_of="2020-05-06". This is likely because we had to retroactively reconstruct the historical data from our database backups when we introduced revision tracking.

@eujing, do you recall what date range was available to us when we reconstructed JHU issues, and why May 7 is the first available date?

nickreich commented 3 years ago

I guess I had expected that the data in this signal would be a facsimile of the data available in the corresponding versions of the JHU data on GitHub, but it sounds like that isn't the case. Is it documented somewhere what the "ground truth" is for each source?

capnrefsmmat commented 3 years ago

The ground truth for this source is the JHU data on GitHub. Every day, our pipeline downloads the latest CSV, parses all the geographies, and produces the signal you see in the API.

The problem is that as_of support requires a historical record of what data we ingested on a particular day. For example, if on May 7th JHU retroactively changes the count for Missouri on April 20th, asking for the data as_of May 6th should return the old count, not the new one.
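To make the as_of behavior concrete, here is a minimal sketch in Python. It is not the API's actual implementation: the `Row` type, the `value_as_of` helper, and the counts are all hypothetical, chosen only to illustrate how a query picks the most recent version issued on or before the requested date, and why it returns nothing when as_of predates the earliest stored issue.

```python
from datetime import date
from typing import NamedTuple, Optional


class Row(NamedTuple):
    """One stored version of a signal value (hypothetical schema)."""
    geo_value: str
    time_value: date  # the day the count refers to
    issue: date       # the day this version was ingested
    value: float


def value_as_of(rows, geo_value, time_value, as_of) -> Optional[float]:
    """Return the value as it was known on `as_of`, or None if no version existed yet."""
    candidates = [
        r for r in rows
        if r.geo_value == geo_value
        and r.time_value == time_value
        and r.issue <= as_of
    ]
    if not candidates:
        return None
    # The latest issue on or before as_of is what a user would have seen that day.
    return max(candidates, key=lambda r: r.issue).value


# Made-up counts: a first issue on May 7 and a revision on May 9.
rows = [
    Row("mo", date(2020, 4, 20), date(2020, 5, 7), 100.0),
    Row("mo", date(2020, 4, 20), date(2020, 5, 9), 105.0),
]

before_first_issue = value_as_of(rows, "mo", date(2020, 4, 20), date(2020, 5, 6))
between_issues = value_as_of(rows, "mo", date(2020, 4, 20), date(2020, 5, 8))
after_revision = value_as_of(rows, "mo", date(2020, 5, 10 - 0), date(2020, 5, 10)) if False else \
    value_as_of(rows, "mo", date(2020, 4, 20), date(2020, 5, 10))
```

With these rows, a query as of May 6 finds no version at all, which mirrors the "no results" warnings in the original report; a query as of May 8 sees the original count, and one as of May 10 sees the revised count.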

We didn't publicly release tracking of signal history until July 26th. Before then, each download of JHU data simply replaced the old data in our API. By parsing our database backups, we were able to recover the history of all changes starting May 7th. We did this from backups, not from JHU's Git history, so we could do the same for all of our signals from various sources.
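As a rough illustration of how a revision history might be rebuilt from dated snapshots, here is a small Python sketch. It is an assumption about the general approach, not a description of the actual backup-parsing code: the `issues_from_snapshots` function, the snapshot layout, and the counts are hypothetical.

```python
from datetime import date


def issues_from_snapshots(snapshots):
    """Derive (geo_value, time_value, issue, value) rows from dated snapshots.

    `snapshots` maps a snapshot date to a dict of
    {(geo_value, time_value): value}. A row is emitted whenever a value
    first appears or differs from the previously seen value for that key.
    """
    issues = []
    latest = {}  # most recent value seen so far for each (geo_value, time_value)
    for snap_day in sorted(snapshots):
        for key, value in sorted(snapshots[snap_day].items()):
            if key not in latest or latest[key] != value:
                geo_value, time_value = key
                issues.append((geo_value, time_value, snap_day, value))
                latest[key] = value
    return issues


# Made-up snapshots: the April 21 count is revised between the two backups.
snapshots = {
    date(2020, 5, 7): {
        ("mo", date(2020, 4, 20)): 100.0,
        ("mo", date(2020, 4, 21)): 110.0,
    },
    date(2020, 5, 8): {
        ("mo", date(2020, 4, 20)): 100.0,
        ("mo", date(2020, 4, 21)): 112.0,
    },
}

reconstructed = issues_from_snapshots(snapshots)
```

Note that the earliest snapshot bounds the earliest recoverable issue: nothing before the first backup can be reconstructed this way, which is consistent with the history starting on a fixed date.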

I'm just not sure why it was May 7th and not some other time, but we can find out.

nickreich commented 3 years ago

I see, thanks! That makes sense to me.

However, given that the public revision history is out there in the world to see, I suggest that you all consider, for a few sources (including JHU CSSE, which happens to be the source we at the COVID-19 Forecast Hub care about :-) ), tracing back the revision history based on the public record. As it is, your covidcast signal can't be comprehensive and authoritative for JHU CSSE without using that public data.

krivard commented 3 years ago

This work is under way in cmu-delphi/covidcast-indicators#23, and the geocoding refactor it's currently blocked on is expected to merge by early next week.

nickreich commented 3 years ago

Awesome, thanks for the update! I'll close this issue for now.