CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

Problem with US time series data #472

Open paldhous opened 4 years ago

paldhous commented 4 years ago

There seems to be a systematic problem in the latest release in which cases are undercounted on days prior to Mar 10, then suddenly appear under Mar 10. One result is that the total case count calculated for the US goes from 584 on Mar 9 to 1670 on Mar 10.

radixdev commented 4 years ago

I believe this may be the result of more frequent testing and reporting.

chiester commented 4 years ago

See #382

christruszkowski commented 4 years ago

Starting 2020-03-10, the US data is being double counted: rows for counties and rows for states capture the same cases.

treerunner commented 4 years ago

Please take a look at the graph of cases by country by date linked below. Notice the jump in US cases over the last day? Is this because state rows in the time series data include cases from the county/city level? Or is it due to increased testing? Something tells me the numbers are still doubling up when I aggregate all US data over time. I wish the CSV files could somehow include state/county/country-level granularity for those who would like to zoom in, or not. Right now I do not know how to address this without scripting my way through the data (which increases the chance of error).

One solution: If you are no longer reporting by city or county, simply drop those rows and roll their data into the appropriate columns of the state row. This way the state totals will be accurate over time.

https://covid19-cases.herokuapp.com/

justindlongtx commented 4 years ago

I did a pull of the latest time series data and there doesn't appear to be double counting - I'm looking at the raw data and there are entries only for counties, not for states? But the numbers have taken a sizable jump, and when I sum all the "US" fields for 3/10, they come to 1,670, not 1,039 as is given on the main dashboard. There does seem to be a jump in the total # of reports for 3/10, so I'm pretty sure the 1,670 is right, but I don't understand why it doesn't match the number on the JH dashboard for the US.

aatishb commented 4 years ago

See #382. The issue (I believe) is double counting: the data is reported at both state and county level, so if you sum all entries with Country/Region = US, you end up with an overcount. Looks like this change is being debated.

aatishb commented 4 years ago

> I did a pull of the latest time series data and there doesn't appear to be double counting - I'm looking at the raw data and there are entries only for counties, not for states? But the numbers have taken a sizable jump, and when I sum all the "US" fields for 3/10, they come to 1,670, not 1,039 as is given on the main dashboard. There does seem to be a jump in the total # of reports for 3/10, so I'm pretty sure the 1,670 is right, but I don't understand why it doesn't match the number on the JH dashboard for the US.

As of right now, there is definitely overcounting. NYT is also reporting a number closer to 1,000 cases.

Compare:

Washington,US,47.4009,-121.4905,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,267
"King County, WA",US,47.6062,-122.3321,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,6,9,14,21,31,51,58,71,83,83,116

If you sum the US fields, you will be adding both state- and county-level data.
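
A minimal sketch of that overcount, using the two rows quoted above trimmed to their last two date columns. The header names (`Province/State`, `Country/Region`) and the date labels (`3/9/20`, `3/10/20`) are assumptions about the CSV layout, not something stated in this thread.

```python
import csv
import io

# The two rows quoted above, trimmed to the last two date columns.
# Header names and date labels are assumed, not taken from the thread.
SAMPLE = """Province/State,Country/Region,Lat,Long,3/9/20,3/10/20
Washington,US,47.4009,-121.4905,0,267
"King County, WA",US,47.6062,-122.3321,83,116
"""

rows = [r for r in csv.DictReader(io.StringIO(SAMPLE))
        if r["Country/Region"] == "US"]

# Naive sum over every US row: state and county rows are both included.
naive_total = sum(int(r["3/10/20"]) for r in rows)

# Subtotal for county rows, identified here by a comma in the name.
county_total = sum(int(r["3/10/20"]) for r in rows
                   if "," in r["Province/State"])
state_total = naive_total - county_total

print(naive_total, state_total, county_total)
```

On these two rows the naive sum mixes the Washington state figure with the King County figure, which is the pattern that inflates the US total.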

tmeacham commented 4 years ago

The issue seems to stem from the switch to reporting state aggregates. The old location names were left in the time series file, causing duplication. Ideally, the old names would be changed and aggregated according to the new state-level standard, with duplicates removed.

At the moment I am just filtering out any location name with a ',' in it. This gives accurate "up to date data", but mucks up the time series.
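
A rough sketch of that comma filter, run on two sample rows from this thread trimmed to their last two date columns. The header names, the date labels, and the helper name `state_level_rows` are assumptions for illustration only.

```python
import csv
import io

# Sample rows from this thread, trimmed to two date columns.
# Header names and date labels are assumed.
SAMPLE = """Province/State,Country/Region,Lat,Long,3/9/20,3/10/20
Washington,US,47.4009,-121.4905,0,267
"King County, WA",US,47.6062,-122.3321,83,116
"""

def state_level_rows(rows):
    """Keep rows whose location name has no comma (assumed state-level)."""
    return [r for r in rows if "," not in r["Province/State"]]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
kept = state_level_rows(rows)

# Per-date totals over the rows that survive the filter.
series = {d: sum(int(r[d]) for r in kept) for d in ("3/9/20", "3/10/20")}
print(series)
```

The latest date comes out right, but the earlier date drops to zero because the county history was filtered away, which is the time series problem mentioned above.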

justindlongtx commented 4 years ago

Yes, indeed. I found it now. Thank you.

justindlongtx commented 4 years ago

Also note that the state rows do not have aggregated data for each state before 3/10! Most of you probably already realize this, but I mention it to save anyone else the silly realization I just made.
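
One possible workaround, sketched under strong assumptions: use county rows for dates before the 3/10 switch and state rows from 3/10 onward. This assumes county rows are complete before the switch and state rows are complete after it, which this thread does not confirm. The header names, date labels, and function names are hypothetical.

```python
import csv
import io

# Sample rows from this thread, trimmed to two date columns.
# Header names and date labels are assumed.
SAMPLE = """Province/State,Country/Region,Lat,Long,3/9/20,3/10/20
Washington,US,47.4009,-121.4905,0,267
"King County, WA",US,47.6062,-122.3321,83,116
"""

DATES = ["3/9/20", "3/10/20"]
CUTOVER = "3/10/20"  # first date with state-level aggregates, per the thread

def is_county(row):
    # Assumption: county-level rows are the ones with a comma in the name.
    return "," in row["Province/State"]

def stitched_us_series(rows):
    """Sum county rows before CUTOVER and state rows from CUTOVER onward."""
    cut = DATES.index(CUTOVER)
    series = {}
    for i, date in enumerate(DATES):
        want_county = i < cut
        series[date] = sum(
            int(r[date]) for r in rows
            if r["Country/Region"] == "US" and is_county(r) == want_county
        )
    return series

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
stitched = stitched_us_series(rows)
print(stitched)
```

Any remaining jump at the switch date would then reflect a real difference between the county and state figures rather than rows being summed twice.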

tmeacham commented 4 years ago

Yep, that is why I said it "mucks up" the time series. My hope is the historic locations will be updated to the new standard (state level only). Otherwise you will need a complex multistep process to return the data to sanity. Some other inconsistent naming issues I script around are here:

Iran (Islamic Republic of), Iran
Mainland China, China
Republic of Korea, South Korea
occupied Palestinian territory, Palestine
Hong Kong SAR, Hong Kong
Macao SAR, Macao
Viet Nam, Vietnam
Russian Federation, Russia
Republic of Moldova, Moldova
Taipei and environs, Taiwan
Holy See, Vatican City

*Mainland China isn't an issue so much as my mapping software doesn't like that name.
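
The renames listed above could be applied with a simple lookup table. This is only a sketch of one way to script around the naming; `NAME_MAP` and `normalize_country` are hypothetical names, and names not in the table pass through unchanged.

```python
# Old name -> preferred name, copied from the list above.
NAME_MAP = {
    "Iran (Islamic Republic of)": "Iran",
    "Mainland China": "China",
    "Republic of Korea": "South Korea",
    "occupied Palestinian territory": "Palestine",
    "Hong Kong SAR": "Hong Kong",
    "Macao SAR": "Macao",
    "Viet Nam": "Vietnam",
    "Russian Federation": "Russia",
    "Republic of Moldova": "Moldova",
    "Taipei and environs": "Taiwan",
    "Holy See": "Vatican City",
}

def normalize_country(name):
    """Return the preferred name, or the stripped input if it is not mapped."""
    cleaned = name.strip()
    return NAME_MAP.get(cleaned, cleaned)

print(normalize_country("Viet Nam"))
print(normalize_country("US"))
```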