CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/
29.11k stars 18.39k forks source link

Mass Validation Project - Lots of issues found - Don't know best way to report / fix #2291

Open Marcotte67 opened 4 years ago

Marcotte67 commented 4 years ago

I took the US Confirmed Cases file and looked at the counts from 3/14 to 4/18 for each county.
I looked for the following possible error conditions on any of the days: ERROR-1 (Total Cases Decreased from the prior day) ERROR-2 (Total Cases > 50, but no new cases) ------ removed from error selection rules ERROR-3 (Total Cases=0, but there were cases >0 on prior days) ERROR-4 (Total Cases >5, R2 (compared to NYC) < 80%) I also created 'sparkline' views in Excel to easily visualize the data errors.

I found 773 out of 3259 records with one of the above errors. But, I don't know what to do next. What is the current process for fixing historically incorrect data? Is there a QA data team? I posted my validation file for others to reference.

Marcotte67 commented 4 years ago

COVID19 - Data Validation - 2020-04-19.xlsx

cpyic commented 4 years ago

Just of note. There seemed to be multiple rows with FIPS that begin with 8 or 9, with admin2 as "unassigned" and "out of xx" which I suspect might be county data that came from different sources etc. It would be hard to explain the decrease in number. On the state level, the sum of these counties might still do for total confirmed cases. But at the county/city level, some counties might have been split into multiple rows of data, and therefore these "unassigned" and "out of xx" county names. Not sure if this helps.

Marcotte67 commented 4 years ago

I think that when you take the sum of all counties to get each state total, it just obscures the issues at the county level.

The unassigned counties probably represents ‘unincorporated’ sections of each state which don’t have fire/police coverage.

I “repurposed” the initial key columns for the error checking. I use the ‘combined name’ as the unique key.

Anyway, I do believe that there are many significant historical data issues which need to be corrected (within the subset I isolated).

I was trying to identify all of them at one time instead of reporting them piecemeal.

The easiest way to see issues visually is to look at the tab with the spark lines in column cv(ish).

The total number of cases for a county should never be less than the day before. (Error type 1)

The reason I did the mass audit was because I had created a dashboard to graph cases/deaths at the county level and was seeing a lot of noise which was screwing up my forecasting trend lines.

What do you suggest?

Thanks!

Wishing you and your family good health!

Stephanie Geman-Marcotte Geman_Marcotte@yahoo.com 847-508-4843 https://www.linkedin.com/in/stephanie-geman-marcotte-7695115/ Skype - Geman_marcotte

On Apr 19, 2020, at 9:07 PM, cpyic notifications@github.com wrote:

 Just of note. There seemed to be multiple rows with FIPS that begin with 8 or 9, with admin2 as "unassigned" and "out of xx" which I suspect might be county data that came from different sources etc. It would be hard to explain the decrease in number. On the state level, the sum of these counties might still do for total confirmed cases. But at the county/city level, some counties might have been split into multiple rows of data, and therefore these "unassigned" and "out of xx" county names. Not sure if this helps.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

Marcotte67 commented 4 years ago

As an example: Look at Illinois, Adams County The US Confirmed .csv file has 4/7/2020 = 9, 4/8/2020 = 1, 4/9/2020 = 22

However, when I check this site... I see 4/8/2020 = 11 (not 1) http://dph.illinois.gov/covid19/covid19-statistics

It's a pain in the ass to check each questionable value this way, especially since the numbers are being restated all the time based on new categorization methodologies and missed cases, but I'm pretty sure it is the right thing to do if the data is going to be used for decision making purposes.

cpyic commented 4 years ago

Hi Stephanie, Thanks for the reply! And I hope that you and your family are all staying safe and protected during all this as well. I can see the frustration. It seems that this problem is very evident in recent week for the NYC data, since the "unassigned" county from NY state has recorded 500+ and 1000+ deaths on 4/17 and 4/18 while there has been zero increase in NY city death data. It would be worthwhile to start validating numbers as you had done from counties with larger case numbers and available data from government reports. I would hope to be able to trace out some patterns, for instance, not adding up from the previous days etc, so one could just add up for case numbers in latter days. I guess that in the example you showed it might be a typo. I agree that valid numbers are important for policy making especially at this moment. I might be able to look further later today. Is there a specific state that you are focusing on? Like NY, NJ?

Best, Teresa Chiang (cpytaa@gmail.com)

cpyic commented 4 years ago

@Marcotte67 Hi Stephanie, an user shared with me this website that has county-based data. I guess your might be interested since they received updates directly from county. https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/ Some states were changing reporting methods for deaths and added another layer of confusion. Hope this helps.

Marcotte67 commented 4 years ago

Hi, thanks for following up! I am interested in Illinois data, but can pull it all myself from the state site which has data at the county and zip code level for every date. And since the data is continually being restated for historical dates, it is necessary to pull the entire thing every day to update my "database". But, I want to be able to pull values from a single source for every county and country, otherwise it is too painful to keep up with.

Anyway, the point of my project (which is just curiosity), is to compare case/death growth at the county level. Why do some counties grow exponentially faster than others? Is it always dependent on population size? Population density? Likelihood to follow distancing orders? Why do the CFR's differ by county - is it due to quality of healthcare? % of overweight people? Median income? Median age?

And most importantly, I'm trying to estimate the total death count by July. If we have hit the US peak, will the number of deaths on the right side of the curve be double the deaths going up? Which would make the total about 120,000? What is the speed of the death decay by individual county? Why would this differ?

Anyway, even a little noise in the data makes this impossible. But, if I can use an algorithm to determine which counties shouldn't be used for analysis (set a flag).

I don't mean to be a pain, and I know that it is "virtually" impossible to keep up with the changes in data, difficult sources, lags, and reporting methodologies. So, thank you for even trying to tackle the project!