CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

Michigan data on 6/6 is incorrect: government website sending bad data #2671

Closed jdmsolarius closed 4 years ago

jdmsolarius commented 4 years ago

It seems the state of Michigan has changed its counting method to include probable cases and deaths.

All of these additional cases were simply added to the 6/6/2020 datapoint, making it look like thousands of deaths occurred on the same day. The current method of simply adding this data to the CSV is incorrect.


This changes core statistics like the mean, median and variance and effectively makes this dataset worthless.

This has enormous implications for the consumption of this data: many statistics programs will eliminate this datapoint as an outlier, as it fails many commonly used outlier tests.
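For illustration, a minimal sketch of one such test (Tukey's 1.5 × IQR fences) applied to a daily series. The counts are made up, and `flag_outliers` is a hypothetical helper, not part of any statistics package mentioned in this thread.

```python
import statistics


def flag_outliers(daily, k=1.5):
    """Return indexes of values outside Tukey's fences (Q1 - k*IQR, Q3 + k*IQR)."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(daily, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, value in enumerate(daily) if value < low or value > high]


# Hypothetical daily counts; the last day absorbs a large backlog.
daily = [50, 60, 55, 65, 58, 62, 5800]
flagged = flag_outliers(daily)  # indexes of the days a consumer might drop
```

A consumer that silently drops the flagged day would lose those cases entirely, which is the concern raised here.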

Hopkins' response to this is incorrect. This CSV needs to be split into two. You simply cannot keep the data in one file, because a death on 6/4 is not counted the same way as a death on 6/8, but the implication of the CSV is that they are.

The only other option I can think of is adding a Boolean column to the existing CSV called “afterCriteriaChanges” or something.
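A minimal sketch of the Boolean-column idea, assuming the data is held in long format (one row per date) rather than the repository's wide time series layout. The row structure, the counts, the change date, and the `add_criteria_flag` helper are all hypothetical; only the column name comes from the suggestion above.

```python
from datetime import date


def add_criteria_flag(rows, change_date):
    """Return copies of rows with an afterCriteriaChanges Boolean added."""
    return [
        dict(row, afterCriteriaChanges=row["date"] >= change_date)
        for row in rows
    ]


# Hypothetical rows; the change date is passed in rather than hard-coded.
rows = [
    {"date": date(2020, 6, 4), "deaths": 10},
    {"date": date(2020, 6, 8), "deaths": 12},
]
flagged = add_criteria_flag(rows, change_date=date(2020, 6, 6))
```

Consumers could then filter or group on the flag instead of needing two separate files.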

jdmsolarius commented 4 years ago

Can we get some clarity on this? People rely on this to be accurate data. The fact that the state of Michigan changed its accounting method on 6/6 must be accounted for, or the data is meaningless for statistical purposes.

People look to Johns Hopkins as an accurate data-source for covid related data and it is not accurate that thousands of people died in Michigan on the sixth. The CSV states that thousands of people died on the sixth.

There are several alternative ways to deal with this that must be discussed.

zurdibus commented 4 years ago

michigan.gov/coronavirus does not list probable data. The data in the spreadsheet https://www.michigan.gov/documents/coronavirus/Cases_by_County_by_Date_2020-06-12_152754_693759_7.xlsx does not have the data either, and if you take a sum of the cases column it equals 59621, which is the number of confirmed cases... I agree it's statistically a nightmare to just throw 6 thousand plus cases in there and hundreds of deaths. I suppose at least Michigan provides the accurate data in a spreadsheet every day, but that really isn't the point.

jdmsolarius commented 4 years ago

It’s not just a statistical nightmare; it destroys the integrity of the data. We are taking data that we know for a fact is false and broadcasting it to the world. This data shows us that a calamity happened on the 6th.

Important interstate comparisons can no longer be done because important figures such as variance and mean are now way off. It’s better not to report data than to report data you know is false.

Also, statistical programs consuming the data will eliminate this data-point as an outlier, causing even more problems.

Even if we’ve reached a dead end data-wise, there are better ways from a data standpoint to deal with the spike than to add everything to 6/6. For example, you could use a statistical method to retroactively apply them to data-points before the 6th. This is still bad but far better than just dumping them on a random day.
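One possible version of such a method, as a rough sketch only: spread the lump over the earlier days in proportion to each day's existing count. This is an assumption about what "a statistical method" could mean, not anything JHU or Michigan does; `redistribute_lump` and the numbers are hypothetical.

```python
def redistribute_lump(daily, lump_index, lump_size):
    """Move lump_size counts from daily[lump_index] onto earlier days,
    weighted by each earlier day's share of the earlier total."""
    earlier = daily[:lump_index]
    total = sum(earlier)
    if total == 0:
        raise ValueError("no earlier counts to use as weights")

    # Ideal (fractional) share for each earlier day.
    shares = [lump_size * count / total for count in earlier]
    extra = [int(share) for share in shares]

    # Largest-remainder rounding so the integer shares sum to lump_size.
    leftover = lump_size - sum(extra)
    by_remainder = sorted(
        range(len(earlier)), key=lambda i: shares[i] - extra[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        extra[i] += 1

    adjusted = list(daily)
    adjusted[lump_index] -= lump_size
    for i, amount in enumerate(extra):
        adjusted[i] += amount
    return adjusted


# Hypothetical series: the last day holds 4000 backlogged cases.
adjusted = redistribute_lump([100, 200, 100, 4100], lump_index=3, lump_size=4000)
```

The overall total is preserved; only the dates change. The resulting dates are modelled, not observed, so they would need to be labelled as estimates.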

jdmsolarius commented 4 years ago

The fact that one of our data sources is giving us bad data does not mean we have the responsibility to just forward that bad data to our consumers. This site claims to draw data from dozens of data sources.

Jelfff commented 4 years ago

Look at the Michigan "unassigned" cases. 5,888 new unassigned cases were added on June 7. Then on June 9 the Michigan unassigned cases were reduced by 5,893. Does this help to fix the Michigan problem?

jdmsolarius commented 4 years ago

No, it does not at all. This is the Michigan problem. Here is a graph of the total cases for Michigan based on the Hopkins CSV: [image]

What conclusions would you draw from this graph? I would say either A: my statistics software is detecting an anomaly and I am going to eliminate this as a one-time outlier, or B: a calamity happened in Michigan on the sixth and a massive number of sick people were admitted to the hospitals and many died there.

This is a data issue.

jdmsolarius commented 4 years ago

I opened this ticket because we know that there aren't 5800 new cases on June 6th that didn't exist on the fifth. We know that a massive number of people didn't die on the 6th, but that's what the data is telling our consumers.

It also throws off the dataset's mean, variance, and other statistics and makes meaningful comparisons to other states impossible.

One of our data-sources is giving Hopkins bad data and Hopkins is essentially just forwarding that data to our consumers. This website is relied upon by people for accurate data; our role is not just to forward data that comes from government sources if it's not good data.

The new deaths and cases in Michigan did not happen in one day, but our CSV says they did. This is a problem the team has to address. Michigan adding probable cases is great; dumping them all into a single day, though, is not acceptable, because the probable cases didn't all happen on the same day.

RogerLustig commented 4 years ago

On top of all this, on 6/5 Allegan County's Confirmed (over 200) got switched with Alger County's Confirmed (0). They switched back on 6/9.

zurdibus commented 4 years ago

The state of Michigan didn't change its counting method. Nowhere does it postulate probable new cases per day. Not a single place. It just adds the probable cases in as it sees fit for the purpose of giving as much data as possible. This dataset is tainted every day it treats them as a per-day increase in cases, especially on the day it added them all.

The chart I have attached doesn't track specifically with the new cases per day because those are positive that day, but Michigan does track new cases by date of onset as well, and then retroactively adjusts it. What is coming out of this dataset now is nothing like what Michigan is reporting. It's just adding the probable cases in error as new probable cases.

[chart: confirmed cases by date of onset]

jdmsolarius commented 4 years ago

The salient point, though, is that if we know the data is tainted, then Johns Hopkins has a responsibility not to report inaccurate COVID data as if it were accurate and nothing were wrong.

Are we just spitting out a spreadsheet for the sake of spitting out a spreadsheet? Just because the data-source is government doesn’t absolve Johns Hopkins of the responsibility to fact-check.

People use Hopkins data to make important decisions. If a data source is giving tainted data, Johns Hopkins should not be using it. The fact that the source is government is irrelevant.

The 6/6 datapoint is a lie, and we know it’s a lie and Johns Hopkins knows it’s a lie.

Why is a datapoint that everyone agrees is false included in the dataset? I mean, literally everyone knows this is not a real datapoint, but here it is just because the data-source is .gov?

zurdibus commented 4 years ago

My point is the government isn't tainting it; Johns Hopkins has made an error in their assumptions. Data from Michigan does not include probable cases in their daily counts. Johns Hopkins is erroneously adding in probable data, assuming it's related in any way to new daily counts. The data from Michigan is correct. They provide many CSV files with the correct data every day.

jdmsolarius commented 4 years ago

Then this is something they need to fix to be an accurate data-source.

Jelfff commented 4 years ago

Maybe this will help shed light.

Recently I finished code that converts the JHU cumulative case counts within the USA into daily case counts. My code produces 1 csv file per month beginning in March (data starts March 24).
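The conversion described above is essentially a day-over-day difference. A minimal sketch, not the actual code behind the linked files; `cumulative_to_daily` and the totals are hypothetical:

```python
def cumulative_to_daily(cumulative):
    """Return day-over-day differences of a cumulative series.

    The result has one fewer entry than the input, because the first
    cumulative value has no previous day to compare against.
    """
    return [today - yesterday for yesterday, today in zip(cumulative, cumulative[1:])]


# Hypothetical cumulative totals for one county over five days.
daily = cumulative_to_daily([10, 15, 15, 12, 20])
```

Zeros appear when a total is not updated, and negative values appear when a total is revised downward, for example when cases leave an "unassigned" bucket.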

The download link for June is https://mappingsupport.com/p2/disaster/coronavirus/daily_covid_cases/2020_06.csv To download other months, just change the number of the month.

The code runs each night (12:10 a.m. Pacific time) and updates the monthly master file with the number of new cases for the prior day. Anyone is welcome to use these CSV files for any non-commercial purpose.

Look at the Michigan rows for June 6th through 9th. Note all the zeros, the "unassigned" line and all the weird counts for June 9th.

There was a related 'issue' from a few days ago where someone from Hopkins explained that Michigan changed the webpage where the state reported data and that broke the code Hopkins was using to scrape the data.

troymartinhughes commented 4 years ago

I believe much of the confusion in this thread results from a few incorrect assumptions (about JHU data and what they are intended to represent) rather than the alleged "tainted data" and so-called destroyed data integrity. JHU could alleviate much of this confusion by providing more accurate and in-depth descriptions of its data, including an accurate data dictionary (which is lacking).

My observations of this thread include the following:

1. JHU "cases" reflect the date they are reported, NOT the "illness date" or date for onset of symptoms; similarly deaths are reflected when they are reported, not when they occur, so deaths (as reported by JHU) can lag weeks behind when a patient actually expired.

Thus, someone seeing a spike in cases or deaths (in JHU data) such as Michigan should take this at face value that these represent only new case REPORTS, and should not infer or misconstrue that they are recent infections. For example, the following assertion is incorrect because these new Michigan cases were never intended to represent newly infected persons: "I opened this ticket because we know that there aren't 5800 new cases on June 6th that didn't exist on the fifth."

CDC describes some of these data challenges: "These data represent the most accurate death counts. However, because it can take several weeks for death certificates to be submitted and processed, there is on average a delay of 1–2 weeks before they are reported. Therefore, the provisional death counts may not include all deaths that occurred during a given time period, especially for more recent periods. Death counts from earlier weeks are continually revised and may increase or decrease as new and updated death certificate data are received." (https://www.cdc.gov/coronavirus/2019-ncov/covid-data/faq-surveillance.html#Understanding-the-Data)

Finally, some local jurisdictions do not even report cases or deaths on certain days (e.g., Sundays), or report substantially fewer, given that staffing levels are reduced. This demonstrates why it is critical to rely on data smoothing techniques (such as 7-day moving averages) rather than raw case numbers for nearly all analyses.
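As a minimal sketch of the smoothing mentioned above, here is a trailing 7-day mean over a daily series. `trailing_mean` and the sample counts are hypothetical; days without a full window are left as `None` rather than averaged over fewer days.

```python
def trailing_mean(daily, window=7):
    """Return the trailing moving average; None until a full window exists."""
    smoothed = []
    for i in range(len(daily)):
        if i + 1 < window:
            smoothed.append(None)
        else:
            smoothed.append(sum(daily[i + 1 - window : i + 1]) / window)
    return smoothed


# Hypothetical counts: nothing reported on one day, a catch-up the next.
smoothed = trailing_mean([7, 7, 7, 7, 7, 7, 0, 14])
```

Smoothing evens out weekly reporting gaps, but a large one-day backlog still raises every window that contains it.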

2. JHU updates the time series data but NOT its daily reports, as JHU clearly states "The daily reports will not be adjusted in these instances to maintain a record of raw data." (https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series)

Part of the issue here is the number of cases that are unassigned or incorrectly assigned to a specific county at the point of diagnosis, in which case the state or local jurisdiction must decide either to delay the reporting (which would be unacceptable) or to report it incorrectly initially (and hopefully change it later). In many cases, however, these "unknown" cases linger for weeks, and may never be assigned to a specific county; this is a problem with the reporting jurisdiction, not JHU.

The NYTimes, which maintains its own GitHub repository that attempts to clean raw JHU data, describes some of these issues: "Many state health departments choose to report cases separately when the patient’s county of residence is unknown or pending determination. In these instances, we record the county name as “Unknown.” As more information about these cases becomes available, the cumulative number of cases in “Unknown” counties may fluctuate." (https://github.com/nytimes/covid-19-data). The NYTimes describes further anecdotal complications: "When a resident of Florida died in Los Angeles, we recorded her death as having occurred in California rather than Florida, though officials in Florida counted her case in their own records. And when officials in some states reported new cases without immediately identifying where the patients were being treated, we attempted to add information about their locations later, once it became available."

Think of it this way: JHU is not monitoring individuals with COVID-19, but rather the daily cumulative number of COVID-19 cases and deaths per county or other region. JHU would not have the ability to uniquely identify persons and then correct those cases when local or state jurisdictions retroactively decide that a COVID-19 case (or death) belongs to a different county or even state.

A far greater concern than states like Michigan that are providing more accurate data (proactively, going forward), are the states such as Rhode Island and Georgia, in which a high percentage of their cases and deaths cannot be attributed to a specific county and are thus "Unassigned" or "Out of State."

3. Statistical programs will NOT throw out these data as outliers, because the high cumulative case counts are recurring and confirmed through successive high counts.

An outlier in cases could result if the cumulative number of Michigan cases spiked for a day, then the cumulative number decreased on the following day; however, this is not what happened here (although it has happened previously in JHU data). The Michigan case totals did increase greatly on a single day, but these high totals were reflected on subsequent days, indicating that they are reliable data, not outliers.
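The distinction drawn above can be checked mechanically: a cumulative total should never go down, so a one-day blip shows up as a decrease on the following day, while a lasting jump does not. A minimal sketch with hypothetical numbers and a hypothetical `find_reversals` helper:

```python
def find_reversals(cumulative):
    """Return indexes where the cumulative total is lower than the day before."""
    return [
        i
        for i in range(1, len(cumulative))
        if cumulative[i] < cumulative[i - 1]
    ]


# Hypothetical series: a jump that persists versus one that is reversed.
persistent_jump = [100, 110, 700, 710, 720]
one_day_blip = [100, 110, 700, 120, 130]

reversals_after_jump = find_reversals(persistent_jump)
reversals_after_blip = find_reversals(one_day_blip)
```

This check only looks at the cumulative series. In the daily differences, a persistent jump still appears as a single very large value.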

4. Spikes in reporting do NOT invalidate interstate analysis and comparison.

Interstate comparisons should be made with caution, as numerous states are not adhering strictly to CDC COVID-19 reporting guidelines, and as numerous states have changed how they report data over the course of the pandemic.

JHU describes some of these issues (including Michigan, at the heart of this discussion), stating: "June 11, Michigan, US. Michigan started to report probable cases and probable deaths on June 5. (Source) We combined the probable cases into the confirmed cases, and the probable deaths into the deaths. As a consequence, a spike with 5.5k+ cases is shown in our daily cases bar chart." (https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data)

JHU omits, however, other VERY significant changes in state or local reporting. For example, the NYTimes reports "On April 24, Massachusetts reported the results of a large number of backlogged tests performed by Quest Diagnostics dating back to April 13, leading to a large one day jump in the number of total cases." (https://github.com/nytimes/covid-19-data). This change in how Massachusetts reported COVID-19 data similarly caused a massive spike in JHU data.

I hope this provided some insight into assumptions and intent of the JHU data. I also hope, like everyone else, that it spurs JHU to better describe these data (within this GitHub repository) to help eliminate ambiguity about the interpretation and use of these data.

CSSEGISandData commented 4 years ago

All,

As described in #2704, through a collaboration with Michigan's Department of Health and Human Services we have been able to distribute the probable cases to the appropriate dates. The ability to redistribute this data is atypical and will likely not be the norm. As @troymartinhughes described above, generally only public reporting dates are available. In this case, a revision was made possible by the extraordinary efforts of the good people of MDHHS's Communicable Disease Division. We are grateful for their help. The US and global time series have been updated and the changes are reflected in the master branch.

Thank you