CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/
29.13k stars 18.43k forks

INCORRECT DATA FOR New York Fatalities #2257

Open lmirny opened 4 years ago

lmirny commented 4 years ago

The correct number of fatalities today is 12192 https://covid19tracker.health.ny.gov/views/NYS-COVID19-Tracker/NYSDOHCOVID-19Tracker-Fatalities?%3Aembed=yes&%3Atoolbar=no&%3Atabs=n

bvlaicu commented 4 years ago

The inflation is probably due to the new "Probable deaths" number added by NYC: https://www1.nyc.gov/site/doh/covid/covid-19-data.page

JChristensen commented 4 years ago

As anyone that understands measurement and data analysis knows, changing the definition mid-stream greatly reduces the value of the data for making informed decisions. We have all this fancy analysis but it's of little value if there are not good operational definitions for data collection that are held consistent. I certainly hope that Johns Hopkins is pushing with all their might to eliminate poor measurement decisions like this and maintain data quality. See also #2247

CalvinParis commented 4 years ago

Amen brother.

paolinic03 commented 4 years ago

@JChristensen I was saying the same thing yesterday. I am still trying to figure out the rationale for including "probable" cases in the death count, which now drastically increases the total. I understand that, unfortunately, we need to account for all who have passed, but in terms of a strategy, do we have one? Cuomo, talking yesterday, seemed to gloss over this, and I am wondering if we even have a solid plan at all.

paolinic03 commented 4 years ago

Now, given the new definition, I think we should stay consistent with the original definition but add another measure called "Possible deaths" for complete transparency and consistency.

  1. Confirmed deaths (tested positive and confirmed in hospital/ healthcare facilities)
  2. Possible deaths (never tested and died at home/ outside healthcare system)

Adding them together all of a sudden is a really poor decision.

JChristensen commented 4 years ago

Obviously the "true number" can vary wildly depending on the definition. To me, the actual definition is less important than it is to hold it strictly constant. Given that, I can perhaps believe that the reported data has some fairly constant proportional relationship to the "true number". At this point it is at least as important to understand trends as it is to know the true numbers. Without strictly enforced measurement definitions, this is impossible. Categorizing data like this is a very difficult task under the best conditions so I know this is a situation that requires an extraordinary amount of rigor. Any change to the measurement definition and the noise can easily overwhelm the signal.

paolinic03 commented 4 years ago

New York has been tracking the "possible cases" since March 11th and just decided to add them to the total count, so they have already been categorizing confirmed vs. possible. The problem is that not every state defines and tracks deaths this way, so when you have a time series including state and county data, you can't just add totals under NY's definition to totals under the original definition still used by other states. We either need a standardized approach for the data set or need to be able to categorize totals by definition. Otherwise, we are aggregating and making decisions on mixed definitions, which is obviously causing confusion in analysis and among media outlets.
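
One way to "categorize totals by definition" is to keep confirmed and probable counts in separate fields and make every aggregate state which definition it uses. The following is only a minimal sketch: the `DeathCount` class, the `aggregate` helper, and the "Example State" row are hypothetical, and the NYC numbers are the April 18 figures quoted further down in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeathCount:
    """Deaths for one region, kept separate by reporting definition."""
    region: str
    confirmed: int                  # deaths with a positive test
    probable: Optional[int] = None  # None means the region does not report this category

    def total(self, include_probable: bool) -> int:
        # Only add probable deaths when asked to and when the region reports them.
        if include_probable and self.probable is not None:
            return self.confirmed + self.probable
        return self.confirmed


def aggregate(counts, include_probable=False):
    """Sum all regions under one explicitly chosen definition."""
    return sum(c.total(include_probable) for c in counts)


counts = [
    # NYC figures as quoted in this thread (April 18 update).
    DeathCount("New York City", confirmed=8448, probable=4264),
    # Hypothetical region that reports confirmed deaths only.
    DeathCount("Example State", confirmed=1000),
]

confirmed_only = aggregate(counts)
with_probable = aggregate(counts, include_probable=True)
print(confirmed_only, with_probable)
```

Keeping `probable` optional makes it visible which regions contribute to a mixed total, instead of silently adding numbers collected under different definitions.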

paulavery1951 commented 4 years ago

I had been wondering what happened to the US daily deaths (calculated from the JHU cumulative data) which jumped from 2494 on 4/15 to 4591 on 4/16. Drilling down, I discovered that the entire effect comes from the changed reporting from the NYC area, which went up by ~2500 (NY state also moves up by ~2500). Without this change, US daily deaths for 4/16 is ~2000.

Here are the last 10 days of US deaths calculated from the data I downloaded from the JHU site last night. The 4/16 daily count is much larger than the rest.

name,date,cumulative,daily
US,2020/04/07,12722,1939
US,2020/04/08,14695,1973
US,2020/04/09,16478,1783
US,2020/04/10,18586,2108
US,2020/04/11,20463,1877
US,2020/04/12,22020,1557
US,2020/04/13,23529,1509
US,2020/04/14,25832,2303
US,2020/04/15,28326,2494
US,2020/04/16,32917,4591 <=====
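
For anyone who wants to reproduce the daily column, here is a minimal sketch that differences the cumulative values above and flags days that stand far above the preceding ones. The cumulative numbers are copied from the table; the spike rule (more than 1.5 times the median of all earlier days, with at least 3 days of history) is an arbitrary assumption for illustration, not anything JHU uses.

```python
import csv
import io
from statistics import median

# Cumulative US deaths, copied from the table above.
DATA = """name,date,cumulative
US,2020/04/07,12722
US,2020/04/08,14695
US,2020/04/09,16478
US,2020/04/10,18586
US,2020/04/11,20463
US,2020/04/12,22020
US,2020/04/13,23529
US,2020/04/14,25832
US,2020/04/15,28326
US,2020/04/16,32917
"""

rows = list(csv.DictReader(io.StringIO(DATA)))
dates = [r["date"] for r in rows]
cumulative = [int(r["cumulative"]) for r in rows]

# Daily deaths are the first differences of the cumulative series,
# so the first date in the table has no daily value here.
daily = {d: b - a for d, a, b in zip(dates[1:], cumulative, cumulative[1:])}

SPIKE_FACTOR = 1.5  # assumed threshold, chosen only for illustration
MIN_HISTORY = 3     # assumed minimum number of earlier days to compare against


def flag_spikes(daily_counts):
    """Return dates whose daily count exceeds SPIKE_FACTOR x the median of all earlier days."""
    days = list(daily_counts)
    values = list(daily_counts.values())
    flagged = []
    for i in range(MIN_HISTORY, len(days)):
        if values[i] > SPIKE_FACTOR * median(values[:i]):
            flagged.append(days[i])
    return flagged


flagged = flag_spikes(daily)
print(daily)
print("flagged:", flagged)
```

A check like this only says that a day is unusual relative to its recent history; it cannot tell a real surge from a reporting change.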

paolinic03 commented 4 years ago

Yes, we know, read above...

paulavery1951 commented 4 years ago

I did. I am commenting on its effect on the national death statistics.

Schiffasaurus commented 4 years ago

Is there any way to apply the deaths per day in NYC to the day they actually occurred? If all these "probable COVID deaths" occurred on April 16th, what about all the "probable deaths" from the prior dates? That would skyrocket the figures nationally and the "curve".

ghost commented 4 years ago

I also agree with this suggestion: it is important to report all cases (possible and actual), as that gives a fair picture of the actual mortality rate. If somebody is able to get in touch with JHU directly, this is one alternative that would work for everybody. As @paolinic03 said:

  1. Confirmed deaths (tested positive and confirmed in hospital/ healthcare facilities)
  2. Possible deaths (never tested and died at home/ outside healthcare system)

paolinic03 commented 4 years ago

> Is there any way to apply the deaths per day in NYC to the day they actually occurred? If all these "probable COVID deaths" occurred on April 16th, what about all the "probable deaths" from the prior dates? That would skyrocket the figures nationally and the "curve".

That’s the thing: they didn’t all occur on April 16th. They added all probable deaths together from March 11th on and appended the amount to the confirmed deaths on April 16th as one big number.
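
Until the probable deaths can be assigned to their actual dates, one stopgap for trend analysis is to subtract the one-off addition from the cumulative series from April 16th onward. A minimal sketch follows, using the US cumulative figures posted earlier in this thread. The lump size of 2500 is only the rough "~2500" estimate given above, not an official number, and both helper functions are hypothetical.

```python
from datetime import date


def remove_one_off_lump(cumulative, lump_date, lump_size):
    """Subtract a one-off addition from every cumulative value on or after lump_date."""
    return {d: (v - lump_size if d >= lump_date else v) for d, v in cumulative.items()}


def daily_from_cumulative(cumulative):
    """First differences of a date -> cumulative mapping."""
    days = sorted(cumulative)
    return {b: cumulative[b] - cumulative[a] for a, b in zip(days, days[1:])}


# US cumulative deaths as posted earlier in this thread.
us_cumulative = {
    date(2020, 4, 14): 25832,
    date(2020, 4, 15): 28326,
    date(2020, 4, 16): 32917,
}

ASSUMED_LUMP = 2500  # rough estimate from this thread, not an official figure

adjusted = remove_one_off_lump(us_cumulative, date(2020, 4, 16), ASSUMED_LUMP)
raw_daily = daily_from_cumulative(us_cumulative)
adj_daily = daily_from_cumulative(adjusted)

print("raw 4/16:", raw_daily[date(2020, 4, 16)])
print("adjusted 4/16:", adj_daily[date(2020, 4, 16)])
```

This keeps the series on the old definition; it does not answer the question of when the probable deaths actually occurred, which needs per-date data from the source.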

rbracco commented 4 years ago

This is a huge problem. Changing your methodology for part of your dataset when doing a time-series analysis is a very bad idea. It is exacerbated by the fact that there are hundreds of analysts downstream using this data to draw conclusions that will no longer be valid.

I have been trying to raise this issue with JHU with no success. Has anyone succeeded in reaching those working on the project?

CalvinParis commented 4 years ago

Looks like they are roughly double counting probable deaths, which appear to have always been included but not broken out. 8,448 confirmed + 4,264 probable = 12,712 vs 17,131, an overcount of 4,419

Cases: 126,368
Hospitalized*: 33,079
Confirmed deaths: 8,448
Probable deaths: 4,264
Updated: April 18, 2:00 p.m.

gbigliardi commented 4 years ago

This problem is blocking a lot of downstream applications. JHU, please resolve it: it is BLOCKING all downstream studies of the New York situation.

paolinic03 commented 4 years ago

Makes you wonder...

ghost commented 4 years ago

As was suggested before, why don't you guys stick with the NY Times data for the US? It is more accurate and doesn't 'yet' include probable deaths ...

rbracco commented 4 years ago

I for one am going to switch to an alternate API, but it is essential that this is fixed as many downstream sources are unaware. See the following images showing this error propagating into top newspapers across the US. Note that the two things reported below never actually happened!

Wall Street Journal: image

Washington Post: image

paolinic03 commented 4 years ago

Yep, sure does look like it is “improving” now right? Artificially create an apex so it looks like we are on the down swing.

paulavery1951 commented 4 years ago

Unfortunately, the NYT site is showing 0 deaths for 4/17 and 4/18. I'm guessing they must be investigating, but others here might know better.

On the other hand, the number of daily cases seems normal, i.e. somewhat spiky but nothing crazy.

CalvinParis commented 4 years ago

I'm pretty sure that they are double counting probable deaths, which appear to have always been included in the total, just not broken out. 8,811 confirmed + 4,429 probable = 13,240 vs 17,671, an overcount of 4,431

Cases: 129,788
Hospitalized*: 34,602
Confirmed deaths: 8,811
Probable deaths: 4,429
Updated: April 19, 1:30 p.m.
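
The double-counting arithmetic for both updates can be checked with a few lines. This is just a sketch of the comparison made in these comments: the confirmed and probable counts are the NYC figures quoted here, and the "compared total" is the larger number each comment compares against (its exact source is not stated in the comments).

```python
# (update label, confirmed, probable, total the comment compares against)
snapshots = [
    ("April 18", 8448, 4264, 17131),
    ("April 19", 8811, 4429, 17671),
]


def overcount(confirmed, probable, compared_total):
    """Difference between the compared total and confirmed + probable."""
    return compared_total - (confirmed + probable)


results = {label: overcount(c, p, t) for label, c, p, t in snapshots}

for label, c, p, t in snapshots:
    print(
        f"{label}: confirmed + probable = {c + p:,}, compared total = {t:,}, "
        f"overcount = {results[label]:,} (probable deaths: {p:,})"
    )
```

On both days the overcount is close to the probable-death count, which is what you would expect if probable deaths were added a second time.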

JChristensen commented 4 years ago

> I for one am going to switch to an alternate API...

@rbracco if you're aware of a data source that tracks confirmed and probable deaths separately, would you share please?

JChristensen commented 4 years ago

To be fair, the inclusion of "probable" deaths seems to be the result of a CDC recommendation. I for one have no idea what input JHU may have had on that, if any. I also don't know how data collection procedures are communicated, or whether the CDC recommendation included instructions to keep the statistics separate.

rbracco commented 4 years ago

@JChristensen I would recommend https://covidtracking.com/api but they don't list probable deaths. NYC lists the data separately as reported here (no api): https://www1.nyc.gov/assets/doh/downloads/pdf/imm/covid-19-deaths-confirmed-probable-daily-04192020.pdf It appears there is no great solution yet, but I'm sure there will be within a week.

The primary issue as I see it isn't that they started including probable deaths, but they appear to have done so erroneously. It looks like they labeled probable deaths from many past dates as having occurred on April 16th. This ruins any attempts at trend analysis. Also the numbers have been completely inaccurate since then.

image

New deaths per day in NYC (the source relies on JH CSSE) image

cpyic commented 4 years ago

Not sure whether anyone has noticed, but the time series file has many entries, across multiple states, marked as "unassigned" and "out of". The problem might be related to these entries, which belong neither to a regular FIPS code nor to a specific city. This could be a valid explanation if they have data from multiple sources, though I am not familiar with how those numbers would be compiled together.

If we can safely assume that the total numbers of cases come from summing the time series by state, then you can find the summed deaths for NY state as 17671 on 4/18.

For some reason, a death update was coded into FIPS=90036. If this interpretation is correct, then folks might benefit from using the time series file, since if they have captured all sources, that is where they should be. In the time series, the number seems to be correct if you add this "unassigned" FIPS=90036 entry to the NY state total. Hope this helps.
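
A minimal sketch of that check: sum one date column by state, with and without the "unassigned" entry. The inline sample is a tiny stand-in, not the real file. Its column names are assumptions modeled on the time series layout, and every number in it is a made-up placeholder; only the FIPS code 90036 comes from this thread.

```python
import csv
import io

# Stand-in for the time series file. All counts are hypothetical placeholders.
SAMPLE = """FIPS,Admin2,Province_State,4/18/20
36001,Albany,New York,10
36061,New York,New York,200
90036,Unassigned,New York,50
34003,Bergen,New Jersey,30
"""

UNASSIGNED_FIPS = {"90036"}  # the "unassigned" NY entry mentioned above


def state_total(rows, state, date_col, include_unassigned=True):
    """Sum one date column over all rows of a state, optionally skipping unassigned rows."""
    total = 0
    for row in rows:
        if row["Province_State"] != state:
            continue
        if not include_unassigned and row["FIPS"] in UNASSIGNED_FIPS:
            continue
        total += int(row[date_col])
    return total


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
with_unassigned = state_total(rows, "New York", "4/18/20")
without_unassigned = state_total(rows, "New York", "4/18/20", include_unassigned=False)
print(with_unassigned, without_unassigned)
```

Running the same comparison against the real file would show whether the gap between the two totals matches the discrepancy discussed in this thread.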

image image

rbracco commented 4 years ago

Good news, this particular error was fixed via the commit listed below. Thanks to all who helped track this down.

https://github.com/CSSEGISandData/COVID-19/commit/0fa65c7f62580a4e1f7ef32659d1257bc8924ad8

brad255 commented 4 years ago

For 5/15, JHU reports 27,878 while NYS DOH reports 22,478, so JHU is higher by 5,400. Any views on this difference?

cpyic commented 4 years ago

Could it be "probable deaths" from NYC reported separately? https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/ has reported 27k as well.

brad255 commented 4 years ago

The official DOH page doesn't include probable deaths, which would seem to be a possible explanation. https://covid19tracker.health.ny.gov/views/NYS-COVID19-Tracker/NYSDOHCOVID-19Tracker-Fatalities?%3Aembed=yes&%3Atabs=n&%3Atoolbar=no