Open rahmatiangit opened 4 years ago
Yes, the total confirmed reported for NY state in the last file on 04/22 was 263,513, and on 04/23 it was 263,460. Given that 04/22 had around 5000 new confirmed cases, and there were between 4000 and 5000 reported cases on each of the previous 3 days, 04/23 showing a change of -53 cases means the data entered for NY on 04/23 was likely incorrect.
For Onondaga county, there was a decrease of 124, which can greatly bias statistical results.
That doesn't appear to make up the difference. The total confirmed cases for the state of NY look like this (new confirmed cases per day) ...
04/19: 5847
04/20: 4774
04/21: 5293
04/22: 5029
04/23: -53
It is statistically unlikely that NY went from a daily increase of roughly 4000-5000 cases to -53 cases on the 23rd. Cumulative confirmed cases should never go down; it's a growing statistic unless there are errors, and an error of 5082 wipes out an entire day's growth.
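The arithmetic above can be sketched in a few lines of Python. This is only an illustration using the two state totals and the 04/22 increase quoted in this thread; the function names `daily_new` and `negative_days` are made up for the example and are not part of any tool mentioned here.

```python
def daily_new(cumulative):
    """Return day-over-day differences of a cumulative series."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def negative_days(dates, cumulative):
    """Return (date, change) pairs where the cumulative total went down."""
    changes = daily_new(cumulative)
    # changes[i] belongs to dates[i + 1], the later day of each pair
    return [(d, c) for d, c in zip(dates[1:], changes) if c < 0]


# NY state totals quoted above for the last file of each day
dates = ["04/22", "04/23"]
totals = [263_513, 263_460]

print(negative_days(dates, totals))  # [('04/23', -53)]

# Gap between the 04/22 increase (5029) and the reported 04/23 change
print(5029 - daily_new(totals)[0])   # 5082
```

Any date that shows up in `negative_days` is a candidate for a data-entry or rollback problem rather than a real change.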
I checked and the New York City entry was 1442 less on the 23rd than on the 22nd. A total of 17 lines for New York were lower on the 23rd than reported on the 22nd.
I'm looking at the raw data in the files and seeing the decreases in NY by a large count. I'm using the final posted update from each day 22nd and 23rd from the repository history to do the comparison.
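For anyone who wants to repeat this comparison, here is a rough Python sketch of diffing two daily snapshots. It assumes each daily report is a CSV with a `Combined_Key` column identifying the row and a `Confirmed` column holding the cumulative count; those column names, the file names in the comment, and the helper names are assumptions for illustration, so adjust them to whatever the files actually contain.

```python
import csv


def load_confirmed(path, key="Combined_Key", value="Confirmed"):
    """Read one daily report CSV into {row key: confirmed count}.

    The column names are assumptions; pass different ones if needed.
    """
    with open(path, newline="", encoding="utf-8-sig") as f:
        return {row[key]: int(row[value] or 0) for row in csv.DictReader(f)}


def decreases(earlier, later):
    """Return {key: (earlier count, later count)} for rows that went down."""
    return {
        k: (earlier[k], later[k])
        for k in earlier.keys() & later.keys()
        if later[k] < earlier[k]
    }


# Toy values only, NOT real counts from the repository
day_1 = {"county_a": 100, "county_b": 50, "county_c": 7}
day_2 = {"county_a": 90, "county_b": 55, "county_c": 7}
print(decreases(day_1, day_2))  # {'county_a': (100, 90)}

# With real files (hypothetical paths):
# drops = decreases(load_confirmed("04-22-2020.csv"), load_confirmed("04-23-2020.csv"))
```

Filtering the resulting keys to New York rows would give the list of lines that were lower on the 23rd than on the 22nd.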
May be related to this ...
https://github.com/CSSEGISandData/COVID-19/issues/2361
Someone else noticed the 4/23 file has timestamps on updates from 4/22, so it could be older data from earlier in the day on the 22nd, like a rolling back of the values.
No. Onondaga county has never had 600 confirmed cases. It had 598 on April 17 and 624 on April 18.
The point I am making is that there is a larger data issue for the entire state, including Onondaga. It's one thing if a county reports a correction, but 5000+ is a bit large for an error, as it wipes out an entire day's values.
Hence I think there is a data issue with the automation, or there was a data issue with the source they were pulling from on that date.
And depending on when you check the counts between the 17th and 18th, Onondaga would likely have been at 600 at some point during those 2 days. It is a bigger difference when an entire day's increase gets wiped out for the entire state for a full 24 hour period.
It says that the data sources include 1Point3Acres (https://coronavirus.1point3acres.com/en). However, the data on 1Point3Acres is correct, and this mistake only appears for NY.
Onondaga county only updates its data once a day (around 4pm). So 600 is an impossible number, and I wonder where that number came from.
Does the https://coronavirus.1point3acres.com/en site let you see what the data was on 04/23 as provided by their data feeds? Might look right now, but could have been wrong when the automation pulled the data, then was corrected. Timing makes a difference.
If the data today is valid, and it's just the data on 04/23 that is wrong in this repository, then they likely won't fix it, and I will just have to find the first accurate data point after the correction to use for 04/23.
Ultimately the NY numbers in the github history for the 22nd vs the 23rd are off by nearly 4000-5000 cases for the state based on the last historical snapshot for each day (24 hour period). I just want to make sure that today's counts are not reflecting a larger issue.
598 vs 600 vs 624: you're talking about less than a 5% error rate on the data in a 24 hour period, which is not likely to affect analysis to a major degree.
There is definitely something off with the counts on those two dates in any event.
Based on what I observed in the JHU and the USAFacts COVID-19 data, it must be difficult to get 100% monotonically increasing data for cumulative cases across 3100+ counties in the US alone.
Volunteers can help clean up historical data. I ran the checking code in cell 4 here and found 31.9% of the county cumulative cases drop from one day to the next in the time series from 22nd of Jan. to 25th of Apr. (Between 22nd of Jan. to 18th of Apr, 25.4% of the "cumulative" case counts dipped from one day to the next.)
JHU CSSE Monotonicity Check (26th Apr, 2020)
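For readers without access to the notebook, a monotonicity check of this kind can be sketched as below. This is not the notebook's code; it is a minimal Python illustration that counts the share of series containing at least one day-to-day drop, which is only one possible reading of the percentages quoted above. The function names and the toy series are invented for the example.

```python
def has_drop(series):
    """True if any value is lower than the one immediately before it."""
    return any(later < earlier for earlier, later in zip(series, series[1:]))


def share_with_drops(series_by_county):
    """Fraction of counties whose cumulative series dips at least once."""
    if not series_by_county:
        return 0.0
    flagged = sum(has_drop(s) for s in series_by_county.values())
    return flagged / len(series_by_county)


# Toy cumulative series, NOT real county data
toy = {
    "county_a": [1, 2, 3, 5],
    "county_b": [5, 4, 6, 6],   # dips from 5 to 4
    "county_c": [0, 0, 1, 1],
    "county_d": [9, 9, 8, 10],  # dips from 9 to 8
}
print(share_with_drops(toy))  # 0.5
```

Repeated values (no new cases) are treated as valid here; only a strict decrease counts as a violation.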
I uploaded an API for people to do these sorts of checks and plot graphs in a few lines of code. It also works with data from USAFacts.org. Their data come from county health departments listed on their website. They partitioned their data strictly by county (as opposed to, say, lumping data from all five boroughs of NYC into Manhattan). They also seem to have population numbers that match those from the U.S. Census Bureau. I can't quite say that for the JHU data.
USAFacts.org Monotonicity Check (26th Apr, 2020)
Clearly, pointing out problems is much easier than fixing them. I don't have the know-how to clean up the data for modeling purposes, but I hope that there are volunteers here who can. I can at least provide the tools for them to analyze data from multiple sources. Let me know if you want to work on this.
FYI - so number of infections for New York State has not changed much from yesterday
NY_county | less/more infections on 4/23 than yesterday 4/22
-- | --
Albany | "+"
Allegany | "="
Broome | "+"
Cattaraugus | "="
Cayuga | "+"
Chautauqua | "-"
Chemung | "+"
Chenango | "+"
Clinton | "-"
Columbia | "+"
Cortland | "="
Delaware | "-"
Dutchess | "+"
Erie | "+"
Essex | "="
Franklin | "="
Fulton | "="
Genesee | "+"
Greene | "+"
Hamilton | "="
Herkimer | "+"
Jefferson | "+"
Lewis | "="
Livingston | "+"
Madison | "-"
Monroe | "+"
Montgomery | "+"
Nassau | "+"
New York | "-"
Niagara | "+"
Oneida | "+"
Onondaga | "-"
Ontario | "="
Orange | "-"
Orleans | "+"
Oswego | "-"
Otsego | "+"
Putnam | "-"
Rensselaer | "-"
Rockland | "+"
Saratoga | "-"
Schenectady | "-"
Schoharie | "+"
Schuyler | "-"
Seneca | "="
Saint Lawrence | "+"
Steuben | "-"
Suffolk | "+"
Sullivan | "-"
Tioga | "+"
Tompkins | "-"
Ulster | "-"
Warren | "+"
Washington | "+"
Wayne | "+"
Westchester | "+"
Wyoming | "-"
Yates | "="
Big thanks for collecting all the data (even with some errors here and there)