CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

Adjustments to Time Series Data #590

Open CSSEGISandData opened 4 years ago

CSSEGISandData commented 4 years ago

We have removed the US county-level values from March 10 to the present to avoid double counting against the US state-level data.

tmeacham commented 4 years ago

Is it safe to assume the county cases were rolled up to the state level for dates prior to 3/10, so that the time series doesn't graph as if hundreds of cases suddenly appeared on 3/10 on trend lines all over the internet?

Update: Turns out, not safe :). Looks like the US suddenly had 893 cases on 3/10.

tmeacham commented 4 years ago

Deleting the values will help with the duplication. In a perfect world, the county data would have been rolled up to the new state-level standard for dates prior to 3/10, and the county-level rows would be deleted from the dataset entirely, as they are no longer tracked that way. At least moving forward the US data should reflect accurate trends. Currently, however, a casual observer who doesn't know about this change in data collection could be forgiven for thinking the US went from 0 to over 800 cases in one day.

AndroidDev77 commented 4 years ago

Why didn't you just stop populating the county values? That way it wouldn't double count once you start entering data at the state level, but you would still retain the outbreak region history.

Itelina commented 4 years ago

Are you guys planning to repopulate the county-level values? I found those super helpful; they really drive home what this means for our local communities.

tmeacham commented 4 years ago

@Itelina see #382 They are no longer tracking at the county level. As such it is best to delete those rows entirely.

Itelina commented 4 years ago

@tmeacham TY!

Jacoble1 commented 4 years ago

> @Itelina see #382 They are no longer tracking at the county level. As such it is best to delete those rows entirely.

In specific states such as NY, WA, and CA, it would be wise to keep county-level data labels and groupings in the more populated regions (such as NYC's 5 counties and the surrounding suburban clusters in NYS like Rockland County, Nassau County, etc.), if that's doable. The fact that infections are increasing at a rate that makes county-level reporting difficult in certain areas is the very reason those areas need county-based data groups.

aatishb commented 4 years ago

Thank you! This is very helpful.

aatishb commented 4 years ago

FWIW, for my own analysis, I'm going back and looking for US entries with a comma in 'Province/State', and replacing those values with the appropriate state, in order to backfill the empty state level data for dates prior to March 10.
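
A rough Python sketch of that kind of backfill, in case it helps anyone else. It assumes the rows come from the time series CSV with the counts already converted to integers, that the country column is named 'Country/Region', and that the state-level rows hold 0 for the dates being backfilled. The `to_state` argument is a hypothetical helper you supply yourself to turn a value like "King County, WA" into "Washington".

```python
def backfill_state_rows(rows, date_columns, to_state):
    """Add county-style counts into the matching US state-level rows.

    rows: list of dicts (e.g. from csv.DictReader) with integer counts.
    date_columns: the date columns to backfill (dates before the switch).
    to_state: hypothetical caller-supplied function mapping a value such as
        "King County, WA" to the state name used by the state-level rows.
    """
    # State-level US rows are the ones without a comma in 'Province/State'.
    state_rows = {
        row["Province/State"]: row
        for row in rows
        if row["Country/Region"] == "US" and "," not in row["Province/State"]
    }
    for row in rows:
        if row["Country/Region"] != "US" or "," not in row["Province/State"]:
            continue
        target = state_rows.get(to_state(row["Province/State"]))
        if target is None:
            continue  # no state-level row to backfill into
        for col in date_columns:
            # Assumes the state row is empty (0) for these dates, so adding
            # the county counts does not double count.
            target[col] += row[col]
    return list(state_rows.values())
```

Note that this updates the state-level dicts in place and returns only those rows; the county-style rows are left for the caller to drop.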

cscollett commented 4 years ago

@aatishb I extract the state abbreviation with the regex [A-Z]{2}$ and then use a State Abbreviation/Name lookup table (https://raw.githubusercontent.com/aruljohn/us-states/master/states.csv) to match the state name. DC needs an exception.
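
Roughly, in Python, that approach could look like the sketch below. The three-entry `STATE_NAMES` dict is only a hypothetical excerpt standing in for the full lookup table, and the DC branch assumes the value is written as "Washington, D.C.", which the two-letter pattern cannot match.

```python
import re

# Hypothetical excerpt of an abbreviation -> name table; the full table
# would be loaded from a lookup file such as the states.csv linked above.
STATE_NAMES = {
    "CA": "California",
    "NY": "New York",
    "WA": "Washington",
}

# Two capital letters at the very end of the field, e.g. "King County, WA".
STATE_ABBR = re.compile(r"[A-Z]{2}$")


def state_from_field(province_state):
    """Return the full state name for a county-style value, or None."""
    value = province_state.strip()
    # Assumed exception: DC written as "Washington, D.C." ends in a period,
    # so the two-letter pattern never matches it.
    if value.endswith("D.C."):
        return "District of Columbia"
    match = STATE_ABBR.search(value)
    if match is None:
        return None  # spelled-out state, cruise ship, etc.
    return STATE_NAMES.get(match.group(0))
```

Values that are already a spelled-out state name, or that do not end in a known abbreviation, come back as None so they can be handled separately.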

I just completed my multi-metro-area Jupyter notebook, which relies on the county data. I'm really bummed it's going away.

kamermans commented 4 years ago

If someone is looking for a way to handle this in Golang, here's how I'm doing it: https://gist.github.com/kamermans/397488317c75b23414100d7e1316e96f

grandave99 commented 4 years ago

It seems that the Confirmed count for Italy did not change (12462) between March 11 and March 12, but the data released by WHO (https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports) show a change.

badund commented 4 years ago

Deleting the county level values in a state like California where these figures are reported at the county level doesn't make sense.

How were those values collected before? Manually?

kendonB commented 4 years ago

Is there an alternative source anyone knows about for finer locations of cases in the USA?

kendonB commented 4 years ago

@CSSEGISandData are there any other examples of these sorts of changes in other countries? Specifically, are there any other countries for which the spatial unit of the data changes at a point in time?

dawenx commented 4 years ago

@kendonB We are committed to providing county-level statistics for the US. Check my profile or #7.

PCastleton commented 4 years ago

It would be better to drop the state-level data and let users roll up county-level data to the state or country level. Now we're just losing granularity of spread.

lesham commented 4 years ago

Are the 0's due to different aggregation, or is it possible that (some) geographical entities are reporting only new cases on the day, and not "total cases to date"? There seems to be some confusion in various parts of the dataset.

PaulIPS commented 4 years ago

It would be great to see the county data fixed for the US. Statewide data doesn't really show the local impact, especially in large states.

Has anyone found another data set with correct county data?

PySimpleGUI commented 4 years ago

CHOOSE 1 way to represent the USA data please.

Question - is the intention to list every county of every state if they have a case?

If so, the table will get large, since there are 3,000+ counties in the USA.

If not, what are the rules for listing a county versus showing it under a state total?

I'm struggling to parse the State field because, well, it's not just the state. Sometimes it's the state (spelled out), sometimes a county and a state abbreviation, and sometimes a city & state. It seems to be a "free form" field where anything goes.

If county level reporting is to be included, shouldn't it be another column or at least format the text following a rigid rule so it can be parsed?
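
To illustrate what a separate column could look like, here is a minimal Python sketch that splits the current free-form value on its last comma. The function name and the (locality, state) layout are made up for illustration and are not part of the dataset, and the state part is returned as written, so "WA" and "Washington" would still need to be reconciled.

```python
def split_location(province_state):
    """Split a free-form value into (locality, state) parts.

    "King County, WA" -> ("King County", "WA")
    "North Carolina"  -> ("", "North Carolina")
    """
    # Split on the last comma only, so a locality that itself contains
    # a comma stays in one piece.
    locality, sep, state = province_state.rpartition(",")
    if not sep:
        # No comma at all: treat the whole value as the state.
        return "", province_state.strip()
    return locality.strip(), state.strip()
```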

Why can't the state abbreviation always be used? Example - getting the total for "North Carolina" means looking for both "North Carolina" and "NC" as both are used.

This data is going to get more complex, and if it's this difficult to parse already, it will be unusable when the data grows 100-fold.

I'm using the data to create grids of graphs for easy comparison. Selecting countries and states is difficult if the data is not consistently formatted.

tomquisel commented 4 years ago

I found another source for US county-level data and started tracking it historically at the covid19-data repo. I hope it's a good substitute.

Jacoble1 commented 4 years ago

> I found another source for US county-level data and started tracking it historically at the covid19-data repo. I hope it's a good substitute.

It's definitely helpful! Except for one thing... New York City is broken into 5 boroughs, each of which is its own county:

Manhattan - New York County
Brooklyn - Kings County
Queens - Queens County
Staten Island - Richmond County
Bronx - Bronx County

An increase in Staten Island is significant for trends showing a correlation with Brooklyn or NJ (migration patterns), in the same way that comparing Bronx County with New York County could show whether the infection trend is truly localized or attributable to commuters. If that makes any sense coming from a non-expert like myself...

MaryELennon commented 4 years ago

Hey All! It looks like the last comment in here is from 6 days ago. In that window of time, the dashboard went from showing cases only at the state level back to showing county-level data. Why is this? Is this county-level information reliable and complete? I am trying to decide if this is something I can use, and in the context of the above chain I am becoming quite confused. I will note that the related Tableau dash website, with its data.world hub, is still showing state-level and not county-level information.