How are people dealing with US data?

aatishb commented 4 years ago

As of March 10, the data contains US cases at both the state level AND the county level. This leads to double counting: if you sum all the US cases, you get a number roughly twice as high as the true number of cases.

See also #382, #472, #496, #559, #501, #541 and many more

The way I see it, for the US cases, we can either:

1. Focus on state data and ignore county based data

e.g., by filtering out commas in Province/State

2. Focus on county data and ignore state data

e.g., by filtering out state names in Province/State or filtering out Province/State without a comma

3. Do something else

e.g. combine these values in some useful way
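
A minimal sketch of options 1 and 2, for anyone who wants to try both. It assumes the CSV has `Province/State` and `Country/Region` columns and that county-level rows are exactly the ones with a comma in `Province/State` (e.g. "King County, WA"). The sample rows and the name `split_us_rows` are made up for illustration, not taken from the dataset.

```python
import csv
import io

# Made-up sample rows in the assumed column layout (not real data).
SAMPLE = '''Province/State,Country/Region,3/10/20
Washington,US,267
"King County, WA",US,190
"Snohomish County, WA",US,54
Hubei,China,67760
'''

def split_us_rows(csv_text):
    """Return (state_rows, county_rows) for US rows only.

    Assumption: county-level rows have a comma in Province/State,
    state-level rows do not.
    """
    states, counties = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Country/Region"] != "US":
            continue
        if "," in row["Province/State"]:
            counties.append(row)
        else:
            states.append(row)
    return states, counties

states, counties = split_us_rows(SAMPLE)
# Sum each level separately instead of summing every US row together.
state_total = sum(int(r["3/10/20"]) for r in states)
county_total = sum(int(r["3/10/20"]) for r in counties)
print(state_total, county_total)
```

Keeping the two totals separate makes it easy to compare them before deciding which level to trust.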

How are people dealing with this? Are the state and county levels providing the same total numbers, or is one source more reliable than the other? I'm curious if anyone has a workaround for this.

lukesneeringer commented 4 years ago

I am just waiting for them to fix it, and my US visualizations are wrong right now.

nguyandy commented 4 years ago

I find that I am constantly fixing and updating my API because this dataset is so unpredictable.

My dashboard has been getting more traction than I expected. What started as a fun project for me is now slowly becoming a full-time job in order to maintain accuracy and functionality.

I implemented a small fix for the US states/cities issue: I filter out all the values containing a comma (since I believe cities will no longer be maintained).

https://covid19.nguy.dev/

treerunner commented 4 years ago

I spent 2 hours hacking through a solution last night but I am not confident with the results. I am not excited about the prospect of hacking a solution for cleaning data on a daily basis. I wish they could fix it. I am hoping they simply fix it.

gtestgit commented 4 years ago

For US data, I extract the state codes (STATE_CD) from the Province/State field for data before 3/9, using a reference table in my database that maps STATE_CD to STATE_NM. For data from 3/10 onward, I just use the state names. Joining the two via STATE_NM gives back a time series with complete history.
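
The join described above could look roughly like this. It is only a sketch under assumptions: `STATE_NAMES` stands in for the STATE_CD/STATE_NM reference table (two entries shown), the function names are hypothetical, and the exact cutover date is left as a parameter because the comment does not say which side 3/9 itself falls on.

```python
import re

# Hypothetical stand-in for the STATE_CD -> STATE_NM reference table.
STATE_NAMES = {"WA": "Washington", "IL": "Illinois"}

def to_state_name(province_state):
    """Map 'King County, WA' or 'Washington' to a state name, else None."""
    match = re.search(r",\s*([A-Z]{2})\s*$", province_state)
    if match:
        return STATE_NAMES.get(match.group(1))
    if province_state in STATE_NAMES.values():
        return province_state
    return None

def state_series(rows, dates, cutover):
    """Build {state: {date: count}} with one continuous history per state.

    rows:    list of (province_state, {date: count}) pairs
    dates:   ordered list of date labels
    cutover: first date label for which state-name rows are used
    """
    cut = dates.index(cutover)
    series = {}
    for name, counts in rows:
        state = to_state_name(name)
        if state is None:
            continue
        is_county = "," in name
        per_state = series.setdefault(state, {d: 0 for d in dates})
        for i, d in enumerate(dates):
            # County rows feed dates before the cutover, state rows the rest.
            if is_county == (i < cut):
                per_state[d] += counts.get(d, 0)
    return series
```

Rows that cannot be mapped to a known state are skipped rather than guessed at, so the lookup table has to be complete for the totals to be right.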

AndroidDev77 commented 4 years ago

@treerunner This is better than having only one or the other. This at least gives us options. I am mapping all of the data, as it gives you a visual of outbreak locations and the state total. Then, for each state, I count only whichever is greater: the state's cases or the sum of the cases in that state's counties.
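
A rough sketch of that "take the greater of the two" rule, for one date at a time. The function name, the `county_to_state` mapping and the numbers in the usage note are illustrative assumptions, not the commenter's actual code.

```python
def best_state_counts(state_counts, county_counts, county_to_state):
    """Per state, return the larger of the state figure and the county sum.

    state_counts:    {state_name: cases}
    county_counts:   {county_label: cases}
    county_to_state: {county_label: state_name}  (assumed to be available)
    """
    county_sums = {}
    for county, cases in county_counts.items():
        state = county_to_state[county]
        county_sums[state] = county_sums.get(state, 0) + cases
    all_states = set(state_counts) | set(county_sums)
    return {
        s: max(state_counts.get(s, 0), county_sums.get(s, 0))
        for s in all_states
    }
```

For example, with made-up inputs `{"Washington": 267, "Illinois": 0}` for states and `{"King County, WA": 190, "Cook County, IL": 1}` for counties, this keeps 267 for Washington and 1 for Illinois.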

j-fu commented 4 years ago

@aatishb - I am currently using method #1 but I see that the timeseries is inconsistent. So I guess this needs to be fixed by the maintainers. Kudos to them anyway for making this available!

aatishb commented 4 years ago

Looks like #590 addresses the double counting issue.

star-ops commented 4 years ago

Utilizing open-source intelligence (OSINT) techniques for COVID-19 research: an all-hands-on-deck guide.

When doing OSINT we usually focus on targets such as people or businesses, but we can also use these same techniques for data collection on the virus.

Part 1. Intro to OSINT https://www.reddit.com/r/OSINT/comments/e78he1/osint_for_beginners_part_1_introduction/

Part 2. Tooling https://www.reddit.com/r/OSINT/comments/e7a4ke/part_2_tooling/

Part 3. Case management and methodology https://www.reddit.com/r/OSINT/comments/e9276y/osint_guide_part_3_case_management_and_methodology/

TASKS IN PREPARATION FOR THE COVID19 STUDY-A-THON

https://docs.google.com/document/d/1wD4qMy3jyNPXBOCEivkOqnXMteuWbF5yKenaZR6g57s/edit#

If you’re new to Coronavirus research, start here…

https://www.reddit.com/r/CoronavirusFOS/comments/f62xhx/if_youre_new_to_coronavirus_research_start_here/

SUMMARY OF SARS-CoV/SARS-CoV-2 AND COVID-19 FINDINGS

https://www.reddit.com/r/CoronavirusFOS/comments/fbmdhu/sarscov2_virus_characteristics_contains_sources/

My GitHub, full of COVID-19 data: https://github.com/star-ops?tab=repositories

My Mendeley profile, with DOI research links (you might have to make an account): https://www.mendeley.com/profiles/flynn-carsen/


dwstevens commented 4 years ago

I use some simple shell commands:

grep 'US' time_series_19-covid-Confirmed.csv | grep -v ", "

For now it works.

kamermans commented 4 years ago

If someone is looking for a way to handle this in Golang (including the new county omission), here's how I'm doing it: https://gist.github.com/kamermans/397488317c75b23414100d7e1316e96f

ns-cweber commented 4 years ago

It's not just double-counting issues. In at least some cases, the state count doesn't include individual county counts. For example, on 1/24, Cook County, IL has 1 confirmed case but Illinois has 0.
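
One way to find such cases systematically is to flag every state and date where the state figure is below the sum of its counties. This is a sketch only: the function name and the `county_to_state` mapping are assumptions, and apart from the 1/24 Cook County example above, the numbers in the usage note are made up.

```python
def find_undercounts(state_series, county_series, county_to_state):
    """Return sorted (state, date, state_count, county_sum) tuples
    for every date on which the state count is below its county sum.

    state_series:    {state_name: {date: cases}}
    county_series:   {county_label: {date: cases}}
    county_to_state: {county_label: state_name}  (assumed to be available)
    """
    sums = {}
    for county, by_date in county_series.items():
        state = county_to_state[county]
        for date, cases in by_date.items():
            key = (state, date)
            sums[key] = sums.get(key, 0) + cases
    problems = []
    for (state, date), county_sum in sums.items():
        # A missing state row or date is treated as a state count of 0.
        state_count = state_series.get(state, {}).get(date, 0)
        if state_count < county_sum:
            problems.append((state, date, state_count, county_sum))
    return sorted(problems)
```

With Illinois at 0 and Cook County, IL at 1 on 1/24, the function returns `[("Illinois", "1/24/20", 0, 1)]`.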

CSSEGISandData / COVID-19

How are people dealing with US data? #571