beoutbreakprepared / nCoV2019

Location for summaries and analysis of data related to n-CoV 2019, first reported in Wuhan, China
MIT License

Is latestdata.csv incomplete/outdated? #44

Closed edend10 closed 1 year ago

edend10 commented 4 years ago

First of all, great work and thank you for providing this data!

I'm doing a simple aggregation on province and city (using Python and pandas) and noticing that the number isn't right for NYC.

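import pandas as pd

# Load only the two columns needed for the aggregation
df = pd.read_csv('latestdata.csv', usecols=['province', 'city'])
# size() counts rows per group; count() has no remaining columns to count here
counts = df.groupby(['province', 'city']).size().reset_index(name='cases')
counts[counts['city'] == 'New York City']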

Aggregated count in NYC comes down to 2469, whereas today cases are reported to be 20K+

On https://www.healthmap.org/covid-19/, which references your data, they show 17K, which is closer to the reported numbers (at the time of writing, latestdata.csv was last updated 16 hours ago, so that gap makes sense). I don't know, though, whether their website is augmented by another data source.

Also, latestdata.csv has a total of 117K rows, whereas reported cases worldwide per healthmap.org are 500K+.

Is something wrong with the way I'm looking at the data or could it be incomplete?
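One thing worth ruling out before concluding that rows are missing: `groupby` silently drops rows whose key columns are empty, so cases with no `city` or `province` recorded never show up in the aggregate. The sketch below uses a small made-up sample (not real data) with the `province` and `city` columns from the snippet above. `summarize_city` is a hypothetical helper, and `dropna=False` needs pandas 1.1 or later.

```python
import pandas as pd

# Made-up sample rows standing in for latestdata.csv (not real data).
sample = pd.DataFrame({
    "province": ["New York", "New York", "New York", "New York", None],
    "city": ["New York City", "New York City", None, "Queens", None],
})

def summarize_city(df: pd.DataFrame) -> pd.DataFrame:
    """Count rows per (province, city), keeping rows with missing keys."""
    return (
        df.groupby(["province", "city"], dropna=False)
          .size()
          .reset_index(name="rows")
    )

counts = summarize_city(sample)

# Rows that a plain groupby(['province', 'city']) would have dropped.
missing_keys = int(sample[["province", "city"]].isna().any(axis=1).sum())

print(counts)
print("rows with a missing province or city:", missing_keys)
```

If the number of rows with missing keys is large in the real file, part of the gap comes from how the rows are labeled rather than from rows being absent.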

giustom01 commented 4 years ago

There is something wrong with the dataset. The entire dataset has 117k records, but there are 550k cases. If you filter on country = 'United States', it has only 11,364 records, while there are 86,242 cases in the USA so far. I hope this can be fixed; I really need the USA stats for my company. We're using this data to determine which offices to close, as we are an essential business.

beoutbreakprepared commented 4 years ago

Thanks for flagging - we're looking into this. There is one processing step that might have led to some cases not being present when using those filters. I will caution, however, that this dataset is intended to be as comprehensive as is feasible for our team, with as much specific metadata as is available, rather than to be aligned one-to-one with the total global cases. I would suggest looking at the Johns Hopkins GitHub repository https://github.com/CSSEGISandData/COVID-19 for a system that is engineered to track cumulative counts, without the metadata of age, sex, outbreak time milestones, or more specific geography. We will retain as much as is feasible for us to do so.
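For readers who only need cumulative totals, a rough sketch of reading them from a wide time-series table follows. It assumes the layout used by the time-series CSVs in the Johns Hopkins repository (one row per region, the columns `Province/State`, `Country/Region`, `Lat`, `Long`, then one column per date holding the cumulative count). That layout is an assumption here, not something stated in this thread. The sample rows are made up, and `latest_totals` is a hypothetical helper.

```python
import pandas as pd

# Made-up rows in the assumed wide layout (not real counts).
sample = pd.DataFrame({
    "Province/State": [None, "Hubei", "Beijing"],
    "Country/Region": ["US", "China", "China"],
    "Lat": [0.0, 0.0, 0.0],
    "Long": [0.0, 0.0, 0.0],
    "3/25/20": [100, 50, 5],
    "3/26/20": [150, 52, 6],
})

ID_COLUMNS = ("Province/State", "Country/Region", "Lat", "Long")

def latest_totals(wide: pd.DataFrame) -> pd.Series:
    """Sum the most recent cumulative column per country."""
    date_cols = [c for c in wide.columns if c not in ID_COLUMNS]
    latest = date_cols[-1]  # assumes date columns are in chronological order
    return wide.groupby("Country/Region")[latest].sum()

totals = latest_totals(sample)
print(totals)
```

Because each date column is already cumulative, only the last column is summed; adding up every column would count the same cases repeatedly.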

giustom01 commented 4 years ago

@beoutbreakprepared I'm fine with the entire dataset; we have global offices as well and are just concentrating on the USA for now. That said, the entire dataset has only 117k records; it should have well over 550k records and should be 600k+ after today.

dkori commented 4 years ago

If it helps diagnose the issue, the state of NY in the United States seems to have no cases confirmed for certain dates, while on other sites their reporting of cases seems to be pretty consistent day-to-day.

image
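A quick way to list such gaps is to compare the dates present for a province against a full daily range. This is a sketch under assumptions: the column name `date_confirmation`, its `DD.MM.YYYY` format, and the helper `dates_without_rows` are not confirmed anywhere in this thread, and the sample rows are made up.

```python
import pandas as pd

# Made-up sample rows (not real data); column name and date format are assumed.
sample = pd.DataFrame({
    "province": ["New York"] * 4 + ["Washington"],
    "date_confirmation": [
        "01.03.2020", "02.03.2020", "02.03.2020", "05.03.2020", "03.03.2020",
    ],
})

def dates_without_rows(df, province, date_col="date_confirmation"):
    """Return the calendar dates between the first and last confirmation
    for `province` that have no rows at all."""
    rows = df[df["province"] == province]
    # Unparseable or missing dates become NaT and are dropped.
    dates = pd.to_datetime(
        rows[date_col], format="%d.%m.%Y", errors="coerce"
    ).dropna()
    if dates.empty:
        return []
    present = set(dates.dt.date)
    full_range = pd.date_range(dates.min(), dates.max(), freq="D")
    return [d.date() for d in full_range if d.date() not in present]

gaps = dates_without_rows(sample, "New York")
print([d.isoformat() for d in gaps])
```

Dates that parse to nothing are dropped rather than counted, so a large number of malformed date strings in the real file would show up here as apparent gaps.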

rjf777 commented 4 years ago

Is it just my copy of 'latestdata.csv' that has only 436,549 records out of the 2,479,498 cases? Is there another section? ... The latest copy is just a fifth of the cases.

tnjcook commented 4 years ago

Is there any more data beyond June 16?

attwad commented 4 years ago

@calremmel is looking into why the data isn't up to date.

calremmel commented 4 years ago

Hi all, update on this: the line list data source that feeds latestdata.csv is not currently being updated, so the current file is as up to date as it is going to be until that changes.

My understanding is that we'll be migrating to some other sources soon. I don't know what that will mean for accessing this particular file where it is currently, but for the time being, the most recent lines are from mid-June.

tnjcook commented 4 years ago

Bummer - thanks for the update!



In reply to calremmel's message of Friday, July 31, 2020, 5:22:30 PM.