covidatlas / li

Next-generation serverless crawler for COVID-19 data
Apache License 2.0
scraped NY Times data flawed #447

Closed: jzohrab closed this issue 4 years ago

jzohrab commented 4 years ago

Original issue https://github.com/covidatlas/coronadatascraper/issues/978, transferred here on Thursday May 07, 2020 at 14:43 GMT


US county data differ from those in the New York Times source file (https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv).

E.g. Providence County, Rhode Island:

2020-04-29 - 3431 cases in your data
2020-04-29 - 5967 cases in NY Times data

I don't know if other counties are affected as well.
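A minimal sketch of how the comparison above could be reproduced, assuming the NYT file keeps its `date,county,state,fips,cases,deaths` header. The helper names `nyt_cases` and `discrepancy` are hypothetical and not part of li. The sample row carries only the case figure quoted in this issue, with the deaths column left blank because the thread does not report it.

```python
import csv
import io

# Source file named in this issue; the sketch itself does not download it.
NYT_URL = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"


def nyt_cases(csv_text, county, state, date):
    """Return the case count for one county on one date, or None if absent."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if (row["county"], row["state"], row["date"]) == (county, state, date):
            return int(row["cases"])
    return None


def discrepancy(ours, theirs):
    """Positive when the NYT count is higher than the scraped count."""
    return theirs - ours


# One-row stand-in for the real file, using the number reported above.
sample = (
    "date,county,state,fips,cases,deaths\n"
    "2020-04-29,Providence,Rhode Island,44007,5967,\n"
)

theirs = nyt_cases(sample, "Providence", "Rhode Island", "2020-04-29")
print(discrepancy(3431, theirs))  # 2536
```

To run the same check against the live file, the CSV text could be fetched first, for example with `urllib.request.urlopen(NYT_URL).read().decode("utf-8")`, and passed in place of `sample`.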

jzohrab commented 4 years ago

(Transferred comment)

Thanks for the issue!

We scrape multiple sources and cross check them. It’s possible that another source took precedence over the NYT one.

Is this still occurring? Cheers! Jz
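To illustrate what "another source took precedence" could mean, here is a rough sketch, not li's actual logic. The `Report` type, the numeric priorities, the source labels, and the 5% tolerance are all assumptions made for this example. The thread does not say how sources are ranked or which source produced the 3431 figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    source: str    # hypothetical source label
    priority: int  # higher wins; the ranking here is an assumption
    cases: int


def pick_by_priority(reports):
    """Return the report from the highest-priority source."""
    return max(reports, key=lambda r: r.priority)


def disagreeing_sources(reports, tolerance=0.05):
    """List sources whose count differs from the chosen one by more than tolerance."""
    chosen = pick_by_priority(reports)
    limit = tolerance * max(chosen.cases, 1)
    return [r.source for r in reports if abs(r.cases - chosen.cases) > limit]


reports = [
    Report(source="other-source", priority=2, cases=3431),  # assumed to outrank NYT
    Report(source="nyt", priority=1, cases=5967),
]

print(pick_by_priority(reports).cases)  # 3431
print(disagreeing_sources(reports))     # ['nyt']
```

Flagging the disagreement instead of silently dropping the lower-priority value would make a gap like the Providence one visible in the output.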

On Thu, May 7, 2020 at 10:44 AM, hannahklauber <notifications@github.com> wrote:

US county data differ from those in the New York Times source file (https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv).

E.g. Providence County, Rhode Island:

2020-04-29 - 3431 cases in your data
2020-04-29 - 5967 cases in NY Times data

I don't know if other counties are affected as well.


jzohrab commented 4 years ago

(Transferred comment)

Thank you for building up this great database!

The issue is still occurring.

Best, Hannah

jzohrab commented 4 years ago

(Transferred comment)

Is it possible that RI is reporting current cases, not cumulative ones? Their own website very clearly says 3,913 for Providence... which is less than yesterday's count, wtf? https://ri-department-of-health-covid-19-data-rihealth.hub.arcgis.com/
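A quick sanity check for the "current vs. cumulative" question: a cumulative series should never go down from one day to the next. This is only a sketch. The function name `find_drops` is hypothetical, and the numbers in the example are made-up placeholders, not RI data.

```python
def find_drops(series):
    """Return (date, previous, current) for each day the count decreased.

    `series` is a list of (date, count) pairs in chronological order.
    An empty result is what a truly cumulative series should produce.
    """
    drops = []
    for (_, prev), (date, curr) in zip(series, series[1:]):
        if curr < prev:
            drops.append((date, prev, curr))
    return drops


# Placeholder values for illustration only.
example = [("day-1", 100), ("day-2", 120), ("day-3", 110)]
print(find_drops(example))  # [('day-3', 120, 110)]
```

A drop does not prove the series is non-cumulative, since sources also revise counts downward, but it marks the dates worth asking about.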

jzohrab commented 4 years ago

(Transferred comment)

Reached out to RI; they said:

Good morning,

Thank you for reaching out. The data is updated every day and cumulative.

Best,

Isabella
COVID-19 Joint Information Center

With that, it does seem that NYT and JHU are wrong, or are counting data differently somehow... Maybe it has to do with RI reporting at a city level for some places?

jzohrab commented 4 years ago

(Transferred comment)

This is a common NYT problem ... they're counting higher for other locations too. Perhaps we shouldn't use them.