Open zbraniecki opened 4 years ago
And tested, I think, as well.
In the report.json for NY, I see:
country:"USA"
url:"https://covidtracking.com/api/states"
type:"json"
curators:Array[1]
aggregate:"state"
priority:-0.5
timeseries:false
headless:false
certValidation:true
state:"NY"
deaths:385
tested:122104
cases:37258
ssl:true
rating:0.49019607843137253
but in byLocation.json, I see:
"2020-3-26": {
"cases": 37258,
"growthFactor": 1.209243452013891
}
Same for Texas
They're also reported with aggregate: "county". I think they're deduped against counties because all those states have both a county and a state with that name.
@lazd sorry to poke you, but I think this is pretty severe and I'd like to make sure it doesn't escape your attention before the next update. Can you mark it with appropriate labels?
No worries @zbraniecki. As part of covidatlas/coronadatascraper#410, I will try to get tested data back via COVIDTracking. That said, I don't think we can get deaths unless we have it somewhere.
COVIDTracking has deaths for those states:
Will that also fix it?
And covidatlas/coronadatascraper#410 seems to be about combining data from two sources. My suspicion is that this bug is about two sources for two different things (county vs. state) ending up conflated as two sources of the same thing, and the county one wins.
Here's what's in report.json for "NY, USA":
"NY, USA": [
{
"country": "USA",
"url": "https://covidtracking.com/api/states",
"type": "json",
"curators": [
{
"name": "The COVID Tracking Project",
"url": "https://covidtracking.com/",
"twitter": "@COVID19Tracking",
"github": "COVID19Tracking"
}
],
"aggregate": "state",
"priority": -0.5,
"timeseries": false,
"headless": false,
"certValidation": true,
"state": "NY",
"deaths": 385,
"tested": 122104,
"cases": 37258,
"ssl": true,
"rating": 0.49019607843137253
},
{
"state": "NY",
"country": "USA",
"type": "table",
"aggregate": "county",
"timeseries": false,
"headless": false,
"certValidation": true,
"priority": 0,
"url": "https://coronavirus.health.ny.gov/county-county-breakdown-positive-cases",
"cases": 37258,
"ssl": true,
"rating": 0.3137254901960784
}
],
I think this is a different thing than covidatlas/coronadatascraper#410. Mainly, those counties should not be conflated with states, and "NY, USA" should be a state and collect state data.
No, those counties' cases are rolled up into state totals, which is exactly what we want to do. However, testing numbers aren't being reported on a per-county basis, so they're not getting rolled up.
So what we want to do is take our rolled up case numbers and take COVIDTracking's testing numbers, which is what covidatlas/coronadatascraper#410 is about.
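To make the roll-up concrete, here is a minimal Python sketch of the idea described above: sum whatever the counties report, then take fields that no county reports (such as tested) from a state-level source. The function name, the row shape, and the county split are made up for illustration; this is not the project's actual implementation.

```python
FIELDS = ("cases", "deaths", "tested")


def roll_up_state(county_rows, state_row):
    """Sum county-level fields into a state total, then fill fields that
    no county reports from a state-level source (hypothetical helper)."""
    total = {}
    for field in FIELDS:
        values = [row[field] for row in county_rows if row.get(field) is not None]
        if values:
            total[field] = sum(values)
    # Fields missing from every county come from the state-level source.
    for field in FIELDS:
        if field not in total and state_row.get(field) is not None:
            total[field] = state_row[field]
    return total


# The county split is invented; only the NY totals come from report.json.
counties = [
    {"county": "A", "cases": 30000},
    {"county": "B", "cases": 7258},
]
state = {"state": "NY", "cases": 37258, "deaths": 385, "tested": 122104}

print(roll_up_state(counties, state))
```

Note that this sketch sums a field as soon as any county reports it, so a field reported by only some counties would be undercounted; a real implementation would need to decide how to handle that.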
Just as another data point, I'm still seeing deaths == nan for all of NY state and city.
# omitted: load `df` from `timeseries.csv`, parse dates, drop lat/lon/url columns
>> usall = df[(df["country"] == "USA")]
>> usall[(usall.deaths.notnull()) & (usall.deaths>0)].state.unique()
array(['WA', 'CA', 'MA', 'GA', 'FL', 'NJ', 'OR', 'IL', 'PA', 'IA', 'NC',
'SC', 'IN', 'KY', 'NV', 'OH', 'WI', 'CT', 'HI', 'OK', 'UT', 'KS',
'LA', 'MO', 'VT', 'AR', 'ID', 'ME', 'MI', 'MS', 'NM', 'ND', 'SD',
'CO', nan, 'VA', 'DC', 'AL', 'PR', 'GU', 'AK', 'MN'], dtype=object)
(Personally I'm less interested in tested, except to the extent that it's caused by the same underlying issue... reports on number tested have been inconsistent across most aggregators; cases+deaths have been more reliable).
Also, thanks for putting this dataset together; I've been lurking for a while and am impressed with the work y'all're putting in. Unfortunately I just migrated the daily updates I send to friends and family on cases+deaths (in the states we live in) to use this timeseries data; bad timing I guess :)
Good luck with the fix, and thanks again!
@mvanmidd we don't have a source for deaths in NY on a per-county basis: https://coronavirus.health.ny.gov/county-county-breakdown-positive-cases
NYC only notes deaths for the entire city: https://www1.nyc.gov/assets/doh/downloads/pdf/imm/covid-19-daily-data-summary.pdf
We can pull deaths for NYC from the daily update PDF, but we're out of luck for the rest of the New York counties. Until we implement covidatlas/coronadatascraper#410, we won't be pulling deaths for NY state either, unfortunately.
Gotcha, thanks for the update. covidatlas/coronadatascraper#410 seems like a big one, good luck! y'all are going to have a fully featured generic/configurable ETL framework pretty soon :)
In all seriousness, I think the auditable data aggregation is the biggest strength of this project... there's plenty of frontend work going on elsewhere (e.g. the explosion of "babby's first plotly visualizations," including my own), and on the backend, lots of data sources that are either incomplete or opaque. Keep up the good work!
For US state/county data, how about NYT repo: https://github.com/nytimes/covid-19-data ?
@cristipp - good find. I haven't looked at the actual data, but the README is encouraging.
Rows: 1884
Columns: date, state, fips, cases, deaths
[
{date: '2020-03-01', state: 'New York', fips: '36', cases: '1', deaths: '0'},
{date: '2020-03-02', state: 'New York', fips: '36', cases: '1', deaths: '0'},
{date: '2020-03-03', state: 'New York', fips: '36', cases: '2', deaths: '0'},
{date: '2020-03-04', state: 'New York', fips: '36', cases: '11', deaths: '0'},
{date: '2020-03-05', state: 'New York', fips: '36', cases: '22', deaths: '0'},
{date: '2020-03-06', state: 'New York', fips: '36', cases: '44', deaths: '0'},
{date: '2020-03-07', state: 'New York', fips: '36', cases: '89', deaths: '0'},
{date: '2020-03-08', state: 'New York', fips: '36', cases: '106', deaths: '0'},
{date: '2020-03-09', state: 'New York', fips: '36', cases: '142', deaths: '0'},
...
]
And for county level:
[
{date: '2020-03-28', county: 'Albany', state: 'New York', fips: '36001', cases: '195', deaths: '1'},
{date: '2020-03-29', county: 'Albany', state: 'New York', fips: '36001', cases: '205', deaths: '1'},
{date: '2020-03-30', county: 'Albany', state: 'New York', fips: '36001', cases: '217', deaths: '1'},
{date: '2020-03-31', county: 'Albany', state: 'New York', fips: '36001', cases: '226', deaths: '1'},
{date: '2020-04-01', county: 'Albany', state: 'New York', fips: '36001', cases: '240', deaths: '2'},
{date: '2020-04-02', county: 'Albany', state: 'New York', fips: '36001', cases: '253', deaths: '2'},
{date: '2020-04-03', county: 'Albany', state: 'New York', fips: '36001', cases: '267', deaths: '4'},
{date: '2020-04-04', county: 'Albany', state: 'New York', fips: '36001', cases: '293', deaths: '6'},
{date: '2020-04-05', county: 'Albany', state: 'New York', fips: '36001', cases: '305', deaths: '8'},
{date: '2020-04-01', county: 'Allegany', state: 'New York', fips: '36003', cases: '10', deaths: '1'},
...
]
They also have NYC as a separate entry [empty fips]:
{date: '2020-03-14', county: 'New York City', state: 'New York', fips: '', cases: '269', deaths: '1'},
{date: '2020-03-15', county: 'New York City', state: 'New York', fips: '', cases: '330', deaths: '5'},
{date: '2020-03-16', county: 'New York City', state: 'New York', fips: '', cases: '464', deaths: '7'},
{date: '2020-03-17', county: 'New York City', state: 'New York', fips: '', cases: '645', deaths: '10'},
{date: '2020-03-18', county: 'New York City', state: 'New York', fips: '', cases: '1339', deaths: '20'},
{date: '2020-03-19', county: 'New York City', state: 'New York', fips: '', cases: '2468', deaths: '22'},
That looks encouraging, @lazd what do you think? @cristipp, do you feel you could take a shot at writing a scraper for this data?
I'd be happy to. Though I see it already appearing in the https://coronadatascraper.com/#crosscheck for many [all?] counties. Perhaps you don't have it for state-level data?
Oh, I found NY at state level too: https://coronadatascraper.com/#crosscheck:iso2:US-NY-iso1:US. It appears the scraper prefers the arcgis dataset for some reason.
FWIW, looks like most recent data + deaths + tested for iso2:US-NY is coming from https://covidtracking.com, see https://coronadatascraper.com/#crosscheck:iso2:US-NY-iso1:US.
source | cases | deaths | tested | recovered |
---|---|---|---|---|
https://health.data.ny.gov/api/views/xdss-u53e/rows.csv?accessType=DOWNLOAD | 130689 | - | 320811 | - |
https://covidtracking.com/api/states | 130689 | 4758 | 320811 | - |
https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv | 122911 | 3483 | - | - |
https://github.com/CSSEGISandData/COVID-19 | 131815 | 4389 | - | - |
I believe that @hyperknot has closed out this issue by upping the priority of the covidtracking scraper. @cristipp, what's your feeling?
Good to have the state level data fixed. We're still lacking county level data for NY fatalities [0].
The county level fatalities can be pulled from NYT [1] or USAFacts [2], with the quirk that NYT sums the 5 counties of NYC together. Also note these sources don't report 'tested', which CoronaDataScraper does.
We have covidatlas/coronadatascraper#876, which adds county-level fatalities data from NYT. Alas, that PR does not fill in data for NYC counties [Kings, Queens, New York, Bronx, Richmond, see https://en.wikipedia.org/wiki/Boroughs_of_New_York_City] because of the aforementioned quirk, which makes it less useful than I'd like.
There is a general question of whether CoronaDataScraper wants to fill in missing data from other meta-aggregators as a post-processing step, on a cell-by-cell basis. Think of an extra field 'fallback: true', which enables the fallback behavior on a source-by-source basis, in priority order. Then we could add nyt, usafacts and whatnot. Would that be something of interest?
[0] coronadatascraper: [
{key: 'US-NY', date: '2020-04-22', cases: 257216, tested: 669982, deaths: 15302},
{key: 'US-NY-Queens County', date: '2020-04-22', cases: 43713, tested: 88388, deaths: undefined},
{key: 'US-NY-Kings County', date: '2020-04-22', cases: 38481, tested: 81787, deaths: undefined},
{key: 'US-NY-Nassau County', date: '2020-04-22', cases: 31555, tested: 74571, deaths: undefined},
{key: 'US-NY-Bronx County', date: '2020-04-22', cases: 30868, tested: 65304, deaths: undefined},
{key: 'US-NY-Suffolk County', date: '2020-04-22', cases: 28854, tested: 71268, deaths: undefined},
{key: 'US-NY-Westchester County', date: '2020-04-22', cases: 25276, tested: 76564, deaths: undefined},
{key: 'US-NY-New York County', date: '2020-04-22', cases: 19025, tested: 49687, deaths: undefined},
{key: 'US-NY-Richmond County', date: '2020-04-22', cases: 10345, tested: 26289, deaths: undefined},
{key: 'US-NY-Rockland County', date: '2020-04-22', cases: 9699, tested: 23150, deaths: undefined},
... 53 more
]
[1] nyt: [
{key: 'US-NY', date: '2020-04-22', cases: 257246, deaths: 15302},
{key: 'US-NY-New York City', date: '2020-04-22', cases: 142442, deaths: 10614},
{key: 'US-NY-Nassau', date: '2020-04-22', cases: 31555, deaths: 1764},
{key: 'US-NY-Suffolk', date: '2020-04-22', cases: 28854, deaths: 959},
{key: 'US-NY-Westchester', date: '2020-04-22', cases: 25275, deaths: 932},
{key: 'US-NY-Rockland', date: '2020-04-22', cases: 9699, deaths: 309},
{key: 'US-NY-Orange', date: '2020-04-22', cases: 6705, deaths: 183},
{key: 'US-NY-Dutchess', date: '2020-04-22', cases: 2391, deaths: 57},
{key: 'US-NY-Erie', date: '2020-04-22', cases: 2233, deaths: 174},
{key: 'US-NY-Monroe', date: '2020-04-22', cases: 1112, deaths: 72},
... 50 more
]
[2] usafacts: [
{key: 'US-NY-Queens County', date: '2020-04-22', cases: 43713, deaths: 3432},
{key: 'US-NY-Kings County', date: '2020-04-22', cases: 38481, deaths: 3458},
{key: 'US-NY-Nassau County', date: '2020-04-22', cases: 31555, deaths: 1431},
{key: 'US-NY-Bronx County', date: '2020-04-22', cases: 31130, deaths: 2258},
{key: 'US-NY-Suffolk County', date: '2020-04-22', cases: 28854, deaths: 926},
{key: 'US-NY-Westchester County', date: '2020-04-22', cases: 25276, deaths: 838},
{key: 'US-NY-New York County', date: '2020-04-22', cases: 19025, deaths: 1337},
{key: 'US-NY-Richmond County', date: '2020-04-22', cases: 10405, deaths: 492},
{key: 'US-NY-Rockland County', date: '2020-04-22', cases: 9699, deaths: 334},
{key: 'US-NY-Orange County', date: '2020-04-22', cases: 6690, deaths: 185},
... 54 more
]
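As a rough illustration of the NYC quirk, the sketch below sums the five borough counties from the usafacts rows in [2] so they can be compared against NYT's single "New York City" row in [1]. The helper name and dict layout are assumptions for this example. Going the other way (splitting NYT's combined row back into boroughs) is not possible from these numbers alone.

```python
# The five counties that make up New York City.
NYC_COUNTIES = (
    "Queens County",
    "Kings County",
    "Bronx County",
    "New York County",
    "Richmond County",
)

# usafacts rows for 2020-04-22, copied from [2] above.
usafacts = {
    "Queens County": {"cases": 43713, "deaths": 3432},
    "Kings County": {"cases": 38481, "deaths": 3458},
    "Bronx County": {"cases": 31130, "deaths": 2258},
    "New York County": {"cases": 19025, "deaths": 1337},
    "Richmond County": {"cases": 10405, "deaths": 492},
}

# NYT's combined row for the same date, copied from [1] above.
nyt_nyc = {"cases": 142442, "deaths": 10614}


def combine_nyc(rows):
    """Sum the five borough counties into one NYC-wide row (hypothetical helper)."""
    return {
        field: sum(rows[county][field] for county in NYC_COUNTIES)
        for field in ("cases", "deaths")
    }


combined = combine_nyc(usafacts)
print(combined)  # {'cases': 142754, 'deaths': 10977}
print({field: combined[field] - nyt_nyc[field] for field in combined})
```

The usafacts sum does not match the NYT row exactly, so the two sources are only approximately comparable even after combining the boroughs.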
Hi @cristipp and @zbraniecki - getting back to this one after a long delay!
The reports from Li at https://covidatlas.com/data merge data sources by priority. If a lower-priority source supplies a data point that no higher-pri source has, that value is preserved, and we also give the final source selected for each data point (see timeseries-byLocation.json). I believe we're doing what you've suggested.
I believe this issue can be closed -- thoughts?
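For readers following along, the merge described above can be sketched roughly as follows. This is a minimal illustration, not Li's actual code: the function name, the record layout, the source labels, and the priority values are all assumptions, and it assumes a larger priority number wins.

```python
def merge_by_priority(sources):
    """Merge per-location records field by field (hypothetical helper).

    Sources are visited from highest to lowest priority. A field is taken
    from the first source that has a value for it, so a lower-priority
    source only fills gaps. Also records which source supplied each field.
    """
    merged, chosen = {}, {}
    for src in sorted(sources, key=lambda s: s["priority"], reverse=True):
        for field, value in src["data"].items():
            if value is not None and field not in merged:
                merged[field] = value
                chosen[field] = src["name"]
    return merged, chosen


# Queens County on 2020-04-22, using the numbers from [0] and [2] above.
# The source labels and priorities are invented for this example.
sources = [
    {"name": "state-health", "priority": 1,
     "data": {"cases": 43713, "tested": 88388, "deaths": None}},
    {"name": "usafacts", "priority": 0,
     "data": {"cases": 43713, "deaths": 3432}},
]

merged, chosen = merge_by_priority(sources)
print(merged)
print(chosen)
```

Here deaths comes from the lower-priority source because the higher-priority one has no value for it, while cases and tested stay with the higher-priority source.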
States with reported deaths that are not in today's data:
Compared to https://coronavirus.1point3acres.com/en