Closed jzohrab closed 4 years ago
(Transferred comment)
And `tested` I think as well.
In the `report.json` for NY I see:
country:"USA"
url:"https://covidtracking.com/api/states"
type:"json"
curators:Array[1]
aggregate:"state"
priority:-0.5
timeseries:false
headless:false
certValidation:true
state:"NY"
deaths:385
tested:122104
cases:37258
ssl:true
rating:0.49019607843137253
but in `byLocation.json`, I see:
"2020-3-26": {
"cases": 37258,
"growthFactor": 1.209243452013891
}
Same for Texas
(Transferred comment)
They're also reported with `aggregate: "county"`. I think they're deduped against counties because all those states have both a county and a state with that name.
(Transferred comment)
@lazd sorry to poke you, but I think this is pretty severe and I'd like to make sure it doesn't escape your attention before the next update. Can you mark it with appropriate labels?
(Transferred comment)
No worries @zbraniecki. As part of #410, I will try to get `tested` data back via COVIDTracking. That said, I don't think we can get `deaths` unless we have it somewhere.
(Transferred comment)
COVIDTracking has deaths for those states:
Will that also fix it?
(Transferred comment)
And #410 seems to be about combining data from two sources. My suspicion is that this bug is about two sources for two different things (county vs. state) ending up conflated as two sources of the same thing, and the county one wins.
Here's what's in report.json for "NY, USA":
"NY, USA": [
{
"country": "USA",
"url": "https://covidtracking.com/api/states",
"type": "json",
"curators": [
{
"name": "The COVID Tracking Project",
"url": "https://covidtracking.com/",
"twitter": "@COVID19Tracking",
"github": "COVID19Tracking"
}
],
"aggregate": "state",
"priority": -0.5,
"timeseries": false,
"headless": false,
"certValidation": true,
"state": "NY",
"deaths": 385,
"tested": 122104,
"cases": 37258,
"ssl": true,
"rating": 0.49019607843137253
},
{
"state": "NY",
"country": "USA",
"type": "table",
"aggregate": "county",
"timeseries": false,
"headless": false,
"certValidation": true,
"priority": 0,
"url": "https://coronavirus.health.ny.gov/county-county-breakdown-positive-cases",
"cases": 37258,
"ssl": true,
"rating": 0.3137254901960784
}
],
I think this is a different thing than #410. Mainly, those counties should not be conflated with states, and "NY, USA" should be a state and collect state data.
(Transferred comment)
No, those counties' cases are rolled up into state totals, which is exactly what we want to do. However, testing numbers aren't being reported on a per-county basis, so they're not getting rolled up.
So what we want to do is take our rolled up case numbers and take COVIDTracking's testing numbers, which is what #410 is about.
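For illustration, here is a minimal Python sketch of that kind of per-field merge, applied to the two "NY, USA" entries from `report.json` quoted earlier in this thread. `merge_entries` and `METRICS` are hypothetical names, and "higher `priority` wins per field" is an assumption about the intended behavior, not the scraper's actual code.

```python
# Sketch only: merge_entries is a hypothetical helper, not the project's code.
METRICS = ("cases", "deaths", "tested")


def merge_entries(entries):
    """Merge entries for one location, preferring higher-priority sources per field."""
    merged = {}
    # Highest priority first, so the first source that supplies a field wins.
    for entry in sorted(entries, key=lambda e: e["priority"], reverse=True):
        for metric in METRICS:
            if metric in entry and metric not in merged:
                merged[metric] = entry[metric]
    return merged


# The two "NY, USA" entries from report.json, trimmed to the relevant keys.
ny_entries = [
    {"aggregate": "state", "priority": -0.5,
     "deaths": 385, "tested": 122104, "cases": 37258},
    {"aggregate": "county", "priority": 0, "cases": 37258},
]

print(merge_entries(ny_entries))
# {'cases': 37258, 'deaths': 385, 'tested': 122104}
```

With this approach the county roll-up still supplies `cases`, while `deaths` and `tested` survive from the state-level entry instead of being dropped.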
(Transferred comment)
Just as another data point, I'm still seeing deaths == nan for all of NY state and city.
# omitted: load `df` from `timeseries.csv`, parse dates, drop lat/lon/url columns
>> usall = df[(df["country"] == "USA")]
>> usall[(usall.deaths.notnull()) & (usall.deaths>0)].state.unique()
array(['WA', 'CA', 'MA', 'GA', 'FL', 'NJ', 'OR', 'IL', 'PA', 'IA', 'NC',
'SC', 'IN', 'KY', 'NV', 'OH', 'WI', 'CT', 'HI', 'OK', 'UT', 'KS',
'LA', 'MO', 'VT', 'AR', 'ID', 'ME', 'MI', 'MS', 'NM', 'ND', 'SD',
'CO', nan, 'VA', 'DC', 'AL', 'PR', 'GU', 'AK', 'MN'], dtype=object)
(Personally I'm less interested in tested, except to the extent that it's caused by the same underlying issue... reports on number tested have been inconsistent across most aggregators; cases+deaths have been more reliable).
Also, thanks for putting this dataset together; I've been lurking for a while and am impressed with the work y'all're putting in. Unfortunately I just migrated the daily updates I send to friends and family on cases+deaths (in the states we live in) to use this timeseries data; bad timing I guess :)
Good luck with the fix, and thanks again!
(Transferred comment)
@mvanmidd we don't have a source for deaths in NY on a per-county basis: https://coronavirus.health.ny.gov/county-county-breakdown-positive-cases
NYC only notes deaths for the entire city: https://www1.nyc.gov/assets/doh/downloads/pdf/imm/covid-19-daily-data-summary.pdf
We can pull deaths for NYC from the daily update PDF, but we're out of luck for the rest of the New York counties. Until we implement #410, we won't be pulling deaths for NY state either, unfortunately.
(Transferred comment)
Gotcha, thanks for the update. #410 seems like a big one, good luck! y'all are going to have a fully featured generic/configurable ETL framework pretty soon :)
In all seriousness, I think the auditable data aggregation is the biggest strength of this project... there's plenty of frontend work going on elsewhere (e.g. the explosion of "babby's first plotly visualizations," including my own), and on the backend, lots of data sources that are either incomplete or opaque. Keep up the good work!
(Transferred comment)
For US state/county data, how about NYT repo: https://github.com/nytimes/covid-19-data ?
(Transferred comment)
@cristipp - good find. I haven't looked at the actual data, but the README is encouraging.
(Transferred comment)
Rows: 1884
Columns: date, state, fips, cases, deaths
[
{date: '2020-03-01', state: 'New York', fips: '36', cases: '1', deaths: '0'},
{date: '2020-03-02', state: 'New York', fips: '36', cases: '1', deaths: '0'},
{date: '2020-03-03', state: 'New York', fips: '36', cases: '2', deaths: '0'},
{date: '2020-03-04', state: 'New York', fips: '36', cases: '11', deaths: '0'},
{date: '2020-03-05', state: 'New York', fips: '36', cases: '22', deaths: '0'},
{date: '2020-03-06', state: 'New York', fips: '36', cases: '44', deaths: '0'},
{date: '2020-03-07', state: 'New York', fips: '36', cases: '89', deaths: '0'},
{date: '2020-03-08', state: 'New York', fips: '36', cases: '106', deaths: '0'},
{date: '2020-03-09', state: 'New York', fips: '36', cases: '142', deaths: '0'},
...
]
And for county level:
[
{date: '2020-03-28', county: 'Albany', state: 'New York', fips: '36001', cases: '195', deaths: '1'},
{date: '2020-03-29', county: 'Albany', state: 'New York', fips: '36001', cases: '205', deaths: '1'},
{date: '2020-03-30', county: 'Albany', state: 'New York', fips: '36001', cases: '217', deaths: '1'},
{date: '2020-03-31', county: 'Albany', state: 'New York', fips: '36001', cases: '226', deaths: '1'},
{date: '2020-04-01', county: 'Albany', state: 'New York', fips: '36001', cases: '240', deaths: '2'},
{date: '2020-04-02', county: 'Albany', state: 'New York', fips: '36001', cases: '253', deaths: '2'},
{date: '2020-04-03', county: 'Albany', state: 'New York', fips: '36001', cases: '267', deaths: '4'},
{date: '2020-04-04', county: 'Albany', state: 'New York', fips: '36001', cases: '293', deaths: '6'},
{date: '2020-04-05', county: 'Albany', state: 'New York', fips: '36001', cases: '305', deaths: '8'},
{date: '2020-04-01', county: 'Allegany', state: 'New York', fips: '36003', cases: '10', deaths: '1'},
...
]
They also have NYC as a separate entry [empty fips]:
{date: '2020-03-14', county: 'New York City', state: 'New York', fips: '', cases: '269', deaths: '1'},
{date: '2020-03-15', county: 'New York City', state: 'New York', fips: '', cases: '330', deaths: '5'},
{date: '2020-03-16', county: 'New York City', state: 'New York', fips: '', cases: '464', deaths: '7'},
{date: '2020-03-17', county: 'New York City', state: 'New York', fips: '', cases: '645', deaths: '10'},
{date: '2020-03-18', county: 'New York City', state: 'New York', fips: '', cases: '1339', deaths: '20'},
{date: '2020-03-19', county: 'New York City', state: 'New York', fips: '', cases: '2468', deaths: '22'},
(Transferred comment)
That looks encouraging, @lazd what do you think? @cristipp, do you feel you could take a shot at writing a scraper for this data?
(Transferred comment)
I'd be happy to. Though I see it already appearing in the https://coronadatascraper.com/#crosscheck for many [all?] counties. Perhaps you don't have it for state-level data?
(Transferred comment)
Oh, I found NY at state level too: https://coronadatascraper.com/#crosscheck:iso2:US-NY-iso1:US. It appears the scraper prefers the arcgis dataset for some reason.
FWIW, looks like most recent data + deaths + tested for iso2:US-NY is coming from https://covidtracking.com, see https://coronadatascraper.com/#crosscheck:iso2:US-NY-iso1:US.
source | cases | deaths | tested | recovered |
---|---|---|---|---|
https://health.data.ny.gov/api/views/xdss-u53e/rows.csv?accessType=DOWNLOAD | 130689 | - | 320811 | - |
https://covidtracking.com/api/states | 130689 | 4758 | 320811 | - |
https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv | 122911 | 3483 | - | - |
https://github.com/CSSEGISandData/COVID-19 | 131815 | 4389 | - | - |
(Transferred comment)
I believe that @hyperknot has closed out this issue by upping the priority of the covidtracking scraper. @cristipp, what's your feeling?
(Transferred comment)
Good to have the state level data fixed. We're still lacking county level data for NY fatalities [0].
The county level fatalities can be pulled from NYT [1] or USAFacts [2], with the quirk that NYT sums the 5 counties of NYC together. Also note these sources don't report 'tested', which CoronaDataScraper does.
We have #876, which adds county-level fatalities data from NYT. Alas, that PR does not fill in data for NYC counties [Kings, Queens, New York, Bronx, Richmond, see https://en.wikipedia.org/wiki/Boroughs_of_New_York_City] because of the aforementioned quirk, which makes it less useful than I'd like.
There is a general question of whether CoronaDataScraper wants to fall back to other meta-aggregators for missing data, as a post-processing step on a cell-by-cell basis. Think of an extra field 'fallback: true', which enables the fallback behavior on a source-by-source basis, in priority order. Then we could add nyt, usafacts and whatnot. Would that be something of interest?
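To make the proposal concrete, here is a rough Python sketch of a cell-by-cell fallback pass. Everything in it is hypothetical: the `fill_from_fallbacks` helper, the `fallback` and `priority` fields on a source, and the priority values are assumptions for illustration only. The Nassau County numbers come from the listings in [0], [1] and [2].

```python
# Sketch only: names, the `fallback` flag and the priorities are hypothetical.
def fill_from_fallbacks(primary, sources, metrics=("cases", "deaths", "tested")):
    """Fill cells missing from `primary` using sources flagged fallback=True."""
    filled = {key: dict(row) for key, row in primary.items()}
    # Only opted-in sources take part, highest priority first.
    enabled = [s for s in sources if s.get("fallback")]
    for source in sorted(enabled, key=lambda s: s["priority"], reverse=True):
        for key, row in source["rows"].items():
            if key not in filled:
                continue  # only fill cells for locations we already track
            for metric in metrics:
                if filled[key].get(metric) is None and row.get(metric) is not None:
                    filled[key][metric] = row[metric]
    return filled


primary = {
    "US-NY-Nassau County": {"cases": 31555, "tested": 74571, "deaths": None},
}
sources = [
    {"name": "usafacts", "fallback": True, "priority": 1,
     "rows": {"US-NY-Nassau County": {"cases": 31555, "deaths": 1431}}},
    # NYT's own key is 'US-NY-Nassau'; it is normalized here for illustration.
    {"name": "nyt", "fallback": False, "priority": 2,
     "rows": {"US-NY-Nassau County": {"cases": 31555, "deaths": 1764}}},
]

result = fill_from_fallbacks(primary, sources)
print(result["US-NY-Nassau County"])
# {'cases': 31555, 'tested': 74571, 'deaths': 1431}
```

Only the missing `deaths` cell is filled, and only from the source that opted in. Existing cells are never overwritten, and keys would need normalizing across sources before a pass like this could work on real data.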
[0] coronadatascraper: [
{key: 'US-NY', date: '2020-04-22', cases: 257216, tested: 669982, deaths: 15302},
{key: 'US-NY-Queens County', date: '2020-04-22', cases: 43713, tested: 88388, deaths: undefined},
{key: 'US-NY-Kings County', date: '2020-04-22', cases: 38481, tested: 81787, deaths: undefined},
{key: 'US-NY-Nassau County', date: '2020-04-22', cases: 31555, tested: 74571, deaths: undefined},
{key: 'US-NY-Bronx County', date: '2020-04-22', cases: 30868, tested: 65304, deaths: undefined},
{key: 'US-NY-Suffolk County', date: '2020-04-22', cases: 28854, tested: 71268, deaths: undefined},
{key: 'US-NY-Westchester County', date: '2020-04-22', cases: 25276, tested: 76564, deaths: undefined},
{key: 'US-NY-New York County', date: '2020-04-22', cases: 19025, tested: 49687, deaths: undefined},
{key: 'US-NY-Richmond County', date: '2020-04-22', cases: 10345, tested: 26289, deaths: undefined},
{key: 'US-NY-Rockland County', date: '2020-04-22', cases: 9699, tested: 23150, deaths: undefined},
... 53 more
]
[1] nyt: [
{key: 'US-NY', date: '2020-04-22', cases: 257246, deaths: 15302},
{key: 'US-NY-New York City', date: '2020-04-22', cases: 142442, deaths: 10614},
{key: 'US-NY-Nassau', date: '2020-04-22', cases: 31555, deaths: 1764},
{key: 'US-NY-Suffolk', date: '2020-04-22', cases: 28854, deaths: 959},
{key: 'US-NY-Westchester', date: '2020-04-22', cases: 25275, deaths: 932},
{key: 'US-NY-Rockland', date: '2020-04-22', cases: 9699, deaths: 309},
{key: 'US-NY-Orange', date: '2020-04-22', cases: 6705, deaths: 183},
{key: 'US-NY-Dutchess', date: '2020-04-22', cases: 2391, deaths: 57},
{key: 'US-NY-Erie', date: '2020-04-22', cases: 2233, deaths: 174},
{key: 'US-NY-Monroe', date: '2020-04-22', cases: 1112, deaths: 72},
... 50 more
]
[2] usafacts: [
{key: 'US-NY-Queens County', date: '2020-04-22', cases: 43713, deaths: 3432},
{key: 'US-NY-Kings County', date: '2020-04-22', cases: 38481, deaths: 3458},
{key: 'US-NY-Nassau County', date: '2020-04-22', cases: 31555, deaths: 1431},
{key: 'US-NY-Bronx County', date: '2020-04-22', cases: 31130, deaths: 2258},
{key: 'US-NY-Suffolk County', date: '2020-04-22', cases: 28854, deaths: 926},
{key: 'US-NY-Westchester County', date: '2020-04-22', cases: 25276, deaths: 838},
{key: 'US-NY-New York County', date: '2020-04-22', cases: 19025, deaths: 1337},
{key: 'US-NY-Richmond County', date: '2020-04-22', cases: 10405, deaths: 492},
{key: 'US-NY-Rockland County', date: '2020-04-22', cases: 9699, deaths: 334},
{key: 'US-NY-Orange County', date: '2020-04-22', cases: 6690, deaths: 185},
... 54 more
]
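As a quick sanity check on the NYC quirk, the five borough counties from the USAFacts listing in [2] can be summed and compared with NYT's single "New York City" row in [1]. This is a throwaway Python sketch; the `sum_nyc` helper and the dict layout are made up for illustration.

```python
# Sketch only: sum_nyc and the data layout are hypothetical.
NYC_COUNTIES = ("Kings", "Queens", "New York", "Bronx", "Richmond")

# USAFacts rows for 2020-04-22, copied from [2].
usafacts = {
    "US-NY-Queens County": {"cases": 43713, "deaths": 3432},
    "US-NY-Kings County": {"cases": 38481, "deaths": 3458},
    "US-NY-Bronx County": {"cases": 31130, "deaths": 2258},
    "US-NY-New York County": {"cases": 19025, "deaths": 1337},
    "US-NY-Richmond County": {"cases": 10405, "deaths": 492},
}

# NYT's combined "New York City" row for the same date, copied from [1].
nyt_nyc = {"cases": 142442, "deaths": 10614}


def sum_nyc(rows):
    """Sum cases and deaths over the five NYC borough counties."""
    total = {"cases": 0, "deaths": 0}
    for name in NYC_COUNTIES:
        row = rows[f"US-NY-{name} County"]
        total["cases"] += row["cases"]
        total["deaths"] += row["deaths"]
    return total


nyc = sum_nyc(usafacts)
print(nyc)  # {'cases': 142754, 'deaths': 10977}
print(nyc["cases"] - nyt_nyc["cases"], nyc["deaths"] - nyt_nyc["deaths"])  # 312 363
```

The totals are close but not identical, so the two sources can be cross-checked at the city level, but NYT's combined row cannot be split back into per-borough numbers.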
(Transferred comment)
Hi @cristipp - getting back to this one after a long delay!
The reports from Li at https://covidatlas.com/data merge data sources by priority. If a lower-priority source supplies a data point that no higher-pri source has, that value is preserved, and we also give the final source selected for each data point (see timeseries-byLocation.json). I believe we're doing what you've suggested.
I believe this issue can be closed -- thoughts?
Original issue https://github.com/covidatlas/coronadatascraper/issues/418, transferred here on Friday Mar 27, 2020 at 06:58 GMT
States with reported deaths that are not in today's data:
Compared to https://coronavirus.1point3acres.com/en