covidatlas / li

Next-generation serverless crawler for COVID-19 data
Apache License 2.0

Number of US states are missing deaths/tested #398

Closed jzohrab closed 4 years ago

jzohrab commented 4 years ago

Original issue https://github.com/covidatlas/coronadatascraper/issues/418, transferred here on Friday Mar 27, 2020 at 06:58 GMT


States with reported deaths that are not in today's data:

Compared to https://coronavirus.1point3acres.com/en

jzohrab commented 4 years ago

(Transferred comment)

And tested I think as well.

In the report.json for NY I see:

country:"USA"
url:"https://covidtracking.com/api/states"
type:"json"
curators:Array[1]
aggregate:"state"
priority:-0.5
timeseries:false
headless:false
certValidation:true
state:"NY"
deaths:385
tested:122104
cases:37258
ssl:true
rating:0.49019607843137253

but in byLocation.json, I see:

      "2020-3-26": {
        "cases": 37258,
        "growthFactor": 1.209243452013891
      }

Same for Texas

jzohrab commented 4 years ago

(Transferred comment)

They're also reported with aggregate: "county". I think they're deduped against counties because all those states have both a county and a state entry with that name.

jzohrab commented 4 years ago

(Transferred comment)

@lazd sorry to poke you, but I think this is pretty severe and I'd like to make sure it doesn't escape your attention before the next update. Can you mark it with appropriate labels?

jzohrab commented 4 years ago

(Transferred comment)

No worries @zbraniecki. As part of #410, I will try to get tested data back via COVIDTracking. That said, I don't think we can get deaths unless we have a source for them somewhere.

jzohrab commented 4 years ago

(Transferred comment)

COVIDTracking has deaths for those states:

Will that also fix it?

jzohrab commented 4 years ago

(Transferred comment)

And #410 seems to be about combining data from two sources. My suspicion is that this bug is about two sources for two different things (county vs. state) ending up conflated as two sources of the same thing, and the county one wins.

Here's what's in report.json for "NY, USA":

      "NY, USA": [
        {
          "country": "USA",
          "url": "https://covidtracking.com/api/states",
          "type": "json",
          "curators": [
            {
              "name": "The COVID Tracking Project",
              "url": "https://covidtracking.com/",
              "twitter": "@COVID19Tracking",
              "github": "COVID19Tracking"
            }
          ],
          "aggregate": "state",
          "priority": -0.5,
          "timeseries": false,
          "headless": false,
          "certValidation": true,
          "state": "NY",
          "deaths": 385,
          "tested": 122104,
          "cases": 37258,
          "ssl": true,
          "rating": 0.49019607843137253
        },
        {
          "state": "NY",
          "country": "USA",
          "type": "table",
          "aggregate": "county",
          "timeseries": false,
          "headless": false,
          "certValidation": true,
          "priority": 0,
          "url": "https://coronavirus.health.ny.gov/county-county-breakdown-positive-cases",
          "cases": 37258,
          "ssl": true,
          "rating": 0.3137254901960784
        }
      ],

I think this is a different thing than #410. Mainly, those counties should not be conflated with states, and "NY, USA" should be a state and collect state data.

jzohrab commented 4 years ago

(Transferred comment)

No, those counties' cases are rolled up into state totals, which is exactly what we want to do. However, testing numbers aren't reported on a per-county basis, so they're not getting rolled up.

So what we want to do is take our rolled up case numbers and take COVIDTracking's testing numbers, which is what #410 is about.
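
A minimal sketch of that combination, assuming records shaped like the report.json entries above. The helper names `roll_up_counties` and `fill_from_state_source` are hypothetical and are not taken from the scraper, and the two county rows are a made-up split of the NY case total quoted earlier.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[int]]
FIELDS = ("cases", "deaths", "tested")


def roll_up_counties(counties: List[Record]) -> Record:
    """Sum each field across counties; a field no county reports stays None."""
    totals: Record = {}
    for field in FIELDS:
        values = [c[field] for c in counties if c.get(field) is not None]
        totals[field] = sum(values) if values else None
    return totals


def fill_from_state_source(rolled_up: Record, state_record: Record) -> Record:
    """Keep rolled-up values; use the state-level source only to fill gaps."""
    merged: Record = {}
    for field in FIELDS:
        value = rolled_up.get(field)
        merged[field] = value if value is not None else state_record.get(field)
    return merged


# Illustrative county rows (made-up split; only cases are reported per county).
counties = [{"cases": 20000}, {"cases": 17258}]
# State-level values as quoted from the COVIDTracking entry in report.json.
covidtracking_ny = {"cases": 37258, "deaths": 385, "tested": 122104}

rolled_up = roll_up_counties(counties)
merged = fill_from_state_source(rolled_up, covidtracking_ny)
print(rolled_up)
print(merged)
```

In this sketch the rolled-up case count is kept and only the fields the counties never report (deaths, tested) come from the state-level source. If only some counties reported a field, the sum would be a partial total, which a real implementation would need to detect.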

jzohrab commented 4 years ago

(Transferred comment)

Just as another data point, I'm still seeing deaths == nan for all of NY state and city.

# omitted: load `df` from `timeseries.csv`, parse dates, drop lat/lon/url columns
>>> usall = df[(df["country"] == "USA")]
>>> usall[(usall.deaths.notnull()) & (usall.deaths>0)].state.unique()
array(['WA', 'CA', 'MA', 'GA', 'FL', 'NJ', 'OR', 'IL', 'PA', 'IA', 'NC',
       'SC', 'IN', 'KY', 'NV', 'OH', 'WI', 'CT', 'HI', 'OK', 'UT', 'KS',
       'LA', 'MO', 'VT', 'AR', 'ID', 'ME', 'MI', 'MS', 'NM', 'ND', 'SD',
       'CO', nan, 'VA', 'DC', 'AL', 'PR', 'GU', 'AK', 'MN'], dtype=object)
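
The complementary check (which states never show a positive death count) could be sketched with the standard library alone. This assumes `timeseries.csv` has `state`, `country`, and `deaths` columns as used above; the inline sample rows and the function name `states_missing_deaths` are hypothetical stand-ins, not real data from the file.

```python
import csv
import io

# Hypothetical stand-in for timeseries.csv; values are illustrative only.
SAMPLE = """state,country,date,cases,deaths
WA,USA,2020-03-26,100,5
NY,USA,2020-03-26,37258,
TX,USA,2020-03-26,50,
,USA,2020-03-26,999,10
"""


def states_missing_deaths(csv_text, country="USA"):
    """Return states of `country` with no row where deaths is present and > 0."""
    all_states = set()
    with_deaths = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["country"] != country or not row["state"]:
            continue  # skip other countries and country-level rows
        all_states.add(row["state"])
        deaths = row["deaths"].strip()
        if deaths and float(deaths) > 0:
            with_deaths.add(row["state"])
    return sorted(all_states - with_deaths)


print(states_missing_deaths(SAMPLE))
```

Like the filter above, this treats a reported value of 0 the same as a missing one, so a state that genuinely has zero deaths would also be listed.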

(Personally I'm less interested in tested, except to the extent that it's caused by the same underlying issue... reports on number tested have been inconsistent across most aggregators; cases+deaths have been more reliable).

Also, thanks for putting this dataset together; I've been lurking for a while and am impressed with the work y'all're putting in. Unfortunately I just migrated the daily updates I send to friends and family on cases+deaths (in the states we live in) to use this timeseries data; bad timing I guess :)

Good luck with the fix, and thanks again!

jzohrab commented 4 years ago

(Transferred comment)

@mvanmidd we don't have a source for deaths in NY on a per-county basis: https://coronavirus.health.ny.gov/county-county-breakdown-positive-cases

NYC only notes deaths for the entire city: https://www1.nyc.gov/assets/doh/downloads/pdf/imm/covid-19-daily-data-summary.pdf

We can pull deaths for NYC from the daily update PDF, but we're out of luck for the rest of the New York counties. Until we implement #410, we won't be pulling deaths for NY state either, unfortunately.

jzohrab commented 4 years ago

(Transferred comment)

Gotcha, thanks for the update. #410 seems like a big one, good luck! y'all are going to have a fully featured generic/configurable ETL framework pretty soon :)

In all seriousness, I think the auditable data aggregation is the biggest strength of this project... there's plenty of frontend work going on elsewhere (e.g. the explosion of "babby's first plotly visualizations," including my own), and on the backend, lots of data sources that are either incomplete or opaque. Keep up the good work!

jzohrab commented 4 years ago

(Transferred comment)

For US state/county data, how about NYT repo: https://github.com/nytimes/covid-19-data ?

jzohrab commented 4 years ago

(Transferred comment)

@cristipp - good find. I haven't looked at the actual data, but the README is encouraging.

jzohrab commented 4 years ago

(Transferred comment)

Rows: 1884
Columns: date, state, fips, cases, deaths
[
    {date: '2020-03-01', state: 'New York', fips: '36', cases: '1', deaths: '0'},
    {date: '2020-03-02', state: 'New York', fips: '36', cases: '1', deaths: '0'},
    {date: '2020-03-03', state: 'New York', fips: '36', cases: '2', deaths: '0'},
    {date: '2020-03-04', state: 'New York', fips: '36', cases: '11', deaths: '0'},
    {date: '2020-03-05', state: 'New York', fips: '36', cases: '22', deaths: '0'},
    {date: '2020-03-06', state: 'New York', fips: '36', cases: '44', deaths: '0'},
    {date: '2020-03-07', state: 'New York', fips: '36', cases: '89', deaths: '0'},
    {date: '2020-03-08', state: 'New York', fips: '36', cases: '106', deaths: '0'},
    {date: '2020-03-09', state: 'New York', fips: '36', cases: '142', deaths: '0'},
     ...
]

And for county level:

[
    {date: '2020-03-28', county: 'Albany', state: 'New York', fips: '36001', cases: '195', deaths: '1'},
    {date: '2020-03-29', county: 'Albany', state: 'New York', fips: '36001', cases: '205', deaths: '1'},
    {date: '2020-03-30', county: 'Albany', state: 'New York', fips: '36001', cases: '217', deaths: '1'},
    {date: '2020-03-31', county: 'Albany', state: 'New York', fips: '36001', cases: '226', deaths: '1'},
    {date: '2020-04-01', county: 'Albany', state: 'New York', fips: '36001', cases: '240', deaths: '2'},
    {date: '2020-04-02', county: 'Albany', state: 'New York', fips: '36001', cases: '253', deaths: '2'},
    {date: '2020-04-03', county: 'Albany', state: 'New York', fips: '36001', cases: '267', deaths: '4'},
    {date: '2020-04-04', county: 'Albany', state: 'New York', fips: '36001', cases: '293', deaths: '6'},
    {date: '2020-04-05', county: 'Albany', state: 'New York', fips: '36001', cases: '305', deaths: '8'},
    {date: '2020-04-01', county: 'Allegany', state: 'New York', fips: '36003', cases: '10', deaths: '1'},
   ...
]

They also have NYC as a separate entry [empty fips]:

    {date: '2020-03-14', county: 'New York City', state: 'New York', fips: '', cases: '269', deaths: '1'},
    {date: '2020-03-15', county: 'New York City', state: 'New York', fips: '', cases: '330', deaths: '5'},
    {date: '2020-03-16', county: 'New York City', state: 'New York', fips: '', cases: '464', deaths: '7'},
    {date: '2020-03-17', county: 'New York City', state: 'New York', fips: '', cases: '645', deaths: '10'},
    {date: '2020-03-18', county: 'New York City', state: 'New York', fips: '', cases: '1339', deaths: '20'},
    {date: '2020-03-19', county: 'New York City', state: 'New York', fips: '', cases: '2468', deaths: '22'},

jzohrab commented 4 years ago

(Transferred comment)

That looks encouraging, @lazd what do you think? @cristipp, do you feel you could take a shot at writing a scraper for this data?

jzohrab commented 4 years ago

(Transferred comment)

I'd be happy to. Though I see it already appearing in the https://coronadatascraper.com/#crosscheck for many [all?] counties. Perhaps you don't have it for state-level data?

jzohrab commented 4 years ago

(Transferred comment)

Oh, I found NY at state level too: https://coronadatascraper.com/#crosscheck:iso2:US-NY-iso1:US. It appears the scraper prefers the arcgis dataset for some reason.

FWIW, looks like most recent data + deaths + tested for iso2:US-NY is coming from https://covidtracking.com, see https://coronadatascraper.com/#crosscheck:iso2:US-NY-iso1:US.

cases   deaths  tested  recovered  source
130689  -       320811  -          https://health.data.ny.gov/api/views/xdss-u53e/rows.csv?accessType=DOWNLOAD
130689  4758    320811  -          https://covidtracking.com/api/states
122911  3483    -       -          https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
131815  4389    -       -          https://github.com/CSSEGISandData/COVID-19

jzohrab commented 4 years ago

(Transferred comment)

I believe that @hyperknot has closed out this issue by upping the priority of the covidtracking scraper. @cristipp, what's your feeling?

jzohrab commented 4 years ago

(Transferred comment)

Good to have the state level data fixed. We're still lacking county level data for NY fatalities [0].

The county level fatalities can be pulled from NYT [1] or USAFacts [2], with the quirk that NYT sums the 5 counties of NYC together. Also note these sources don't report 'tested', which CoronaDataScraper does.

    [0] coronadatascraper: [
        {key: 'US-NY', date: '2020-04-22', cases: 257216, tested: 669982, deaths: 15302},
        {key: 'US-NY-Queens County', date: '2020-04-22', cases: 43713, tested: 88388, deaths: undefined},
        {key: 'US-NY-Kings County', date: '2020-04-22', cases: 38481, tested: 81787, deaths: undefined},
        {key: 'US-NY-Nassau County', date: '2020-04-22', cases: 31555, tested: 74571, deaths: undefined},
        {key: 'US-NY-Bronx County', date: '2020-04-22', cases: 30868, tested: 65304, deaths: undefined},
        {key: 'US-NY-Suffolk County', date: '2020-04-22', cases: 28854, tested: 71268, deaths: undefined},
        {key: 'US-NY-Westchester County', date: '2020-04-22', cases: 25276, tested: 76564, deaths: undefined},
        {key: 'US-NY-New York County', date: '2020-04-22', cases: 19025, tested: 49687, deaths: undefined},
        {key: 'US-NY-Richmond County', date: '2020-04-22', cases: 10345, tested: 26289, deaths: undefined},
        {key: 'US-NY-Rockland County', date: '2020-04-22', cases: 9699, tested: 23150, deaths: undefined},
        ... 53 more
    ]
    [1] nyt: [
        {key: 'US-NY', date: '2020-04-22', cases: 257246, deaths: 15302},
        {key: 'US-NY-New York City', date: '2020-04-22', cases: 142442, deaths: 10614},
        {key: 'US-NY-Nassau', date: '2020-04-22', cases: 31555, deaths: 1764},
        {key: 'US-NY-Suffolk', date: '2020-04-22', cases: 28854, deaths: 959},
        {key: 'US-NY-Westchester', date: '2020-04-22', cases: 25275, deaths: 932},
        {key: 'US-NY-Rockland', date: '2020-04-22', cases: 9699, deaths: 309},
        {key: 'US-NY-Orange', date: '2020-04-22', cases: 6705, deaths: 183},
        {key: 'US-NY-Dutchess', date: '2020-04-22', cases: 2391, deaths: 57},
        {key: 'US-NY-Erie', date: '2020-04-22', cases: 2233, deaths: 174},
        {key: 'US-NY-Monroe', date: '2020-04-22', cases: 1112, deaths: 72},
        ... 50 more
    ]
    [2] usafacts: [
        {key: 'US-NY-Queens County', date: '2020-04-22', cases: 43713, deaths: 3432},
        {key: 'US-NY-Kings County', date: '2020-04-22', cases: 38481, deaths: 3458},
        {key: 'US-NY-Nassau County', date: '2020-04-22', cases: 31555, deaths: 1431},
        {key: 'US-NY-Bronx County', date: '2020-04-22', cases: 31130, deaths: 2258},
        {key: 'US-NY-Suffolk County', date: '2020-04-22', cases: 28854, deaths: 926},
        {key: 'US-NY-Westchester County', date: '2020-04-22', cases: 25276, deaths: 838},
        {key: 'US-NY-New York County', date: '2020-04-22', cases: 19025, deaths: 1337},
        {key: 'US-NY-Richmond County', date: '2020-04-22', cases: 10405, deaths: 492},
        {key: 'US-NY-Rockland County', date: '2020-04-22', cases: 9699, deaths: 334},
        {key: 'US-NY-Orange County', date: '2020-04-22', cases: 6690, deaths: 185},
        ... 54 more
    ]
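
To compare the two sources despite the NYC quirk, the five borough counties could be summed into one row before matching against NYT's "New York City" entry. A rough sketch, assuming rows keyed like the usafacts dump in [2]; the function name `combine_nyc` and the output key are hypothetical, and the sample rows are copied from the lists above.

```python
# The five counties that make up New York City, keyed as in the usafacts dump.
NYC_COUNTIES = {
    "US-NY-Bronx County",
    "US-NY-Kings County",
    "US-NY-New York County",
    "US-NY-Queens County",
    "US-NY-Richmond County",
}
NYC_KEY = "US-NY-New York City"  # key used by the nyt dump


def combine_nyc(rows):
    """Replace the five NYC county rows with one summed 'New York City' row."""
    nyc_rows = [r for r in rows if r["key"] in NYC_COUNTIES]
    others = [r for r in rows if r["key"] not in NYC_COUNTIES]
    if not nyc_rows:
        return others
    combined = {
        "key": NYC_KEY,
        "date": nyc_rows[0]["date"],  # assumes all rows share a single date
        "cases": sum(r["cases"] for r in nyc_rows),
        "deaths": sum(r["deaths"] for r in nyc_rows),
    }
    return [combined] + others


usafacts = [
    {"key": "US-NY-Queens County", "date": "2020-04-22", "cases": 43713, "deaths": 3432},
    {"key": "US-NY-Kings County", "date": "2020-04-22", "cases": 38481, "deaths": 3458},
    {"key": "US-NY-Nassau County", "date": "2020-04-22", "cases": 31555, "deaths": 1431},
    {"key": "US-NY-Bronx County", "date": "2020-04-22", "cases": 31130, "deaths": 2258},
    {"key": "US-NY-New York County", "date": "2020-04-22", "cases": 19025, "deaths": 1337},
    {"key": "US-NY-Richmond County", "date": "2020-04-22", "cases": 10405, "deaths": 492},
]

combined_rows = combine_nyc(usafacts)
print(combined_rows[0])
```

With the rows quoted above, the summed usafacts figures for NYC still differ from the nyt row (142442 cases, 10614 deaths), so the two sources disagree beyond the grouping itself.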

jzohrab commented 4 years ago

(Transferred comment)

Hi @cristipp - getting back to this one after a long delay!

The reports from Li at https://covidatlas.com/data merge data sources by priority. If a lower-priority source supplies a data point that no higher-priority source has, that value is preserved, and we also record the final source selected for each data point (see timeseries-byLocation.json). I believe we're doing what you've suggested.
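
As an illustration only (this is not Li's actual code), a per-field priority merge could look like the sketch below. It assumes a larger `priority` number wins, which matches the earlier NY example where the county source (priority 0) beat COVIDTracking (priority -0.5); the function name `merge_by_priority` is hypothetical, and the two records reuse the values from the report.json excerpt earlier in this thread.

```python
FIELDS = ("cases", "deaths", "tested")


def merge_by_priority(records):
    """Pick each field from the highest-priority record that supplies it.

    Returns (merged_values, source_url_per_field).
    """
    merged = {}
    sources = {}
    # Assumption: a larger priority number means the source is preferred.
    for rec in sorted(records, key=lambda r: r["priority"], reverse=True):
        for field in FIELDS:
            if field not in merged and rec.get(field) is not None:
                merged[field] = rec[field]
                sources[field] = rec["url"]
    return merged, sources


records = [
    {
        "url": "https://covidtracking.com/api/states",
        "priority": -0.5,
        "cases": 37258,
        "deaths": 385,
        "tested": 122104,
    },
    {
        "url": "https://coronavirus.health.ny.gov/county-county-breakdown-positive-cases",
        "priority": 0,
        "cases": 37258,
    },
]

merged, sources = merge_by_priority(records)
print(merged)
print(sources)
```

Here cases come from the higher-priority NY health page, while deaths and tested fall through to COVIDTracking because no higher-priority source supplies them.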

I believe this issue can be closed -- thoughts?

jzohrab commented 4 years ago

Dupe of https://github.com/covidatlas/li/issues/546.