covidatlas / li

Next-generation serverless crawler for COVID-19 data
Apache License 2.0

Death data is wrong on 6/26 for the San Francisco Bay Area counties #363

Open kengo-sony opened 4 years ago

kengo-sony commented 4 years ago

Other counties show an incorrectly huge number of deaths on 6/26 only.

jzohrab commented 4 years ago

Well that's total garbage. Thanks @kengo-sony for the issue!

A request: if you're using timeseries-byLocation.json, in the future when reporting data issues, please also include the dateSources section of the file for the location in question ... it helps me determine where to look. :-) In this case, that section says "2020-04-15..2020-07-29": "us-covidtracking", so the us-covidtracking source is the one causing the trouble.

Cheers, looking into it! jz

jzohrab commented 4 years ago

Correction, it was actually "2020-03-21..2020-07-31": "us-ca-mercury-news".
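For anyone checking which source supplied a given day, here is a minimal sketch of reading a dateSources section. It assumes keys are either a single ISO date or an inclusive `start..end` range, as in the excerpts quoted in this thread. The helper names `parse_range` and `sources_for` are hypothetical and are not part of li.

```python
from datetime import date


def parse_range(key):
    """Split a dateSources key into inclusive (start, end) dates.

    Accepts 'YYYY-MM-DD..YYYY-MM-DD' or a single 'YYYY-MM-DD'.
    """
    start, _, end = key.partition("..")
    return date.fromisoformat(start), date.fromisoformat(end or start)


def sources_for(date_sources, day):
    """Return the dateSources value covering `day`, or None if no key covers it.

    The value is either a source name (str) or a dict mapping
    source name -> list of fields taken from that source.
    """
    target = date.fromisoformat(day)
    for key, value in date_sources.items():
        start, end = parse_range(key)
        if start <= target <= end:
            return value
    return None


# dateSources for Alameda County, copied from the excerpt in this thread.
ALAMEDA_DATE_SOURCES = {
    "2020-01-24..2020-02-29": "jhu-usa",
    "2020-03-01..2020-03-20": {"jhu-usa": ["deaths"], "nyt": ["cases"]},
    "2020-03-21..2020-07-31": "us-ca-mercury-news",
    "2020-08-01": "jhu-usa",
}

print(sources_for(ALAMEDA_DATE_SOURCES, "2020-06-26"))  # us-ca-mercury-news
```

Including the returned value in a report identifies the scraper to inspect, which is what the request above asks for.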

jzohrab commented 4 years ago

Should be fixed in #366. I'll launch it to prod and the data should be regenerated in at most a few days.

jzohrab commented 4 years ago

Thanks again @kengo-sony ! jz

kengoy commented 4 years ago

Thanks for your quick fix @jzohrab ! 6/26 data looks good now.

Yes, I will make sure to include the dateSources section when reporting a data issue.

Thanks again!

kengoy commented 4 years ago

Sorry again, @jzohrab.

I see another issue in Alameda County. It could be a side effect of the fix. "cases" suddenly becomes small from 6/24; it looks like a daily case count instead of a cumulative number, and it turns back into a cumulative number on 8/1.

  "2020-06-22": {
    "cases": 5007,
    "deaths": 120,
    "growthFactor": 1.04
  },
  "2020-06-23": {
    "cases": 5140,
    "deaths": 120,
    "growthFactor": 1.03
  },
  "2020-06-24": {
    "cases": 5,
    "deaths": 122,
    "growthFactor": 0
  },
  "2020-06-25": {
    "cases": 5,
    "deaths": 128,
    "growthFactor": 1
  },

...

  "2020-07-31": {
    "cases": 11,
    "deaths": 182,
    "growthFactor": 1
  },
  "2020-08-01": {
    "cases": 11131,
    "deaths": 182,
    "growthFactor": 1011.91
  }
},
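A pattern like the one in this excerpt can be spotted mechanically, because a cumulative count should never decrease. The sketch below is a hypothetical check, not part of li. The function name and the `max_jump` threshold of 10x are assumptions chosen for illustration. It flags any day whose "cases" value is lower than the previous reported value, or more than `max_jump` times larger.

```python
def find_cumulative_anomalies(series, max_jump=10.0):
    """Return (date, reason) pairs for suspicious 'cases' values.

    `series` maps ISO date strings to entries like {"cases": 5007, ...}.
    A 'drop' is a value lower than the previous reported value; a 'jump'
    is a value more than `max_jump` times the previous one.
    """
    anomalies = []
    previous = None
    for day in sorted(series):  # ISO dates sort chronologically as strings
        cases = series[day].get("cases")
        if cases is None:
            continue
        if previous is not None:
            if cases < previous:
                anomalies.append((day, "drop"))
            elif previous > 0 and cases > previous * max_jump:
                anomalies.append((day, "jump"))
        previous = cases
    return anomalies


# "cases" values for Alameda County, copied from the excerpt above.
ALAMEDA_CASES = {
    "2020-06-22": {"cases": 5007},
    "2020-06-23": {"cases": 5140},
    "2020-06-24": {"cases": 5},
    "2020-06-25": {"cases": 5},
    "2020-07-31": {"cases": 11},
    "2020-08-01": {"cases": 11131},
}

print(find_cumulative_anomalies(ALAMEDA_CASES))
# [('2020-06-24', 'drop'), ('2020-08-01', 'jump')]
```

The two flagged dates line up with the boundaries of the us-ca-mercury-news range in the dateSources section.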

Here is the data source.

"dateSources": {
  "2020-01-24..2020-02-29": "jhu-usa",
  "2020-03-01..2020-03-20": {
    "jhu-usa": [
      "deaths"
    ],
    "nyt": [
      "cases"
    ]
  },
  "2020-03-21..2020-07-31": "us-ca-mercury-news",
  "2020-08-01": "jhu-usa"
},

1ec5 commented 4 years ago

I’m seeing something similar with case counts, except it doesn’t go back to being a cumulative number: #370. I’m also seeing more deaths than cases for the following California counties:

and more recoveries than cases for the following counties:

and no cases for the following counties that have had cases:
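Checks of the kind described in this comment can be sketched as a comparison of one field against "cases" on each date. This is a hypothetical helper, not part of li. The field name is passed in by the caller because the thread does not show the name of the recoveries field.

```python
def dates_where_field_exceeds_cases(series, field):
    """Return the dates on which `field` is larger than 'cases'.

    Entries missing either value are skipped rather than flagged.
    """
    flagged = []
    for day in sorted(series):
        entry = series[day]
        cases = entry.get("cases")
        value = entry.get(field)
        if cases is not None and value is not None and value > cases:
            flagged.append(day)
    return flagged


# Alameda County values copied from the excerpt earlier in this thread.
ALAMEDA = {
    "2020-06-23": {"cases": 5140, "deaths": 120},
    "2020-06-24": {"cases": 5, "deaths": 122},
    "2020-06-25": {"cases": 5, "deaths": 128},
    "2020-07-31": {"cases": 11, "deaths": 182},
    "2020-08-01": {"cases": 11131, "deaths": 182},
}

print(dates_where_field_exceeds_cases(ALAMEDA, "deaths"))
# ['2020-06-24', '2020-06-25', '2020-07-31']
```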

1ec5 commented 4 years ago

#371 fixed some but not all of the issues in https://github.com/covidatlas/li/issues/363#issuecomment-668190268.

TomGoBravo commented 4 years ago

Tested looks good in the "California County Coronavirus Reporting" Google Spreadsheet maintained by Harriet Rowan, but the data I'm fetching from https://coronadatascraper.com/timeseries.csv.zip is still broken for Contra Costa County. Do you think this is due to caching or remaining issues with parsing?

1ec5 commented 4 years ago

Here’s what timeseries-byLocation.json says for Contra Costa County in August:

      "2020-08-01": {
        "cases": 7806,
        "deaths": 121,
        "hospitalized_current": 106,
        "tested": 135408,
        "growthFactor": 1.02
      },
      "2020-08-02": {
        "cases": 7966,
        "deaths": 125,
        "hospitalized_current": 107,
        "tested": 136325,
        "growthFactor": 1.02
      },
      "2020-08-03": {
        "cases": 8033,
        "deaths": 127,
        "hospitalized_current": 100,
        "tested": 136801,
        "growthFactor": 1.01
      },
      "2020-08-04": {
        "cases": 8176,
        "deaths": 131,
        "hospitalized_current": 101,
        "tested": 137460,
        "growthFactor": 1.02
      },
      "2020-08-05": {}

The "tested" value of 137,460 matches what the spreadsheet shows for August 4 in Contra Costa County. The empty object for August 5 might be because the spreadsheet already shows data for some counties on August 5. The scraper only avoids returning a result if no county has reported data on a certain date:

https://github.com/covidatlas/li/blob/e5f764d15a6108eebad6ad1438419e029a34f475/src/shared/sources/us/ca/mercury-news.js#L63-L65
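To make the described behavior concrete, here is a rough Python sketch. It is not the linked JavaScript, and the function name, the `rows` shape (county name mapped to the values reported for one date), and the `skip_empty_counties` flag are all assumptions for illustration. With the flag off, one county with data is enough for every county to get a record, including empty ones.

```python
def records_for_date(rows, skip_empty_counties=False):
    """Build per-county records for a single date.

    Returns None only when no county has reported anything, mirroring the
    behavior described above. With skip_empty_counties=True, counties with
    no values are left out instead of being emitted as empty objects.
    """
    if not any(rows.values()):
        return None  # nobody has reported for this date
    return {
        county: values
        for county, values in rows.items()
        if values or not skip_empty_counties
    }


# August 5 as seen in this thread: San Mateo has data, Contra Costa is empty.
AUG_5 = {
    "San Mateo County": {"cases": 5758, "deaths": 120},
    "Contra Costa County": {},
}

print(records_for_date(AUG_5))
# {'San Mateo County': {'cases': 5758, 'deaths': 120}, 'Contra Costa County': {}}
print(records_for_date(AUG_5, skip_empty_counties=True))
# {'San Mateo County': {'cases': 5758, 'deaths': 120}}
```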

TomGoBravo commented 4 years ago

Apologies for what may have been a false alarm. I agree that cases for Contra Costa County now look good.

1ec5 commented 4 years ago

The COVID Atlas site still shows 15,500 deaths in Santa Clara County and similarly catastrophic spikes across the Bay Area on June 26, as originally reported above:

[Image: santa-clara]

One solution is to stand up alternative scrapers that will be preferred over the Mercury News source, such as #375 for Santa Clara County, #378 for Alameda County, and #379 for Marin County.

1ec5 commented 4 years ago

As a followup to https://github.com/covidatlas/li/issues/363#issuecomment-669635453, San Mateo County and possibly others are showing an explicit 0 cases on recent days for which there’s no data, instead of undefined:

    "2020-08-05": {
      "cases": 5758,
      "deaths": 120,
      "hospitalized_current": 60,
      "tested": 107268,
      "icu_current": 15,
      "growthFactor": 1
    },
    "2020-08-06": {
      "cases": 0,
      "deaths": 0
    },
    "2020-08-07": {
      "cases": 0,
      "deaths": 0
    },
    "2020-08-08": {
      "cases": 0,
      "deaths": 0
    },
    "2020-08-09": {
      "cases": 0,
      "deaths": 0
    }

jzohrab commented 4 years ago

Yep, I don't know why some are coded that way; it's incorrect. Thanks for catching it.
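Until the source is corrected, a consumer of the file could work around the placeholder zeros. The sketch below is hypothetical and not part of li. It assumes that trailing entries whose values are all 0 mean "no data yet" whenever an earlier entry has real values, since a cumulative count cannot return to zero. A series made up entirely of zeros is left untouched because it may be genuine.

```python
def _is_zero_placeholder(entry):
    """True for a non-empty entry whose values are all 0."""
    return bool(entry) and all(value == 0 for value in entry.values())


def drop_trailing_zero_placeholders(series):
    """Return a copy of `series` without trailing all-zero entries."""
    days = sorted(series)
    keep = len(days)
    while keep > 0 and _is_zero_placeholder(series[days[keep - 1]]):
        keep -= 1
    if keep == 0:
        return dict(series)  # every entry is zero: possibly real, keep as-is
    return {day: series[day] for day in days[:keep]}


# San Mateo County values copied from the excerpt above (fields abbreviated).
SAN_MATEO = {
    "2020-08-05": {"cases": 5758, "deaths": 120},
    "2020-08-06": {"cases": 0, "deaths": 0},
    "2020-08-07": {"cases": 0, "deaths": 0},
    "2020-08-08": {"cases": 0, "deaths": 0},
    "2020-08-09": {"cases": 0, "deaths": 0},
}

print(list(drop_trailing_zero_placeholders(SAN_MATEO)))  # ['2020-08-05']
```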