covidatlas / li

Next-generation serverless crawler for COVID-19 data
Apache License 2.0

Marin County cases data is hugely wrong. #358

Closed kengoy closed 4 years ago

kengoy commented 4 years ago

"Marin County" cases data turns hugely wrong from July 24.

  "2020-07-22": {
    "cases": 2398,
    "deaths": 35,
    "tested": 45815,
    "hospitalized": 99,
    "recovered": 1851,
    "icu": 0,
    "growthFactor": 1.03
  },
  "2020-07-23": {
    "cases": 2416,
    "deaths": 35,
    "tested": 46430,
    "hospitalized": 99,
    "recovered": 1916,
    "icu": 0,
    "growthFactor": 1.01
  },
  "2020-07-24": {
    "cases": 4489,
    "deaths": 39,
    "tested": 46906,
    "hospitalized": 99,
    "recovered": 2000,
    "icu": 0,
    "growthFactor": 1.86
  },
  "2020-07-25": {
    "cases": 4489,
    "deaths": 39,
    "tested": 46906,
    "hospitalized": 99,
    "recovered": 2000,
    "icu": 0,
    "growthFactor": 1
  },

FYI: COVID-19 in Marin County: https://coronavirus.marinhhs.org/
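
A rough way to spot this kind of jump automatically is to compare each day's cumulative count with the previous day's. The sketch below assumes that growthFactor is the day-over-day ratio of cases (which matches 4489 / 2416 ≈ 1.86 in the excerpt above). The helper name find_jumps and the 1.5 threshold are made up for illustration and are not part of the project.

```python
# Cumulative case counts copied from the report excerpt above.
cases_by_date = {
    "2020-07-22": 2398,
    "2020-07-23": 2416,
    "2020-07-24": 4489,
    "2020-07-25": 4489,
}


def find_jumps(series, threshold=1.5):
    """Return (date, ratio) pairs where cases grew by more than `threshold`x in one day.

    `threshold` is an arbitrary illustrative cutoff, not a project setting.
    """
    dates = sorted(series)  # ISO date strings sort chronologically
    jumps = []
    for prev, curr in zip(dates, dates[1:]):
        if series[prev] == 0:
            continue  # skip days with no prior cases to avoid dividing by zero
        ratio = series[curr] / series[prev]
        if ratio > threshold:
            jumps.append((curr, round(ratio, 2)))
    return jumps


print(find_jumps(cases_by_date))  # [('2020-07-24', 1.86)]
```

Only 2020-07-24 is flagged; the surrounding days have ratios close to 1.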

jzohrab commented 4 years ago

Hello @kengoy , thanks for the issue.

I believe our number is valid. We're showing cumulative cases over time, while the page you reference gives the current state. In other words, there are 2706 current cases, plus another 2292 earlier cases that have since recovered: 2706 + 2292 = 4998 (as of today).

I've asked our team to verify the interpretation of the data to be sure. Keeping this open for now.


Some notes:

In the timeseries-byLocation.json report, I checked the dateSources for Marin County:

    "dateSources": {
      "2020-01-24..2020-02-25": "jhu-usa",
...
      "2020-03-20": "jhu-usa",
      "2020-03-21..2020-07-31": "us-ca-mercury-news"
    },

So we've been using us-ca-mercury-news as the data source for several months. That code is in src/shared/sources/us/ca/mercury-news.ca, and you'll see it pulls from https://docs.google.com/spreadsheets/d/1CwZA4RPNf_hUrwzNLyGGNHRlh1cwl8vDHwIoae51Hac/gviz/tq?tqx=out:csv&sheet=timeseries.
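
For anyone who wants to repeat this check for a specific date, here is a minimal sketch of looking up the source in a dateSources map. It assumes each key is either a single ISO date or an inclusive "start..end" range, which is inferred from the excerpt above rather than from any documented schema. The function name source_for is hypothetical.

```python
from datetime import date

# Excerpt of the dateSources map quoted above; the elided entries are left out.
date_sources = {
    "2020-01-24..2020-02-25": "jhu-usa",
    "2020-03-20": "jhu-usa",
    "2020-03-21..2020-07-31": "us-ca-mercury-news",
}


def source_for(day, sources):
    """Return the source name whose key covers `day` (an ISO date string), or None."""
    target = date.fromisoformat(day)
    for key, name in sources.items():
        # "a..b" splits into (a, "..", b); a single date yields (a, "", "").
        start, _, end = key.partition("..")
        first = date.fromisoformat(start)
        last = date.fromisoformat(end) if end else first
        if first <= target <= last:
            return name
    return None


print(source_for("2020-07-24", date_sources))  # us-ca-mercury-news
```

Dates that fall in the elided part of the map return None here, because only the quoted entries are included.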

The Google spreadsheet apparently sources its data from https://coronavirus.marinhhs.org/surveillance#today.

As of now that site says

image

They're reporting "cases" = "cases + recovered", which makes sense.

kengo-sony commented 4 years ago

Hello @jzohrab, thank you for checking this and for the detailed explanation. I doubted the number because our site (https://panda.baybrigades.org/) shows a time series chart for San Francisco Bay Area counties based on the dataset scraped by Corona Data Scraper (thanks so, so much!!), and Marin County shows a huge jump on 7/24.

Screen Shot 2020-07-31 at 1 32 44 AM

I am still not quite sure whether the Marin County Public Health site is reporting "cases" as "cases + recovered". I will check.

kengo-sony commented 4 years ago

I think I found the reason. https://coronavirus.marinhhs.org/surveillance#today says: "This information shows how COVID-19 has impacted Marin County community residents since the first case. None of the data in this section include San Quentin inmates." So the Marin County Public Health figure of 2,706 (as of 7/29) does not include San Quentin inmates, who account for 2,168 cases (as of 7/29). Other sites, including NYT, Mercury News, and SF Chronicle, show 4,874 (as of 7/29), which includes the inmates. I guess most of those sites started including the inmates on 7/24, while the Marin County Public Health site continues to exclude them. That's why we see the huge difference. Now it makes sense to me.
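
As a quick sanity check of that explanation, the figures quoted in this comment can be added up directly. This is only the arithmetic from the comment, with made-up variable names:

```python
# Figures quoted in the comment above, all as of 7/29.
county_community_cases = 2706    # Marin County Public Health, excludes San Quentin inmates
san_quentin_inmate_cases = 2168  # San Quentin inmates
aggregator_total = 4874          # NYT, Mercury News, SF Chronicle, etc.

combined = county_community_cases + san_quentin_inmate_cases
print(combined, combined == aggregator_total)  # 4874 True
```

The county figure plus the inmate figure matches the aggregator total exactly.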

Screen Shot 2020-07-29 at 8 57 07 PM

kengo-sony commented 4 years ago

Sorry, I just noticed I was using a different GitHub account. I am @kengoy as well.

jzohrab commented 4 years ago

Cheers @kengo-sony / @kengoy. Thank you very much. Does this look OK to you? Do you think we can close this issue?

kengo-sony commented 4 years ago

@jzohrab Yes, we can close this issue. Thank you. Arigato.

1ec5 commented 4 years ago

#379 adds a scraper for Marin County that relies on Marin Health & Human Services as its source. I think it would be beneficial to use the county’s dataset, even if it differs from aggregators, in part because it allows the time series to be internally consistent.