covidatlas / coronadatascraper

COVID-19 Coronavirus data scraped from government and curated data sources.
https://coronadatascraper.com
BSD 2-Clause "Simplified" License
364 stars 179 forks

PA, USA: Data is incorrect, likely reversed deaths and number of cases #409

Closed cburkins closed 4 years ago

cburkins commented 4 years ago

The PA (Pennsylvania) data now appears to be incorrect: it shows a very low number of cases (about 12). Clicking through to the scraper's data source, my guess is that the scraper (or the website) has swapped the columns for number of cases and number of deaths.

cmcjacob commented 4 years ago
✅ Data scraped!
   - 0 cities
   - 1 states
   - 67 counties
   - 0 countries
ℹ️  Total counts (tracked cases, may contain duplicates):
   - 1703 cases
   - 16441 tested
   - 0 recovered
   - 32 deaths
   - 0 active

can't reproduce locally?

cburkins commented 4 years ago

Thanks for the quick response!

This is the PA, USA data object at https://coronadatascraper.com/#timeseries-byLocation.json

It shows the discrepancy I'm describing. Is that helpful?

image

lazd commented 4 years ago

Yeah I saw the same, that looks totally wrong...

jzohrab commented 4 years ago

I just tested this locally (yarn start --location "PA, USA"), and logged the values the scraper was getting. Output:

{ county: 'Adams County', cases: 7, deaths: 0 }
{ county: 'Allegheny County', cases: 133, deaths: 2 }
{ county: 'Armstrong County', cases: 1, deaths: 0 }
... [snip] ...
{ county: 'Wayne County', cases: 6, deaths: 0 }
{ county: 'Westmoreland County', cases: 24, deaths: 0 }
{ county: 'York County', cases: 21, deaths: 0 }

Those numbers match the data currently shown on https://www.health.pa.gov/topics/disease/coronavirus/Pages/Cases.aspx.

image

The numbers also match the summary table at the top

Negative   Positive   Deaths
16,441     1,687      16

The numbers also match news headlines found with search "pennsylvania covid deaths".

In short, it looks ok to me, based on what's out there.

@cburkins: What were you expecting, what feels off about these numbers to you?

jzohrab commented 4 years ago

Interesting, they also have an Archive page which lists several dates, and matches the 6/7/11 numbers @cburkins pointed out in https://github.com/lazd/coronadatascraper/issues/409#issuecomment-604758785:

https://www.health.pa.gov/topics/disease/coronavirus/Pages/Archives.aspx.

cburkins commented 4 years ago

Thanks all for looking at this. I think most of the PA county-level data is correct. It's the state-level roll-up (the PA total) that looks off.
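One way to check a roll-up like this is to sum the county records and compare the result with the state-level figure. The sketch below is only an illustration: `rollUp` is a hypothetical helper, not a function from the scraper, and the input is limited to the six county rows shown in the log above, so the result is a partial sum rather than the real PA total.

```javascript
// Hypothetical helper (not part of coronadatascraper): sum county-level
// records into a single total for cases and deaths.
function rollUp(records) {
  return records.reduce(
    (total, { cases = 0, deaths = 0 }) => ({
      cases: total.cases + cases,
      deaths: total.deaths + deaths,
    }),
    { cases: 0, deaths: 0 }
  );
}

// Only the six rows visible in the log above; the snipped rows are not filled in.
const loggedCounties = [
  { county: 'Adams County', cases: 7, deaths: 0 },
  { county: 'Allegheny County', cases: 133, deaths: 2 },
  { county: 'Armstrong County', cases: 1, deaths: 0 },
  { county: 'Wayne County', cases: 6, deaths: 0 },
  { county: 'Westmoreland County', cases: 24, deaths: 0 },
  { county: 'York County', cases: 21, deaths: 0 },
];

const partialTotal = rollUp(loggedCounties);
console.log(partialTotal);
```

With all 67 counties as input, a total that differs sharply from the state-level record would point at the aggregation step rather than at the per-county scrape.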

jzohrab commented 4 years ago

PA data is out of whack again because the page https://www.health.pa.gov/topics/disease/coronavirus/Pages/Cases.aspx now includes some age ranges with percentages, which the scraper is picking up as if they were county rows:

Age Range   Percent of Cases
...         ...
50-64       28%
65+         18%

data.json:

  {
    "state": "PA",
    "country": "USA",
    "aggregate": "county",
    "url": "https://www.health.pa.gov/topics/disease/coronavirus/Pages/Cases.aspx",
    "county": "65+ County",
    "cases": 18,
    "deaths": 18,
    "rating": 0.47058823529411764
  },

The page layout changed today; I'll work on this now.
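A guard against this kind of breakage could look roughly like the sketch below. It is not the fix that was actually committed: `looksLikeCountyRow` is a hypothetical helper, and the sample rows are illustrative (county names and counts taken from the log earlier in this thread, age-range rows from the table above). The idea is simply to reject rows whose name contains digits or whose count is not a plain integer.

```javascript
// Hypothetical helper, not the project's actual fix: decide whether a scraped
// table row looks like a county row rather than an age-range row.
function looksLikeCountyRow(name, casesText) {
  // The page's cells can carry zero-width spaces, so strip them first.
  const cleanName = name.replace(/\u200B/g, '').trim();
  const cleanCases = casesText.replace(/\u200B/g, '').trim();
  // County names contain no digits; "50-64" and "65+" do.
  if (cleanName === '' || /\d/.test(cleanName)) return false;
  // Counts are plain integers, optionally with thousands separators; "28%" is not.
  return /^(\d{1,3}(,\d{3})*|\d+)$/.test(cleanCases);
}

// Illustrative rows only: two county rows and two age-range rows.
const rows = [
  ['Adams', '7'],
  ['Allegheny', '133'],
  ['\u200B50-64', '\u200B28%'],
  ['\u200B65+', '\u200B18%'],
];

const countyRows = rows.filter(([name, cases]) => looksLikeCountyRow(name, cases));
console.log(countyRows.map(([name]) => name));
```

Selecting the county table by its header text instead of filtering rows afterwards would probably be more robust, but that depends on page markup that isn't shown here.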

cburkins commented 4 years ago

As a PA resident, many thanks to you!

(Sent by email in reply to JZ's comment above, dated Mar 27, 2020, 7:25 PM.)

lazd commented 4 years ago

@jzohrab @cburkins was this fixed by #452? Check the data we reported last night and close this issue if it's looking proper.

dmedwards commented 4 years ago

I'm not sure if this is the same issue, but the "like JHU" data also appears to have figures that are too low for the past week or so.

,,PA,USA,41.12951166463159,-77.60961308037935,12801989,https://www.health.pa.gov/topics/disease/coronavirus/Pages/Cases.aspx,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,12,16,,41,47,63,71,96,133,185,268,2,2,6,7,11,16,22
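A row like this can be flagged automatically if the trailing fields are cumulative daily counts, which is an assumption here based on the JHU-style layout: a cumulative series should never decrease. The sketch below uses two hypothetical helpers, `parseSeries` and `findDrops`, on the tail of the row above (the leading zeros are omitted for brevity).

```javascript
// Hypothetical helpers for illustration; not part of coronadatascraper.

// Turn the comma-separated tail of a row into numbers, with null for empty cells.
function parseSeries(csvTail) {
  return csvTail
    .split(',')
    .map((cell) => (cell.trim() === '' ? null : Number(cell)));
}

// Report every point where a (supposedly cumulative) series goes down.
function findDrops(series) {
  const drops = [];
  let previous = null;
  series.forEach((value, index) => {
    if (value === null) return; // skip missing days
    if (previous !== null && value < previous) {
      drops.push({ index, previous, value });
    }
    previous = value;
  });
  return drops;
}

// Tail of the PA row quoted above.
const tail = '12,16,,41,47,63,71,96,133,185,268,2,2,6,7,11,16,22';
const drops = findDrops(parseSeries(tail));
console.log(drops);
```

For this row the check reports a single drop, from 268 to 2, which is where the figures become too low.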

cburkins commented 4 years ago

Hmm, I didn't pull the repo and deploy it myself; I'm using the data available at https://coronadatascraper.com/#timeseries-byLocation.json

That data still shows incorrect values. Perhaps it will be correct tomorrow, when it pulls the new data for today?

image

lazd commented 4 years ago

I fixed this last night in https://github.com/lazd/coronadatascraper/commit/392290719e7cb414f890b6a637fb29fa8327ba67. It looks good now.

cburkins commented 4 years ago

Agreed, PA data looks good now!