covidatlas / coronadatascraper

COVID-19 Coronavirus data scraped from government and curated data sources.
https://coronadatascraper.com
BSD 2-Clause "Simplified" License

Replace San Francisco scraper with official API #1022

Closed 1ec5 closed 4 years ago

1ec5 commented 4 years ago

Summary

Replaced the scraper for the City and County of San Francisco with a new implementation that joins and pivots three datasets provided by the city’s DataSF portal in JSON format through an official API.

Previously, this project scraped the Department of Public Health’s COVID-19 landing page for the current statistics. The page didn’t provide any historical data, so this project would cache past days’ values of the current case total, and clients would tend to treat those values as the actual infection totals as of those days.

Now the project pulls in the full time series as revised by the department each day, which is optimal for charting infection rates. It also adds hospitalization and test counts.

Fixes #1011.

Changes

This implementation pivots each dataset on the date field, summing case counts and distinguishing between case count and death toll for the first dataset.
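
As a rough illustration of the pivot described above, here is a minimal sketch. The field names (`date`, `case_disposition`, `case_count`) and the sample values are assumptions made for the example, not taken from the actual dataset or from this PR's code.

```javascript
// Illustrative records only; field names and values are assumptions.
const sampleRecords = [
  { date: '2020-03-17T00:00:00.000', case_disposition: 'Confirmed', case_count: '5' },
  { date: '2020-03-17T00:00:00.000', case_disposition: 'Confirmed', case_count: '6' },
  { date: '2020-03-17T00:00:00.000', case_disposition: 'Death', case_count: '1' },
  { date: '2020-03-18T00:00:00.000', case_disposition: 'Confirmed', case_count: '14' },
];

// Group records by date, summing counts into separate "cases" and "deaths" buckets.
function pivotByDate(records) {
  const byDate = {};
  for (const record of records) {
    const day = record.date.slice(0, 10); // keep only YYYY-MM-DD
    if (!byDate[day]) {
      byDate[day] = { cases: 0, deaths: 0 };
    }
    const count = Number.parseInt(record.case_count, 10) || 0;
    if (record.case_disposition === 'Death') {
      byDate[day].deaths += count;
    } else {
      byDate[day].cases += count;
    }
  }
  return byDate;
}

const pivoted = pivotByDate(sampleRecords);
console.log(pivoted);
```

Keying the result by the date string keeps the pivot independent of the order in which the API returns records.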

Additional notes

jzohrab commented 4 years ago

Super work @1ec5, thank you! I'll look into this one today. Cheers!

jzohrab commented 4 years ago

Today got away from me; I'll schedule time to look at this tomorrow.

1ec5 commented 4 years ago

There’s something wrong with this scraper, but I can’t put my finger on it. It’s coming up with inaccurate case totals because of what seems to be a stale cache of the API response. For example, on my machine, the cached copy of tvq9-ec9w.json still has an entry for a new confirmed case on March 7, but the current dataset no longer includes any new cases for March 7. (That case probably got moved to a different date.)

jzohrab commented 4 years ago

Hm, getting inaccurate numbers, I think. After running this, I get

   - 0 cities
   - 0 states
   - 1 counties
   - 0 countries
ℹ️  Total counts (tracked cases, may contain duplicates):
   - 968 cases
   - 35266 tested
   - 0 recovered
   - 0 deaths
   - 0 active

0 deaths, but https://data.sfgov.org/resource/tvq9-ec9w.json shows deaths. Looking into it.

jzohrab commented 4 years ago

A few issues:

┌────────────┬───────┬────────┐
│  (index)   │ cases │ deaths │
├────────────┼───────┼────────┤
│ 2020-03-05 │   2   │        │
│ 2020-03-06 │   6   │        │
│ 2020-03-08 │  11   │        │
...
│ 2020-03-16 │  37   │        │
│ 2020-03-17 │  48   │   1    │
│ 2020-03-18 │  62   │        │
...
│ 2020-03-23 │  165  │        │
│ 2020-03-24 │  194  │   2    │
│ 2020-03-25 │  243  │        │
│ 2020-03-26 │  279  │   5    │
│ 2020-03-27 │  315  │   7    │
│ 2020-03-28 │  344  │        │
│ 2020-03-29 │  381  │        │
│ 2020-03-30 │  415  │   8    │
│ 2020-03-31 │  442  │   9    │
│ 2020-04-01 │  481  │   12   │
│ 2020-04-02 │  542  │        │
└────────────┴───────┴────────┘

So that would look like deaths = '' on 04-02, but it should be 12.
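
One possible way to avoid the blank value is to carry the last known running total forward. This is only a sketch: `fillForward` is a hypothetical helper, not the change pushed to this branch, and it assumes the rows are already sorted by date and that a blank cell arrives as `undefined` or an empty string.

```javascript
// Rows copied from the table above; `undefined` stands in for a blank cell.
const rows = [
  { date: '2020-03-31', cases: 442, deaths: 9 },
  { date: '2020-04-01', cases: 481, deaths: 12 },
  { date: '2020-04-02', cases: 542, deaths: undefined },
];

// Hypothetical helper: for each field, reuse the most recent known value
// when the current row has none (0 if nothing has been seen yet).
function fillForward(sortedRows, fields) {
  const last = {};
  return sortedRows.map(row => {
    const filled = { ...row };
    for (const field of fields) {
      if (filled[field] === undefined || filled[field] === '') {
        filled[field] = last[field] !== undefined ? last[field] : 0;
      } else {
        last[field] = filled[field];
      }
    }
    return filled;
  });
}

const filledRows = fillForward(rows, ['cases', 'deaths']);
console.log(filledRows[2]); // the 04-02 row now carries deaths: 12
```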

Working on some changes, which I'll push here before merging. Thanks!

1ec5 commented 4 years ago

"reduceRight" takes things off of the end of the array. If the data is in date order, that would result in the running totals being backwards (starting off at 0 as at latest date, and then increasing). "reduce" is correct.

When I had originally fetched the JSON files, they were coming in in reverse chronological order. But you’re right, it’s totally unsorted now. They must’ve created the dataset with some other data that was already in reverse chronological order but then continued to add records without sorting them. Thanks for looking into this issue!
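
Since the order of the records apparently can't be relied on, one option is to sort explicitly before accumulating. A minimal sketch, with made-up daily counts and a hypothetical `runningTotals` helper (not the code in this PR):

```javascript
// Illustrative daily counts, deliberately out of order (not real data).
const daily = [
  { date: '2020-03-07', cases: 3 },
  { date: '2020-03-05', cases: 2 },
  { date: '2020-03-06', cases: 4 },
];

// Sort a copy by ISO date string, then accumulate oldest-first with reduce.
function runningTotals(dailyCounts) {
  const sorted = [...dailyCounts].sort((a, b) => a.date.localeCompare(b.date));
  const result = sorted.reduce(
    (acc, { date, cases }) => {
      acc.total += cases;
      acc.rows.push({ date, cases: acc.total });
      return acc;
    },
    { total: 0, rows: [] }
  );
  return result.rows;
}

const totals = runningTotals(daily);
console.log(totals);
```

With an explicit sort, `reduce` gives the same result whether the API returns records in chronological, reverse chronological, or arbitrary order.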

jzohrab commented 4 years ago

Ah interesting about the sorting changes. Cheers, still working on it!

jzohrab commented 4 years ago

I opened a PR to this branch: https://github.com/1ec5/coronadatascraper/pull/2

Since, as you said, the source data changed, I think what I suggest is foolproof. :-)

jzohrab commented 4 years ago

Hi @1ec5 - are you good with my PR? If so, can you merge it, and then we can merge this one?

jzohrab commented 4 years ago

Closed, replaced by https://github.com/covidatlas/coronadatascraper/pull/1044.

🎉