Closed 1ec5 closed 4 years ago
Super work @1ec5, thank you! I'll look into this one today. Cheers!
Today got away from me, I'll schedule time to look at this tomorrow.
There’s something wrong with this scraper, but I can’t put my finger on it. It’s coming up with inaccurate case totals because of what seems to be a stale cache of the API response. For example, on my machine, the cached copy of tvq9-ec9w.json still has an entry for a new confirmed case on March 7, but the current dataset no longer includes any new cases for March 7. (That case probably got moved to a different date.)
Hm, getting inaccurate numbers, I think. After running this, I get:
- 0 cities
- 0 states
- 1 counties
- 0 countries
ℹ️ Total counts (tracked cases, may contain duplicates):
- 968 cases
- 35266 tested
- 0 recovered
- 0 deaths
- 0 active
0 deaths, but https://data.sfgov.org/resource/tvq9-ec9w.json shows deaths. Looking into it.
A few issues:
┌────────────┬───────┬────────┐
│ (index) │ cases │ deaths │
├────────────┼───────┼────────┤
│ 2020-03-05 │ 2 │ │
│ 2020-03-06 │ 6 │ │
│ 2020-03-08 │ 11 │ │
...
│ 2020-03-16 │ 37 │ │
│ 2020-03-17 │ 48 │ 1 │
│ 2020-03-18 │ 62 │ │
...
│ 2020-03-23 │ 165 │ │
│ 2020-03-24 │ 194 │ 2 │
│ 2020-03-25 │ 243 │ │
│ 2020-03-26 │ 279 │ 5 │
│ 2020-03-27 │ 315 │ 7 │
│ 2020-03-28 │ 344 │ │
│ 2020-03-29 │ 381 │ │
│ 2020-03-30 │ 415 │ 8 │
│ 2020-03-31 │ 442 │ 9 │
│ 2020-04-01 │ 481 │ 12 │
│ 2020-04-02 │ 542 │ │
So that would look like deaths = '' on 04-02, but it should be 12.
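One way to avoid the blank `deaths` cell is to carry the last known cumulative value forward to dates that have no record of their own. This is a minimal sketch, not the scraper's actual code: the `fillForward` helper and the row shape (`date`, `cases`, `deaths`) are hypothetical, and the sample rows are copied from the table above.

```javascript
// Hypothetical row shape: { date: 'YYYY-MM-DD', cases?: number, deaths?: number }.
// Rows are assumed to be sorted by date already and to hold cumulative totals.
function fillForward(rows, fields = ['cases', 'deaths']) {
  // Last value seen for each field; starts at 0 before any record appears.
  const last = Object.fromEntries(fields.map(f => [f, 0]));
  return rows.map(row => {
    const filled = { ...row };
    for (const f of fields) {
      if (typeof row[f] === 'number') last[f] = row[f];
      filled[f] = last[f];
    }
    return filled;
  });
}

const sampleRows = [
  { date: '2020-03-31', cases: 442, deaths: 9 },
  { date: '2020-04-01', cases: 481, deaths: 12 },
  { date: '2020-04-02', cases: 542 } // no death record for this date
];

const filledRows = fillForward(sampleRows);
console.log(filledRows[2].deaths); // 12
```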
Working on some changes, which I'll push here before merging. Thanks!
"reduceRight" walks the array from the end to the start. If the data is in date order, that makes the running totals come out backwards: they start from 0 at the latest date and increase toward the earliest. "reduce" is correct.
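To illustrate the difference, here is a small sketch. The `runningTotals` helper is hypothetical, and the daily counts are made up so that the forward totals match the first rows of the table above (2, 6, 11).

```javascript
// Illustrative daily new-case counts in ascending date order.
const dailyNew = [
  { date: '2020-03-05', cases: 2 },
  { date: '2020-03-06', cases: 4 },
  { date: '2020-03-08', cases: 5 }
];

// Accumulate a running total with either 'reduce' or 'reduceRight'.
function runningTotals(records, method) {
  const totals = {};
  records[method]((sum, r) => {
    const next = sum + r.cases;
    totals[r.date] = next;
    return next;
  }, 0);
  return totals;
}

const forward = runningTotals(dailyNew, 'reduce');
const backward = runningTotals(dailyNew, 'reduceRight');

console.log(forward['2020-03-05'], forward['2020-03-08']);   // 2 11
console.log(backward['2020-03-05'], backward['2020-03-08']); // 11 5
```

With `reduceRight`, the earliest date ends up holding the grand total, which is the backwards result described above.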
When I had originally fetched the JSON files, they were coming in in reverse chronological order. But you’re right, it’s totally unsorted now. They must’ve created the dataset with some other data that was already in reverse chronological order but then continued to add records without sorting them. Thanks for looking into this issue!
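Since the order of the records can't be relied on, one defensive option is to sort by date before accumulating. This is only a sketch under assumptions: `cumulativeByDate` is a hypothetical helper, and it assumes each record has an ISO-style `date` string (which sorts correctly as plain text) and a numeric `cases` field.

```javascript
// Sort a copy of the records by date, then build cumulative totals.
function cumulativeByDate(records) {
  // ISO-style date strings compare correctly with localeCompare.
  const sorted = [...records].sort((a, b) => a.date.localeCompare(b.date));
  let total = 0;
  return sorted.map(r => {
    total += r.cases;
    return { date: r.date, total };
  });
}

// Deliberately unsorted, illustrative input.
const unsorted = [
  { date: '2020-03-08', cases: 5 },
  { date: '2020-03-05', cases: 2 },
  { date: '2020-03-06', cases: 4 }
];

const cumulative = cumulativeByDate(unsorted);
console.log(cumulative.map(r => r.total)); // [ 2, 6, 11 ]
```

Copying the array before sorting leaves the fetched data untouched, so the result is the same whatever order the API returns.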
Ah interesting about the sorting changes. Cheers, still working on it!
I opened a PR to this branch: https://github.com/1ec5/coronadatascraper/pull/2
As you said the source data changed, I think what I suggest is foolproof. :-)
Hi @1ec5 - are you good with my PR? If so, can you merge it, and then we can merge this one?
Closed, replaced by https://github.com/covidatlas/coronadatascraper/pull/1044.
:tada:
Summary
Replaced the scraper for the City and County of San Francisco with a new implementation that joins and pivots three datasets provided by the city’s DataSF portal in JSON format through an official API:
Previously, this project scraped the Department of Public Health’s COVID-19 landing page for the current statistics. The page didn’t provide any historical data, so this project would cache past days’ values of the current case total, and clients would tend to treat those values as the actual infection totals as of those days.
Now the project pulls in the full time series as revised by the department each day, which is optimal for charting infection rates. It also adds hospitalization and test counts.
Fixes #1011.
Changes
This implementation pivots each dataset on the date field, summing case counts and distinguishing between case count and death toll for the first dataset.
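The pivot for the first dataset could look roughly like the sketch below. It is not the code in this PR: `pivotByDate` is a hypothetical helper, and the field names (`date`, `case_disposition`, `case_count`) and the `'Death'` value are assumptions about the dataset's schema that should be checked against the actual API response.

```javascript
// Group records by calendar day, summing counts into cases or deaths.
function pivotByDate(records) {
  const byDate = {};
  for (const r of records) {
    // Keep only the YYYY-MM-DD part of a timestamp such as 2020-03-17T00:00:00.000.
    const day = r.date.slice(0, 10);
    const entry = byDate[day] || (byDate[day] = { cases: 0, deaths: 0 });
    // Counts may arrive as strings in JSON, so convert defensively.
    const count = Number(r.case_count) || 0;
    if (r.case_disposition === 'Death') entry.deaths += count;
    else entry.cases += count;
  }
  return byDate;
}

// Illustrative records, not real data.
const sampleRecords = [
  { date: '2020-03-17T00:00:00.000', case_disposition: 'Confirmed', case_count: '3' },
  { date: '2020-03-17T00:00:00.000', case_disposition: 'Confirmed', case_count: '2' },
  { date: '2020-03-17T00:00:00.000', case_disposition: 'Death', case_count: '1' },
  { date: '2020-03-18T00:00:00.000', case_disposition: 'Confirmed', case_count: '4' }
];

const pivoted = pivotByDate(sampleRecords);
console.log(pivoted['2020-03-17']); // { cases: 5, deaths: 1 }
```

The per-day sums produced this way are daily counts; turning them into a time series of totals still needs a date-ordered running sum.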
Additional notes