covidatlas / coronadatascraper

COVID-19 Coronavirus data scraped from government and curated data sources.
https://coronadatascraper.com
BSD 2-Clause "Simplified" License

Scrape Santa Clara County Power BI dashboards #1026

Closed 1ec5 closed 4 years ago

1ec5 commented 4 years ago

Summary

Replaced the deprecated scraper for Santa Clara County, California, with a new implementation that scrapes the county public health department’s Power BI dashboards.

The previous scraper read plain-text figures from the dashboard website, but it has been nonfunctional since the department moved to a series of Power BI dashboards.

The new scraper makes the same kind of POST requests that the Power BI dashboards make (reduced to the bare minimum) and also fetches the UI to extract the date and the count of undated cases.

Fixes #965.

Changes

The changes include some intimidating POST request bodies and work with what might appear to be obfuscated response JSON. However, the structure of the requests and responses should be reasonably stable and consistent across Power BI dashboards. Generally, only the queries differ between dashboards. The request bodies can’t be simplified any further, because Power BI refuses to serve any request that doesn’t come with a specific payload.

Additional notes

The department only provides a time series for case counts and a limited time series for tests (which this scraper is ignoring for now). For the death toll and hospitalized patient count, the department provides only the current count. So the scraper returns the death toll, hospitalized patient count, and test count only when the requested date is the date the dashboards were last updated; for any other date, it reports only the case count from the time series.
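To make the rule above concrete, here is a minimal sketch. It is not the code in this PR: `resultForDate`, the shape of `timeSeries` and `current`, and the numbers are all assumptions chosen for illustration.

```javascript
// Hypothetical helper (not the scraper's actual code).
// timeSeries: { 'YYYY-MM-DD': cumulativeCases }   (assumed shape)
// current:    { updatedDate, deaths, hospitalized, tested }   (assumed shape)
function resultForDate(requestedDate, timeSeries, current) {
  const result = {};

  // The case count is the only figure available for every date in the series.
  if (Object.prototype.hasOwnProperty.call(timeSeries, requestedDate)) {
    result.cases = timeSeries[requestedDate];
  }

  // Current-only figures are attached only for the dashboards' update date.
  if (requestedDate === current.updatedDate) {
    result.deaths = current.deaths;
    result.hospitalized = current.hospitalized;
    result.tested = current.tested;
  }

  return result;
}

// Illustrative, made-up numbers.
const exampleSeries = { '2020-05-20': 10, '2020-05-21': 12 };
const exampleCurrent = { updatedDate: '2020-05-21', deaths: 1, hospitalized: 2, tested: 30 };

console.log(resultForDate('2020-05-20', exampleSeries, exampleCurrent));
// { cases: 10 }
console.log(resultForDate('2020-05-21', exampleSeries, exampleCurrent));
// { cases: 12, deaths: 1, hospitalized: 2, tested: 30 }
```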

Additionally, the case count for the current date excludes undated cases when running timeseries, to ensure an accurate trend line for charting purposes, but includes undated cases when running other commands like start, for current comparisons with other jurisdictions.
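Roughly, the handling of undated cases on the current date looks like the sketch below. The function name and the `isTimeseries` flag are assumptions for illustration; how the real scraper detects a timeseries run is not shown here.

```javascript
// Hypothetical helper (not the scraper's actual code).
// datedCases:   cumulative cases with a known date, as of the current date
// undatedCases: cases the dashboard reports without a date
// isTimeseries: assumed flag, true when generating a time series
function casesForCurrentDate(datedCases, undatedCases, isTimeseries) {
  // A time series keeps only dated cases so the trend line is not distorted
  // by a lump of undated cases landing on the last day.
  if (isTimeseries) {
    return datedCases;
  }
  // Other commands want the full current total for cross-jurisdiction comparison.
  return datedCases + undatedCases;
}

// Illustrative, made-up numbers.
console.log(casesForCurrentDate(100, 5, true));  // 100
console.log(casesForCurrentDate(100, 5, false)); // 105
```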

jzohrab commented 4 years ago

Super, thanks @1ec5 !

This fails a few tests (/US/CA/santa-clara-county.js fetch coding conventions (line "fetch.json(this, '${this._urls.query}?newCases', 'newCases', false, {")) but that test is rather bogus -- I know, I wrote it :-) . Those tests were needed when I was writing cache migration, b/c I was regexing code, and likely we can get around the failures with a small change. I'll create a fix in a new branch and PR to this.

1ec5 commented 4 years ago

Ah, makes sense. Feel free to push directly to my branch, if that’s more convenient for you.

1ec5 commented 4 years ago

Note that the caching issue I’m seeing in https://github.com/covidatlas/coronadatascraper/pull/1022#issuecomment-632813012 may also apply here, or possibly to any scraper whose underlying time series data gets updated retroactively.

jzohrab commented 4 years ago

Ha, our prettify code actually ended up changing the code back to "breaking format" ... I'll disable those tests in this branch, they're not really needed anymore.

jzohrab commented 4 years ago

If I'm reading this correctly, it feels like we should do something slightly different for the testing.

@1ec5 , I'll make a PR to this branch from a separate one in my fork, because I'm not sure if what I'm suggesting is correct.

jzohrab commented 4 years ago

https://github.com/1ec5/coronadatascraper/pull/1 opened; @1ec5 please review. I believe it's clearer, and with those edits I was able to follow what you mentioned in the Additional Notes. Cheers and thanks!

jzohrab commented 4 years ago

(of course, my PR can go in at any time, if we wanted to merge your code as-is :-) )

1ec5 commented 4 years ago

I went ahead and merged 1ec5#1 – looks good to me.

jzohrab commented 4 years ago

Hi @1ec5 , your question in my PR to here:

> If I understand correctly, this means that, if a past date is requested, the scraper would push currentResult with nothing other than the date, in which case it would lose out to the slightly more detailed result coming out of the time series data.

Right, it would just push that empty record. Perhaps we should add one final check to see if any fields were added. I didn't check the data, and actually couldn't quite follow what you meant with https://github.com/covidatlas/coronadatascraper/pull/1026/files#diff-54475cca719eeb2badaf7430a95aaba9R122.
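Something like the following is what I mean by a final check. It is only a sketch: `hasDataFields` and `pushIfNonEmpty` are hypothetical names, and it assumes `date` is the only field that doesn't count as data.

```javascript
// Hypothetical helpers (not code from this PR).
// Returns true if the record carries at least one defined field besides the ignored keys.
function hasDataFields(record, ignoredKeys = ['date']) {
  return Object.keys(record).some(
    key => !ignoredKeys.includes(key) && record[key] !== undefined
  );
}

// Pushes the record only when it has something beyond the date.
function pushIfNonEmpty(results, record) {
  if (hasDataFields(record)) {
    results.push(record);
  }
  return results;
}

// Illustrative, made-up records.
const exampleResults = [];
pushIfNonEmpty(exampleResults, { date: '2020-05-20' });            // skipped
pushIfNonEmpty(exampleResults, { date: '2020-05-21', deaths: 1 }); // kept
console.log(exampleResults.length); // 1
```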

1ec5 commented 4 years ago

> I didn't check the data, and actually couldn't quite follow what you meant with https://github.com/covidatlas/coronadatascraper/pull/1026/files#diff-54475cca719eeb2badaf7430a95aaba9R122.

I put that bit in originally before devising a way to exclude undated cases from the latest day when generating a time series:

https://github.com/covidatlas/coronadatascraper/blob/92132a5eb406ad73442f900094f08e56b888530c/src/shared/scrapers/US/CA/santa-clara-county.js#L155-L156

That’s how the case count for the latest day in the time series was calculated anyway, so the cases-only result probably isn’t needed.

jzohrab commented 4 years ago

Thank you again @1ec5 for this work! I ran it again just now and got the below, which matches what is showing up on the tables.

✅ Data scraped!
   - 0 cities
   - 0 states
   - 1 counties
   - 0 countries
ℹ️  Total counts (tracked cases, may contain duplicates):
   - 2731 cases
   - 73486 tested
   - 0 recovered
   - 141 deaths
   - 0 active

jzohrab commented 4 years ago

Merged, super!