covidatlas / li

Next-generation serverless crawler for COVID-19 data
Apache License 2.0

Santa Clara County, California #375

Closed by 1ec5 4 years ago

1ec5 commented 4 years ago

Added a source for Santa Clara County, California, that uses the following datasets from the county public health department via the county’s open data portal:

The case count dataset lags a few days behind the public health department’s Power BI dashboard. Not only is it missing the latest few days, which the dashboard deemphasizes as being preliminary, but it also doesn’t reflect the latest revisions to scores of past dates. For example, this scraper reports 9,655 cases on July 25. That was what the county attributed to July 25 as of July 31, but the number has since been revised upward to 10,044. Still, it’s more accurate than the 8,833 that the Mercury News scraper comes up with, based on what the county reported for July 25 as of July 25.
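To put those three July 25 figures side by side, here is a quick sketch in plain JavaScript. The variable and label names are made up for illustration and are not part of the scraper; only the three counts come from the comparison above.

```javascript
// Cases attributed to July 25, as quoted above. Labels are illustrative only.
const latestRevision = 10044; // county's later revised figure
const snapshots = {
  openDataAsOfJuly31: 9655, // what this scraper reports
  reportedOnJuly25: 8833,   // what the Mercury News scraper comes up with
};

// Shortfall of a snapshot relative to the latest revision,
// as an absolute count and as a percentage rounded to one decimal place.
function shortfall(reported, revised) {
  const missing = revised - reported;
  const percent = Number(((missing / revised) * 100).toFixed(1));
  return { missing, percent };
}

const gaps = Object.fromEntries(
  Object.entries(snapshots).map(([label, count]) => [
    label,
    shortfall(count, latestRevision),
  ])
);

console.log(gaps);
```

By this measure the open data figure is 389 cases (about 3.9%) short of the revised count, while the same-day figure is 1,211 cases (about 12.1%) short.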

covidatlas/coronadatascraper#1026 scraped the dashboard but was more fragile for that reason. Perhaps we could offer both the open data portal and the Power BI dashboard as complementary datasets: one more up-to-date, the other more durable.

This PR also contains a small correction to the sample source.

1ec5 commented 4 years ago

Santa Clara County also provides breakdowns by city and by ZIP code, but I’m unsure how to format the scraper’s return value. These breakdowns are only snapshots in time, not time series.

jzohrab commented 4 years ago

Thanks very much!

Re "Perhaps we could offer both the open data portal and the Power BI dashboard as complementary datasets: one more up-to-date, the other more durable." -- yes, one thing I'm considering for v2 reports is that we just give consumers all the data we have, as-is, in addition to providing some combined data sources like we do right now in v1. Consumers can choose if/how they want to combine everything.

We also have the idea of "multivalent data" in this code, so when we report we can combine different data sources. What this means is that we can keep the individual sources much simpler: we don't have to "join" the data sources ourselves, as you've done here (and as I've done in the past).
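As a rough illustration of combining sources at report time, here is a minimal sketch. It is not how Li actually implements multivalent data: the record shape, the `combineByPriority` name, and the "first source wins" rule are all assumptions for the sake of the example, and the numbers are invented.

```javascript
// Hypothetical record shape: { date: 'YYYY-MM-DD', cases: number }.
// Neither the shape nor the function name comes from the Li codebase.
function combineByPriority(sourcesInPriorityOrder) {
  const byDate = new Map();
  for (const { name, records } of sourcesInPriorityOrder) {
    for (const record of records) {
      // Keep the first value seen for a date, i.e. the higher-priority source.
      if (!byDate.has(record.date)) {
        byDate.set(record.date, { ...record, source: name });
      }
    }
  }
  // Return one time series, sorted by ISO date string.
  return [...byDate.values()].sort((a, b) => a.date.localeCompare(b.date));
}

// Made-up numbers, purely to show the behavior.
const openData = {
  name: 'open-data-portal',
  records: [
    { date: '2020-07-24', cases: 100 },
    { date: '2020-07-25', cases: 110 },
  ],
};
const dashboard = {
  name: 'power-bi-dashboard',
  records: [
    { date: '2020-07-25', cases: 115 },
    { date: '2020-07-26', cases: 120 },
  ],
};

const combined = combineByPriority([openData, dashboard]);
console.log(combined);
```

Each source stays a plain list of dated records, and the preference between them lives in one place: swapping the order of the array passed to `combineByPriority` would favor the dashboard on overlapping dates instead.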

FYI, there is a "sources/_lib/timeseries-filter.js" module that you might find useful in the future. You don't need to change this unless you want to. 👍

I don't have the time or data bandwidth to validate this, so I'll assume you've taken care of that. :-D

Cheers, thanks!