act-now-coalition / covid-data-model

Data backend providing computed data for the graphs displayed at https://covidactnow.org

Migrate NYT and HHS sources from covid-data-public to can-scrapers #1218

Closed · smcclure17 closed this 2 years ago

smcclure17 commented 2 years ago

Snapshot 299 was generated off this branch and does not appear to have caused any unwanted changes.

I think the test_pyseir_end_to_end_idaho test is failing because the Idaho case data is blocked, which means there's no data to calculate Rt from (it passes if I substitute a different FIPS, like Los Angeles, 06037).
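A quick way to sanity-check that diagnosis is to count the rows available per FIPS in the combined timeseries. This is just a sketch, not the real test code; the Parquet path and the `fips` column name are assumptions, and 16001 (Ada County) stands in for whichever Idaho FIPS the test uses:

```python
import pandas as pd

# Hypothetical path to the combined timeseries the test consumes.
ts = pd.read_parquet("data/timeseries.parquet")

# If the Idaho slice comes back empty, there is nothing to fit Rt against,
# while Los Angeles (06037) still has data -- matching the pass/fail behavior.
for fips in ("16001", "06037"):
    print(fips, len(ts[ts["fips"] == fips]))
```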

smcclure17 commented 2 years ago

Yeah, good points.

This is far from a good solution, but in the immediate term, how would you feel about a plan to:

In theory, this gets a snapshot out at roughly the same time, but it's fragile: if anything runs late or takes longer than expected, the whole thing fails.

As a side note, almost all of the time taken by the large scrapers (like the NYT one) is spent in the insert step (the put() task). I haven't looked at that code much, but I wonder if there's a way we could speed it up.
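For reference, if put() turns out to insert row by row, batching the INSERTs into a few round trips is the usual first fix. A minimal sketch with psycopg2's execute_values (I haven't confirmed this is what put() actually does, and the table/column names here are invented):

```python
from psycopg2.extras import execute_values

def put_batched(conn, rows):
    """Insert all rows in large pages instead of one INSERT per row."""
    with conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO observations (fips, dt, variable, value) VALUES %s",
            rows,
            page_size=10_000,  # rows per generated statement; tune as needed
        )
    conn.commit()
```

For full-table loads, Postgres's COPY (e.g. via cursor.copy_expert) is usually faster still.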

mikelehen commented 2 years ago

Hrm, yeah. I'm surprised the NYT scraper takes an hour to insert its data. Might be worth looking into that at some point. One thing I've wondered is whether Postgres is the best option for us. Given that we're not doing any low-latency, optimized queries or anything, would BigQuery fit our needs better? Not at all sure, but it might be worth looking into.
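For what it's worth, loading a Parquet snapshot into BigQuery is a single load job. A sketch with the google-cloud-bigquery client (the project, dataset, and bucket names here are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()
job = client.load_table_from_uri(
    "gs://example-bucket/snapshots/timeseries.parquet",  # placeholder URI
    "example-project.covid.timeseries",                  # placeholder table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    ),
)
job.result()  # block until the load completes
```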

In any case, your plan sounds good to me!

smcclure17 commented 2 years ago

Yeah, I agree that there are probably better options for this--worth looking into!

Some changes:

142b5da22fa689a000251d6742fe6b1b0b87d181 adds a scheduler to run the pipeline daily and sets the default behavior to trigger the API build.
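As a rough illustration of what that scheduling change looks like, assuming the pipeline runs as a Prefect flow (the flow name, entry point, and time of day below are made up; the commit itself is authoritative):

```python
from prefect import Flow, task
from prefect.schedules import CronSchedule

@task
def run_pipeline():
    ...  # stand-in for the real pipeline entry point

# Fire once a day at 09:00 UTC and, per the new default, trigger the API build.
with Flow("daily-update", schedule=CronSchedule("0 9 * * *")) as flow:
    run_pipeline()
```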

cbea414f8b9fb5a31be11c6d1643391f4e60f242 persists the Parquet file and sets the pipeline to read the file from local storage. This is mainly to give us visibility into which Parquet file was used for each snapshot. I went this route mostly because it didn't require making changes in can-scrapers (such as adding a log to GCS to denote the most recent timestamped file).
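The persist-then-read pattern that commit introduces looks roughly like this (the paths and bucket are placeholders; reading gs:// URLs with pandas requires gcsfs):

```python
import pandas as pd

LOCAL_PATH = "data/timeseries.parquet"

# Fetch step: download the Parquet file once and persist it next to the
# snapshot, so every snapshot records exactly which file it was built from.
df = pd.read_parquet("gs://example-bucket/parquet/timeseries.parquet")
df.to_parquet(LOCAL_PATH)

# Later stages read the persisted local copy instead of hitting GCS again.
df = pd.read_parquet(LOCAL_PATH)
```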

I'm more than willing to update this to pull from the most recent timestamped file in GCS and log the name of the file (per our Slack conversation) if you'd prefer that.
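That alternative would look something like the sketch below: list the timestamped objects, take the newest, and log which one was used. The bucket, prefix, and the assumption that the timestamps in the names sort lexicographically are all mine:

```python
from google.cloud import storage

client = storage.Client()
blobs = client.list_blobs("example-bucket", prefix="parquet/timeseries-")

# ISO-style timestamps in the names sort lexicographically, so max() by
# name picks the most recent file; blob.updated would work as well.
latest = max(blobs, key=lambda b: b.name)
print(f"Building snapshot from {latest.name}")
latest.download_to_filename("data/timeseries.parquet")
```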