act-now-coalition / can-scrapers

MIT License
Add Prefect flow to generate one Parquet file per location #381

Closed clizzin closed 1 year ago

clizzin commented 2 years ago

@mikelehen Here's a proof of concept for using a Prefect flow to generate one Parquet file per location. For comparison, the original flow that generates a single Parquet file is here: https://github.com/covid-projections/can-scrapers/blob/d24325185168d74c72c5dc8391ba62fd677ae05b/services/prefect/flows/update_api_view.py

I'm opening this PR just for inspection; it's not production-ready. Things we'd have to do before deploying:

I poked around a bit in the Prefect UI but couldn't find monitoring for tasks' resource usage. Maybe you can help me with that?

Overall, this was pretty straightforward to write, and I liked Prefect's approach to dynamically spawning one task for each output of a prior task via .map. I noticed that the scraper flows are currently spawned from a static list, but I was able to write a flow that queries the PostgreSQL DB for a list of locations and then dynamically spawns a task for each location, which seems useful if we're going to start changing the set of locations.
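For readers unfamiliar with the pattern, here is a minimal standalone sketch of that fan-out. It is not the code in this PR and does not use Prefect: `sqlite3` stands in for PostgreSQL, `ThreadPoolExecutor.map` stands in for Prefect's `.map`, and CSV stands in for Parquet, so that it runs with only the standard library. The table name `api_view`, its columns, the sample rows, and all function names are hypothetical.

```python
import csv
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical sample rows: (location, dt, cases).
ROWS = [
    ("06", "2021-01-01", 10),
    ("06", "2021-01-02", 12),
    ("48", "2021-01-01", 7),
]


def make_db(db_path):
    """Create a small demo database (stand-in for the real PostgreSQL DB)."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE api_view (location TEXT, dt TEXT, cases INTEGER)")
        conn.executemany("INSERT INTO api_view VALUES (?, ?, ?)", ROWS)
        conn.commit()
    finally:
        conn.close()


def fetch_locations(db_path):
    """Upstream step: query the DB for the current list of locations."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT DISTINCT location FROM api_view ORDER BY location"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


def write_location_file(db_path, out_dir, location):
    """Mapped step: write one output file for a single location."""
    conn = sqlite3.connect(db_path)  # each task opens its own connection
    try:
        rows = conn.execute(
            "SELECT location, dt, cases FROM api_view WHERE location = ? ORDER BY dt",
            (location,),
        ).fetchall()
    finally:
        conn.close()
    out_path = Path(out_dir) / f"{location}.csv"
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["location", "dt", "cases"])
        writer.writerows(rows)
    return out_path


def run(db_path, out_dir):
    """Fan out: one task per location returned by the upstream query."""
    locations = fetch_locations(db_path)
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(
            pool.map(lambda loc: write_location_file(db_path, out_dir, loc), locations)
        )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db = str(Path(tmp) / "demo.db")
        make_db(db)
        for path in run(db, tmp):
            print(path.name)
```

Because the location list comes from a query at run time rather than a static list, adding or removing a location in the table changes the set of tasks without any change to the flow definition.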

My next task is to write a flow that performs our minimal pipeline for exploration purposes.