act-now-coalition / can-scrapers

MIT License

Decrease number of Prefect flows by excluding unnecessary scrapers #436

Closed smcclure17 closed 2 years ago

smcclure17 commented 2 years ago

Decreases the number of scrapers executed in the MainFlows from 114 to 75. The majority of the remaining scrapers are state-specific demographic vaccine scrapers.

This disables any scrapers not currently used further downstream in the pipeline, including those that don't collect timeseries data. If we want to re-activate those scrapers in the future, we will be missing chunks of timeseries data, but I find it unlikely that we'll do so (and I don't think it's a huge deal, considering the stable/stagnant nature of the vaccine data).
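
The exclusion described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ScraperSpec`, its `used_downstream` flag, and `select_active_scrapers` are hypothetical names, and it assumes each scraper can be tagged by whether its output is consumed later in the pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScraperSpec:
    """Hypothetical description of one scraper (not the real can-scrapers class)."""

    name: str
    # Assumed flag: True if the scraper's output is consumed downstream.
    used_downstream: bool


def select_active_scrapers(scrapers: List[ScraperSpec]) -> List[ScraperSpec]:
    """Return only the scrapers whose output is used downstream, in input order."""
    return [s for s in scrapers if s.used_downstream]
```

Only the scrapers returned by `select_active_scrapers` would then be registered as flows; the rest stay in the codebase but are never scheduled.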

smcclure17 commented 2 years ago

> do you know why we have ~12k flow runs but ~54k task runs?

Yeah, this is because each flow usually has 5-6 tasks. We could safely remove the validate task from each flow, since that task doesn't actually do anything (we never got around to implementing it in a working/cohesive manner IIRC, so we just disabled the code in it).
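
As a rough check on those numbers, here is a small sketch of the arithmetic. It uses the approximate counts quoted above (~12k flow runs, ~54k task runs) and assumes, for illustration only, that every flow run executes the validate task exactly once; the function names are hypothetical.

```python
def tasks_per_flow_run(task_runs: int, flow_runs: int) -> float:
    """Average number of task runs per flow run."""
    if flow_runs <= 0:
        raise ValueError("flow_runs must be positive")
    return task_runs / flow_runs


def task_runs_without_one_task(task_runs: int, flow_runs: int) -> int:
    """Task runs remaining if one task is dropped from every flow.

    Assumes the dropped task runs exactly once per flow run.
    """
    return task_runs - flow_runs


if __name__ == "__main__":
    # Approximate figures from the question above.
    print(tasks_per_flow_run(54_000, 12_000))          # 4.5
    print(task_runs_without_one_task(54_000, 12_000))  # 42000
```

Under that assumption, dropping the validate task would cut the task-run count by about as many runs as there are flow runs.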