catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Consider adding --rerun flag #888

Closed · rousik closed this 3 years ago

rousik commented 3 years ago

Right now every ETL run effectively starts from scratch. With Prefect we can gain caching benefits: task results, and potentially pre-calculated data frames/tables, can be stored. The current caching is somewhat crude; you can either turn it on or off, but there is very little control over how different runs interact with the cache.

I'm thinking that perhaps the simplest approach (from an ease-of-use perspective) would be to introduce a --rerun $run_uuid flag, which would attempt to resume the previous run identified by $run_uuid.

If this flag is not set, the ETL would generate a random UUID on every run and point its cache directories at $cache_root/$uuid. When --rerun is set, the cache directories would point at the same place and pick up whatever the previous run cached.
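For illustration, a minimal sketch of that flag and the per-run cache directory (the `parse_args`/`resolve_cache_dir` names and the standalone CLI here are hypothetical, not the actual PUDL entry point):

```python
import argparse
import uuid
from pathlib import Path
from typing import Optional


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run the PUDL ETL.")
    # The proposed flag: resume a prior run by its UUID.
    parser.add_argument(
        "--rerun",
        metavar="RUN_UUID",
        default=None,
        help="Resume the previous run identified by RUN_UUID.",
    )
    return parser.parse_args()


def resolve_cache_dir(cache_root: Path, rerun: Optional[str]) -> Path:
    # Reuse the prior run's UUID when --rerun was given; otherwise mint a new one.
    run_uuid = rerun if rerun is not None else uuid.uuid4().hex
    return cache_root / run_uuid
```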

Right now, Prefect result caching uses the default path .prefect/results, and DataFrameCollection caches under $pudl_in/prefect-task-cache. Ideally, both of these would point to a similar location (either local disk or a GCS path), perhaps $cache_root/$run_uuid/prefect-results and $cache_root/$run_uuid/data-frame-cache.
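A sketch of wiring both caches to the same per-run root, assuming the Prefect Core (0.x) API of the era; the DataFrameCollection cache is shown only as a path, since its actual configuration hook lives in the prefect branch:

```python
from pathlib import Path

from prefect import Flow
from prefect.engine.results import LocalResult

run_dir = Path("/tmp/pudl-cache") / "d41d8cd98f00"  # i.e. $cache_root/$run_uuid

# Prefect task results land under $cache_root/$run_uuid/prefect-results
# instead of the default .prefect/results directory.
with Flow("pudl-etl", result=LocalResult(dir=str(run_dir / "prefect-results"))) as flow:
    ...  # tasks go here

# DataFrameCollection would cache under the sibling directory
# $cache_root/$run_uuid/data-frame-cache (path layout per this proposal).
df_cache_dir = run_dir / "data-frame-cache"
```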

We may also consider storing the settings.yml file there, so that --rerun won't need to know which settings were used for the first run.
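A sketch of that bookkeeping (function name hypothetical): copy settings.yml into the run directory on a fresh run, and read the stashed copy back on --rerun:

```python
import shutil
from pathlib import Path
from typing import Optional


def stash_or_load_settings(run_dir: Path, settings_path: Optional[Path] = None) -> Path:
    """Return the path of the settings.yml stored alongside the run's cache."""
    stashed = run_dir / "settings.yml"
    if settings_path is not None:
        # Fresh run: record the settings next to the cached results.
        run_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(settings_path, stashed)
    # On --rerun the caller omits settings_path and just reads the stashed copy.
    return stashed
```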

Using local disk for caching carries the risk of exhausting the available disk space. We could consider running a cache cleanup before the ETL starts (e.g. wiping everything last modified earlier than some --max-cache-age cutoff, or something like that).
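For example, a cleanup pass like the following could run at startup, using directory mtimes as a proxy for last use (the day-based unit for --max-cache-age is an assumption):

```python
import shutil
import time
from pathlib import Path


def prune_stale_runs(cache_root: Path, max_cache_age_days: float) -> None:
    """Delete per-run cache dirs whose mtime is older than max_cache_age_days."""
    cutoff = time.time() - max_cache_age_days * 86400
    for run_dir in cache_root.iterdir():
        # Each subdirectory of cache_root is one run's cache ($cache_root/$uuid).
        if run_dir.is_dir() and run_dir.stat().st_mtime < cutoff:
            shutil.rmtree(run_dir)
```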

zaneselvans commented 3 years ago

The idea here is to use it just for development / testing purposes, right? Like we try to do a run, and it dies, and we change some code to fix the thing that broke, and start over from the last saved checkpoint? There isn't any way for us to know what code / data has changed upstream if everything is in the same repository, is there?

I've been wondering whether it would eventually make sense to have each input dataset in its own repository as a kind of "plugin" for a skeletal PUDL ETL application. In which case we would know which specific code bases had changed upstream, and be able to re-use the cached outputs if nothing upstream had changed. But also this could just be delusions of grandeur on my part.

rousik commented 3 years ago

Yep, this is simply about providing a user interface to pipeline caching. If you run the ETL and it fails, you can rerun it with --rerun and the same run_uuid, and the pipeline will automatically pick up where it left off last time. This will probably be useful both for development and for recovering from production failures during data releases.

rousik commented 3 years ago

This is now implemented in the prefect branch.