catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Get DaskExecutor working locally. #1216

Closed bendnorman closed 2 years ago

bendnorman commented 2 years ago

I was able to run the ETL easily using the LocalExecutor. Running docker-compose up creates two Dask workers and runs the ETL using the DaskExecutor. That run started, but it got hung up on various parts of the ETL and produced some Dask warnings about tasks holding onto process locks for too long.

bendnorman commented 2 years ago

Running etl_fast.yml using the DaskExecutor produced a couple of errors like this one, each for a different table:

 FileNotFoundError: [Errno 2] No such file or directory: '/pudl/outputs/cache/2021-09-14-2222-a1c3c13c-fd40-4aa3-8814-c25d0fcf88e6/dataframes/0cdfb9fa15ac11eca3f90242ac120003/boiler_fuel_eia923'

I checked the Prefect flow chart, and the tasks seem to be executed in the correct order, which makes me think this is a caching problem that shows up when multiple processes are involved.
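One way to probe that hypothesis is to ask each worker whether it can see the cache path at all. The helper below is a hypothetical diagnostic, not part of PUDL or Prefect; `describe_cache_path` and its return keys are names made up for this sketch, and it only checks path visibility from the calling process.

```python
import os
import socket
from pathlib import Path


def describe_cache_path(path):
    """Report whether `path` is visible from the process that calls this.

    Hypothetical diagnostic helper: it does not touch PUDL or Prefect.
    """
    p = Path(path)
    # Walk up the ancestors to find the deepest directory that does exist,
    # which hints at where the cache tree stops being visible.
    nearest = next((a for a in p.parents if a.exists()), None)
    return {
        "host": socket.gethostname(),
        "pid": os.getpid(),
        "path": str(p),
        "exists": p.exists(),
        "nearest_existing_parent": str(nearest) if nearest is not None else "",
    }
```

With a dask.distributed `Client` connected to the scheduler, `client.run(describe_cache_path, "/pudl/outputs/cache")` would return one report per worker. If a worker reports `exists: False`, the cache directory is probably not shared between the worker containers, which would be consistent with the error above without proving it is the cause.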

Luckily, running the fast ETL using the LocalDaskExecutor works! This executor runs on a single node and should still provide a speedup, so it is good enough for this first iteration of cloudification. We will need the DaskExecutor if we want multi-node execution.
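For reference, here is a minimal sketch of switching between the three executors discussed in this thread. It assumes a Prefect release that exposes `prefect.executors` (0.14 through 1.x); the mode names, `executor_spec`, and `build_executor` are hypothetical and do not reflect how PUDL actually wires this up.

```python
import importlib


def executor_spec(mode, dask_address=None, num_workers=2):
    """Map a mode name to a Prefect executor class name and its kwargs.

    The mode names ("local", "local-dask", "dask") are made up for this sketch.
    """
    if mode == "local":
        # Runs tasks sequentially in the current process.
        return "LocalExecutor", {}
    if mode == "local-dask":
        # Single node; tasks run in parallel via a local Dask scheduler.
        return "LocalDaskExecutor", {
            "scheduler": "processes",
            "num_workers": num_workers,
        }
    if mode == "dask":
        # Multi-node; connects to an already running Dask scheduler.
        if not dask_address:
            raise ValueError("mode 'dask' needs a scheduler address")
        return "DaskExecutor", {"address": dask_address}
    raise ValueError(f"unknown executor mode: {mode!r}")


def build_executor(mode, **options):
    """Instantiate the executor; Prefect is imported only when this is called."""
    name, kwargs = executor_spec(mode, **options)
    executors = importlib.import_module("prefect.executors")
    return getattr(executors, name)(**kwargs)
```

A flow would then be run with something like `flow.run(executor=build_executor("local-dask"))`. Keeping the choice behind one function means moving to multi-node execution later is a matter of passing `"dask"` and a scheduler address.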