catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 105 forks

Improve ergonomics & performance of pudl.etl DAG #2386

Open zaneselvans opened 1 year ago

zaneselvans commented 1 year ago

Our initial PUDL ETL DAG has a couple of clear performance bottlenecks, and a few assets or asset groups with more dependencies than are really necessary. Modest refactoring of some of these assets or asset groups could significantly reduce how long it takes the ETL to run, and make the actual dependencies between assets clearer.

There are also a few tweaks we can make to our IO Managers and other Dagster abstractions that will make the development / migration process smoother.

- [ ] #2417
- [ ] #2387
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2431
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2264
- [ ] #2376
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2377
- [ ] #2385
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2468
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2293
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2444
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2470
- [ ] #216
- [ ] #475
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2666

jdangerx commented 1 year ago

I'm excited about these (proposed) changes! I wonder if it makes sense to set specific goals around speedup & ergonomic improvements, so we know when the performance is "good enough" - I could see myself falling down quite the rabbit hole here if we don't do that ahead of time...

If we do set goals, we should be clear about whether we're targeting single-core/dual-core vs. many-core performance (thinking of our poor GH Actions runners... though if our many-core performance is good enough we might be able to just kick off some workloads in GCP to do the heavy lifting).

zaneselvans commented 1 year ago

I mean... I think it's good enough as it is! It's already waaay faster than before. But I think the big 3 parallelizations (CEMS, Excel reads, disentangling the EIA transform) will really help make local development feel more responsive (since many of us have 8-10 cores) and will make running full nightly builds on an 8-CPU instance take... an hour? That will save a small amount of money, but maybe more importantly it will let us know more about the state of the code -- not just on dev and on special occasions. I don't think these 3 improvements will be very complex to implement.
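
To make the idea concrete, here's a minimal sketch of that kind of parallelization, assuming each unit of work (say, one input file or one year of data) can be processed independently of the others. `run_partitions` and `transform_partition` are hypothetical names made up for illustration, not existing PUDL functions, and in practice this would more likely be expressed through Dagster than through a hand-rolled pool.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Dict, Hashable, Iterable, Optional


def run_partitions(
    func: Callable,
    partitions: Iterable[Hashable],
    max_workers: Optional[int] = None,
    use_threads: bool = False,
) -> Dict:
    """Apply ``func`` to every partition in parallel and key the results by partition.

    Hypothetical helper: assumes partitions do not depend on one another.
    Processes suit CPU-bound transforms; threads suit I/O-bound reads.
    """
    keys = list(partitions)
    executor_cls = ThreadPoolExecutor if use_threads else ProcessPoolExecutor
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so zip pairs each
        # partition key with its own result.
        return dict(zip(keys, pool.map(func, keys)))


def transform_partition(partition):
    """Hypothetical stand-in for the real per-partition work."""
    year, state = partition
    return year * len(state)
```

Calling `run_partitions(transform_partition, [(2020, "CO"), (2021, "TX")], max_workers=8)` would spread the work over up to 8 worker processes. With process pools, `func` needs to be defined at module level so it can be pickled.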

But beyond that I think we should focus on getting to where we can distribute just data and not software, with the usability improvements listed above helping to make that work more pleasant and productive. @bendnorman and I outlined several groups of assets that we need to bring in (not even all of them...) and it is, uh, kind of a lot. Hopefully once we get in the groove it's a turn-the-crank kind of thing, but even still it'll take a while.

zaneselvans commented 1 year ago

Of the 3 big performance bottlenecks I'd like to take on #2387 if that's okay.