catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Observe nightly build CPU usage & identify parallelism bottlenecks #2444

Open jdangerx opened 1 year ago

jdangerx commented 1 year ago

In interminable sessions of watching nightly build output, I've noticed that CPU usage is often pegged at ~12.5%, i.e. 100% of a single core.

GCloud metrics agree:

[screenshot: GCloud CPU utilization metrics during a nightly build]

We're paying for 8 cores; we should use 'em!

In theory, Dagster should be parallelizing asset materializations, but we definitely have some long-running assets (the EIA transform, for example) that aren't parallelized internally.
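For reference, the knob that controls how many ops Dagster runs at once is the multiprocess executor's `max_concurrent` setting. Here's a minimal sketch under the assumption that we define an asset job with that executor; all names below are illustrative, and 8 just matches the VM's core count:

```python
# Minimal sketch, not our actual job definitions: an asset job whose
# multiprocess executor may run up to 8 ops at once. Asset and job names here
# are illustrative placeholders.
from dagster import Definitions, asset, define_asset_job, multiprocess_executor

@asset
def raw_example_table() -> list[int]:
    """Stand-in for an extracted table."""
    return [1, 2, 3]

@asset
def transformed_example_table(raw_example_table: list[int]) -> list[int]:
    """Stand-in for a transformed table that depends on the raw one."""
    return [x * 2 for x in raw_example_table]

defs = Definitions(
    assets=[raw_example_table, transformed_example_table],
    jobs=[
        define_asset_job(
            name="nightly_etl_sketch",
            # One subprocess per op, up to one op per paid-for core.
            executor_def=multiprocess_executor.configured({"max_concurrent": 8}),
        )
    ],
)
```

Even with that in place, any single long-running op still occupies only one core, which is exactly the pattern the GCloud graph shows.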

It'd be nice to measure this to see what the heck is going on.

This is kind of a weird "explore / investigate" issue, so it's tricky to scope, but I think "having a way to look at some basic operational metrics" is probably a good place to start.

### Scope
- [ ] we can see CPU usage over time & correlate with asset materializations/op executions
- [ ] we persist op execution times & CPU/memory metrics for each run
### Next steps
- [ ] add instrumentation? There's a Prometheus integration for Dagster, but we don't run Prometheus yet...
- [ ] write a script that parses the Dagster log output for the op times? This is probably the thing to actually do; see the sketch after this list
- [ ] compare the op times with the GCloud CPU metrics
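Here's a rough sketch of what that log-parsing script could look like. It assumes the captured nightly build log keeps a Dagster-console-style format where each event line carries a timestamp, the step key, and an event type like STEP_START or STEP_SUCCESS; the regex and the log path are guesses that would need adjusting against real output.

```python
"""Rough sketch: pull per-op wall times out of a captured Dagster console log."""
import re
from datetime import datetime
from pathlib import Path

# Placeholder pattern guessing at "<timestamp> ... <step_key> - STEP_START - ..."
EVENT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
    r".*? - (?P<step>\S+) - (?P<event>STEP_START|STEP_SUCCESS|STEP_FAILURE) - "
)

def op_durations(log_path: Path) -> dict[str, float]:
    """Map each step key to its wall time in seconds."""
    starts: dict[str, datetime] = {}
    durations: dict[str, float] = {}
    for line in log_path.read_text().splitlines():
        m = EVENT_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        if m["event"] == "STEP_START":
            starts[m["step"]] = ts
        elif m["step"] in starts:
            durations[m["step"]] = (ts - starts.pop(m["step"])).total_seconds()
    return durations

if __name__ == "__main__":
    # Placeholder path for a captured nightly build log; print slowest ops first.
    for step, secs in sorted(
        op_durations(Path("nightly-build.log")).items(),
        key=lambda kv: kv[1],
        reverse=True,
    ):
        print(f"{secs:10.1f}s  {step}")
```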
zaneselvans commented 1 year ago

Even after we get CEMS parallelized in #2376 (which could shave off up to an hour if we max out the CPUs), I think the majority of this long, slow, one-core doldrum will remain, because it comes from running the integration tests against the full DB and then running the data validations. Those two test environments run serially right now, but they both read from the live DB and could certainly run in parallel with little to no refactoring. I think that would shave off about an hour, since that's how long the data validations take.
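As a heavily hedged illustration of "run them in parallel with little to no refactoring": the nightly build script could simply launch the two test invocations side by side and fail if either fails. The commands below are placeholders, not our actual tox/pytest environments.

```python
# Sketch only: launch the integration tests and the data validations at the
# same time against the already-built DB. The two commands are placeholders
# for whatever the nightly build actually invokes.
import concurrent.futures
import subprocess
import sys

COMMANDS = [
    ["tox", "-e", "integration"],  # placeholder env name
    ["tox", "-e", "validate"],     # placeholder env name
]

def run(cmd: list[str]) -> int:
    """Run one test environment and return its exit code."""
    return subprocess.run(cmd).returncode

with concurrent.futures.ThreadPoolExecutor(max_workers=len(COMMANDS)) as pool:
    exit_codes = list(pool.map(run, COMMANDS))

# Fail the build if either environment failed.
sys.exit(max(exit_codes))
```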

Moving the validations out of the tests and into the main codebase could have multiple advantages: if they run closer to where the data is being generated in the DAG, we get much more specificity about where errors are cropping up, and they can also take advantage of the parallelism that Dagster offers.
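A minimal sketch of what one of those in-DAG validations could look like, written as a plain Dagster asset that depends on the table it checks; the table name and column expectations here are made up for illustration:

```python
# Sketch only: a data validation expressed as a Dagster asset, so it runs in
# the DAG right after the table it checks and fails the run with a specific
# error. The asset name and checks below are illustrative, not real PUDL assets.
import pandas as pd
from dagster import asset

@asset
def validate_example_plants_table(example_plants_table: pd.DataFrame) -> None:
    """Blow up immediately if the upstream table violates basic expectations."""
    if example_plants_table["plant_id_eia"].isna().any():
        raise ValueError("example_plants_table contains null plant_id_eia values.")
    if example_plants_table.duplicated(subset=["plant_id_eia", "report_date"]).any():
        raise ValueError("example_plants_table has duplicate plant/report_date rows.")
```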

Once all of the output tables & analyses that the full integration tests currently generate or run are in the DAG, a lot of that work will also be parallelized automatically. Checking that the outputs look how we expect is a lot less work than generating them in the first place, so this should substantially slim down the test phase; or, as with the validations, we could move a lot of those checks directly into the DAG, close to where the data processing / analysis is happening.

One problem I think we'll have to address pretty soon, as all this processing gets parallelized, is our memory footprint. Right now we have these very brief spikes to more than 60GB of memory usage that I don't think we understand. Apart from those spikes, there are a couple of times when memory usage gets up to ~40GB, one of which is the FERC plant ID assignment process (stacked on top of whatever other Dagster assets are running at the time).

[screenshot: memory usage over time during a nightly build]
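To chase those spikes down, even a tiny sidecar sampler writing timestamped CPU/memory readings to a CSV would let us line them up against the Dagster step timestamps. A minimal sketch using psutil; the output path and sampling interval are arbitrary:

```python
# Sketch only: a sidecar process that samples machine-wide CPU and memory
# every few seconds so spikes can be matched against Dagster step timestamps.
import csv
from datetime import datetime, timezone

import psutil

SAMPLE_SECONDS = 5

with open("resource-samples.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["utc_timestamp", "cpu_percent_per_core", "mem_used_gb"])
    while True:
        # cpu_percent blocks for SAMPLE_SECONDS and reports utilization over
        # that window, so this call also serves as the sampling clock.
        per_core = psutil.cpu_percent(interval=SAMPLE_SECONDS, percpu=True)
        writer.writerow([
            datetime.now(timezone.utc).isoformat(timespec="seconds"),
            per_core,
            round(psutil.virtual_memory().used / 2**30, 2),
        ])
        f.flush()
```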