catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Profile PUDL #1172

Open bendnorman opened 2 years ago

bendnorman commented 2 years ago

Profiling PUDL ETL will provide a nice performance baseline for the parallelization project and hopefully give us some insight into the slow parts of PUDL.
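
To get that baseline, something as simple as wrapping the ETL entrypoint in cProfile might be enough for function-level timings. A minimal sketch; the run_etl placeholder is hypothetical, not the actual PUDL API:

import cProfile
import pstats


def run_etl():
    """Placeholder: call whatever function kicks off the PUDL ETL here."""
    ...


# Dump raw profiling stats to disk so they can be inspected or diffed later.
cProfile.run("run_etl()", "pudl_etl.prof")

# Print the 25 functions with the largest cumulative time.
stats = pstats.Stats("pudl_etl.prof")
stats.sort_stats("cumulative").print_stats(25)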

bendnorman commented 2 years ago

@TrentonBush did preliminary profiling of the ETL and found that writing PUDL and CEMS to csvs takes foreeever. Ripping out datapackages is being handled in #1176.

As we work on this Epic we should:

zaneselvans commented 2 years ago

Oooh, it would be great to have profiling as part of the CI so we can catch anything that slows it down, or see how much we've sped it up as we go, instead of wondering how it got so slow over time after the fact.
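
A first pass could be as simple as timing the ETL in the CI job and failing if it regresses past a stored baseline. A rough sketch, not tied to any particular CI system; the baseline file and run_etl placeholder are hypothetical:

import json
import time
from pathlib import Path

BASELINE = Path("etl_timing_baseline.json")  # hypothetical artifact cached between CI runs
TOLERANCE = 1.25  # allow a 25% slowdown before failing


def run_etl():
    """Placeholder: call whatever function kicks off the PUDL ETL here."""
    ...


start = time.perf_counter()
run_etl()
elapsed = time.perf_counter() - start

if BASELINE.exists():
    previous = json.loads(BASELINE.read_text())["elapsed"]
    assert elapsed <= previous * TOLERANCE, (
        f"ETL slowed down: {elapsed:.0f}s now vs {previous:.0f}s baseline"
    )

# Update the baseline so future runs compare against the latest timing.
BASELINE.write_text(json.dumps({"elapsed": elapsed}))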

zaneselvans commented 2 years ago

For the new SQL/Parquet outputs I've got this little script:

#!/usr/bin/bash
# Profile the FERC Form 1 extraction and then the main PUDL ETL under
# Pyroscope, using py-spy as the sampling profiler. Both commands get the
# same positional argument ($1), e.g. the ETL settings file.

pyroscope exec --log-level info --spy-name pyspy \
    --application-name pudl ferc1_to_sqlite --clobber "$1" && \
pyroscope exec --log-level info --spy-name pyspy \
    --application-name pudl pudl_etl --clobber "$1"

But when I run it I get some errors about it not being able to communicate with the server...

agent/session.go:204 Failed to copy PyInterpreterState from process
agent/session.go:204 error taking snapshot: Failed to copy filename

Although the server is running and seems to be happily profiling itself.

I ran the full ETL, not including the ferc1_to_sqlite step, and without any parallelism it took 2 hours: roughly 40 minutes in the FERC + EIA ETL and 1 hour 20 minutes for the Parquet output. Huge blocks of time in both are dedicated to reading the original raw data from Excel / CSV, both of which are embarrassingly parallelizable. I think that's already part of the Prefect PR for EPA CEMS, but maybe we should look at extending the use of Prefect to reading the raw Excel data as well; a rough sketch follows below.
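
With Prefect we'd map tasks over the files, but even the standard library gets at the basic pattern. A sketch using concurrent.futures, one process per workbook since read_excel is CPU-bound; the file paths and sheet name are made up:

from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import pandas as pd


def read_one(path: Path) -> pd.DataFrame:
    # Each workbook parses in its own process, so the pure-Python work in
    # openpyxl/pandas runs in parallel across years.
    return pd.read_excel(path, sheet_name=0)


if __name__ == "__main__":
    # Hypothetical layout: one EIA-923 workbook per year.
    excel_paths = sorted(Path("data/eia923").glob("*.xlsx"))
    with ProcessPoolExecutor() as pool:
        dfs = list(pool.map(read_one, excel_paths))
    combined = pd.concat(dfs, ignore_index=True)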

bendnorman commented 2 years ago

Versions to profile:

jdangerx commented 1 year ago

Resurrecting this to talk a bit about performance. I ran devtools/pyspy_pudl_etl.sh and got these profile results, which we can view on speedscope.app: pyspy_pudl_etl_2023.tar.gz (for some reason GH doesn't like JSON attachments, so I tarred it).

Looks like there are 3 slow things:


Executed in  293.99 millis    fish           external
   usr time  211.43 millis   83.00 micros   211.35 millis
   sys time   81.30 millis  533.00 micros    80.77 millis


* writing to sqlite in the load step (0:22 / 3:51) - sounds like other people have encountered this, and we might be able to bypass pandas for some speed too (see the sketch below): https://stackoverflow.com/questions/51602216/speeding-up-performance-when-writing-from-pandas-to-sqlite
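
For reference, a hedged sketch of the two usual tricks from that thread: batching rows through pandas' to_sql, or dropping down to sqlite3.executemany and skipping pandas entirely. The table and column names here are made up:

import sqlite3

import pandas as pd
from sqlalchemy import create_engine

# Toy dataframe standing in for a real PUDL table.
df = pd.DataFrame({"plant_id": [1, 2], "hour": [0, 1], "co2_tons": [3.5, 4.2]})

# Option 1: stay in pandas, but batch many rows per INSERT. Keep chunksize
# modest so rows * columns stays under SQLite's bound-variable limit.
engine = create_engine("sqlite:///pudl.sqlite")
df.to_sql(
    "hourly_emissions",  # hypothetical table name
    engine,
    if_exists="append",
    index=False,
    chunksize=200,
    method="multi",
)

# Option 2: bypass pandas and hand tuples straight to sqlite3.
conn = sqlite3.connect("pudl.sqlite")
conn.executemany(
    "INSERT INTO hourly_emissions VALUES (?, ?, ?)",
    df.itertuples(index=False, name=None),
)
conn.commit()
conn.close()
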
zaneselvans commented 1 year ago

read_excel is indeed horrendously slow, and right now we are reading a bunch of Excel files serially when we could be reading them all in parallel. This wouldn't speed up the fast ETL (since it only looks at a single year), but for the full ETL (which reads in 20 years, i.e. 20 spreadsheets) it could make a big difference.

The pandas infrastructure for reading Excel had some big changes last year, and xlrd no longer works except for the legacy .xls format, which openpyxl doesn't support at all.
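
In case it's useful, a small sketch of how that split shakes out in practice (xlrd 2.x only handles legacy .xls, openpyxl only handles the newer formats):

from pathlib import Path

import pandas as pd


def read_workbook(path: Path) -> dict[str, pd.DataFrame]:
    # xlrd (2.x) can only read the legacy .xls format; openpyxl handles .xlsx.
    engine = "xlrd" if path.suffix.lower() == ".xls" else "openpyxl"
    # sheet_name=None returns every sheet as a {name: DataFrame} dict.
    return pd.read_excel(path, sheet_name=None, engine=engine)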

I think converting the spreadsheets to CSVs is probably a recipe for pain. It would only be advantageous if we then stored and referred back to the CSVs rather than starting from the original data, and every format transformation seems to come with its own data type and structural headaches.

When we're using Dagster, in theory it will be possible to restart the pipeline from intermediate steps (e.g. from the first extracted dataframes) rather than going all the way back to the spreadsheets, but @bendnorman would know better whether that is actually going to be helpful for us. Probably not in a CI context, I'd guess.
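
For what it's worth, here's roughly what I imagine that looking like with Dagster's software-defined assets; the asset names, file path, and transform are only illustrative. The expensive extraction becomes an asset whose output is persisted by the IO manager, so downstream assets can be re-materialized without re-reading the spreadsheets:

import pandas as pd
from dagster import asset


@asset
def raw_eia860_generators() -> pd.DataFrame:
    # Expensive step: read the raw spreadsheet (path is hypothetical). The IO
    # manager persists the result, so this only reruns when explicitly requested.
    return pd.read_excel("data/eia860/generators.xlsx")


@asset
def generators_eia860(raw_eia860_generators: pd.DataFrame) -> pd.DataFrame:
    # Downstream transform; re-materializing this loads the upstream dataframe
    # from the IO manager instead of re-reading the spreadsheet.
    return raw_eia860_generators.dropna(how="all")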

It looks like there's a thin python wrapper for ripgrep.

The ETL that happens in the CI isn't so bad. The real timesuck is all of the output & analysis functions, which in many cases are doing SQL-ish things but in pandas. I'm curious to see how much time we save by switching to using database views instead, and also storing the more complex analysis outputs in the DB directly. Some of them get regenerated multiple times in the course of the CI now, and if they were in effect cached in the DB when they're first generated, they would just be reused directly when some other output depends on them downstream.
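
As a toy example of what I mean (the table and column names are only illustrative, not the real schema), the sort of groupby we keep redoing in pandas could be created once as a view in the database and reused by every downstream consumer:

import sqlite3

conn = sqlite3.connect("pudl.sqlite")
conn.executescript(
    """
    -- Illustrative only: cache a derived output as a view so every consumer
    -- reads the same precomputed result instead of redoing it in pandas.
    CREATE VIEW IF NOT EXISTS plant_capacity AS
    SELECT plant_id_eia, report_date, SUM(capacity_mw) AS capacity_mw
    FROM generators_eia860
    GROUP BY plant_id_eia, report_date;
    """
)
conn.commit()
conn.close()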

bendnorman commented 1 year ago

I agree with everything Zane flagged here.

read_excel seems like it will forever be very slow. With Dagster we'll be able to run portions of the ETL that are downstream of the read_excel steps during development; the data will be loaded from a pickle file or a database. We might be able to set up our DAG so portions of it are run based on code changes and data freshness, which would allow us to occasionally skip the read_excel step in our CI/CD. I'd need to look into this more though.

I'm pretty sure there are some looong running functions in the output and analysis modules though I haven't profiled them. Database views might speed up the validation tests.