catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 105 forks source link

Improve performance of integration tests on GitHub #2819

Open zaneselvans opened 10 months ago

zaneselvans commented 10 months ago

Our integration tests are painfully slow in CI, and are also currently running out of memory after running for about an hour, but we're on the verge of being able to change all that pretty dramatically. In the past we have been unable to effectively split up the tests and run them in parallel because:

Now we've got all but one of the output tables pulled into the DB, and will soon be able to deprecate PudlTabl and remove it from our tests entirely. Because all of the ETL steps are now being done within Dagster, that computation can be run in parallel very effectively -- the fast tests saturate the 10 cores on my laptop for the full duration of the tests, so we might even be able to use a 16 or 32CPU GitHub runner and get ~linear reductions in runtime.

On a 10 core MacBookPro with 32GB of memory:

- [x] Switch to a [4CPU/16GB GitHub runner](https://docs.github.com/en/actions/using-github-hosted-runners/about-larger-runners) for integration test workflow to avoid running out of memory.
- [ ] Make FERC to SQLite its own pre-pytest task, rather than an expensive fixture
- [ ] Make building the PUDL DB its own pre-pytest task, rather than an expensive fixture.
- [ ] Make Dagster run the ETL in parallel using all CPUs rather than in-process.
- [ ] Collect coverage data from FERC to SQLite and PUDL ETL outside of pytest from parallel processes.
- [ ] Make the current `pudl_engine` fixture just return a connection engine pointing at the fresh DB.
- [ ] (optional) Use `pytest-xdist` or something similar to run the post-ETL integration tests in parallel. How effective this is will depend on speed of concurrent reads from SQLite.
- [ ] (optional) Restrict the years of data processed by `imputed_hourly_demand_ferc714` (by far the longest running asset, but also parallelized internally, so it may not end up controlling runtime)
- [ ] Identify the best size runner to use for the integration tests, while balancing cost against elapsed time.
zaneselvans commented 10 months ago

Maybe converting the FERC data and running the PUDL ETL outside of pytest while collecting coverage is as easy as doing what we already do inside tox -e nuke?

coverage run --append src/pudl/convert/ferc_to_sqlite.py src/pudl/package_data/settings/etl_fast.yml
coverage run --append src/pudl/cli/etl.py src/pudl/package_data/settings/etl_fast.yml

Running the ETL this way definitely pegs all my CPU cores so it's parallelized. The coverage output file was updated, and coverage also has a parallel run mode that separates out the coverage reports by sub-process for later use with coverage combine. Running the ETL script only took 10min 15sec compared to the 16min 29sec with the Dagster UI, so maybe there's less overhead doing it this way?

zaneselvans commented 10 months ago

The coverage report doesn't look like it's capturing everything that's run, maybe because they aren't subprocesses? Just entirely different processes getting kicked off? This coverage looks like it's probably just the initial import of all the pudl modules.

coverage report --data-file .coverage.Urras.local.30314.786163

Name                                               Stmts   Miss   Cover
-----------------------------------------------------------------------
src/pudl/analysis/allocate_gen_fuel.py               323    256  20.74%
src/pudl/analysis/classify_plants_ferc1.py           158    132  16.46%
src/pudl/analysis/epacamd_eia.py                      18     12  33.33%
src/pudl/analysis/ferc1_eia.py                       203    165  18.72%
src/pudl/analysis/mcoe.py                             78     48  38.46%
src/pudl/analysis/plant_parts_eia.py                 257    191  25.68%
src/pudl/analysis/service_territory.py               117     86  26.50%
src/pudl/analysis/spatial.py                         112     96  14.29%
src/pudl/analysis/state_demand.py                    241    196  18.67%
src/pudl/analysis/timeseries_cleaning.py             460    409  11.09%
src/pudl/cli/etl.py                                   46      9  80.43%
src/pudl/convert/censusdp1tract_to_sqlite.py          30     19  36.67%
src/pudl/convert/datasette_metadata_to_yml.py         21     12  42.86%
src/pudl/convert/epacems_to_parquet.py                34     23  32.35%
src/pudl/convert/metadata_to_rst.py                   25     16  36.00%
src/pudl/etl/eia_bulk_elec_assets.py                   9      4  55.56%
src/pudl/etl/epacems_assets.py                        50     26  48.00%
src/pudl/etl/glue_assets.py                          112     85  24.11%
src/pudl/etl/static_assets.py                         24     10  58.33%
src/pudl/extract/dbf.py                              255    179  29.80%
src/pudl/extract/eia860.py                            63     35  44.44%
src/pudl/extract/eia860m.py                           30     18  40.00%
src/pudl/extract/eia861.py                            38     22  42.11%
src/pudl/extract/eia923.py                            51     33  35.29%
src/pudl/extract/eia_bulk_elec.py                     43     34  20.93%
src/pudl/extract/epacems.py                           42     20  52.38%
src/pudl/extract/excel.py                            130     93  28.46%
src/pudl/extract/ferc1.py                            112     67  40.18%
src/pudl/extract/ferc714.py                           20     11  45.00%
src/pudl/extract/xbrl.py                              70     49  30.00%
src/pudl/glue/ferc1_eia.py                           135     91  32.59%
src/pudl/helpers.py                                  389    314  19.28%
src/pudl/io_managers.py                              230    163  29.13%
src/pudl/logging_helpers.py                           13      0 100.00%
src/pudl/metadata/classes.py                         825    353  57.21%
src/pudl/metadata/codes.py                             4      0 100.00%
src/pudl/metadata/constants.py                        21      0 100.00%
src/pudl/metadata/dfs.py                              12      0 100.00%
src/pudl/metadata/enums.py                            21      0 100.00%
src/pudl/metadata/fields.py                           25      0 100.00%
src/pudl/metadata/helpers.py                         183    115  37.16%
src/pudl/metadata/labels.py                           12      0 100.00%
src/pudl/metadata/resources/allocate_gen_fuel.py       4      0 100.00%
src/pudl/metadata/resources/eia860.py                  3      0 100.00%
src/pudl/metadata/resources/eia861.py                  2      0 100.00%
src/pudl/metadata/resources/eia923.py                  4      0 100.00%
src/pudl/metadata/resources/eia.py                     4      0 100.00%
src/pudl/metadata/resources/eia_bulk_elec.py           2      0 100.00%
src/pudl/metadata/resources/epacems.py                 3      0 100.00%
src/pudl/metadata/resources/ferc1.py                   4      0 100.00%
src/pudl/metadata/resources/ferc714.py                 3      0 100.00%
src/pudl/metadata/resources/glue.py                    3      0 100.00%
src/pudl/metadata/resources/mcoe.py                    4      0 100.00%
src/pudl/metadata/resources/pudl.py                    3      0 100.00%
src/pudl/metadata/sources.py                           5      0 100.00%
src/pudl/output/censusdp1tract.py                     28     17  39.29%
src/pudl/output/eia860.py                             17      9  47.06%
src/pudl/output/eia923.py                             88     56  36.36%
src/pudl/output/eia.py                               176    149  15.34%
src/pudl/output/eia_bulk_elec.py                      14      8  42.86%
src/pudl/output/epacems.py                            31     21  32.26%
src/pudl/output/ferc1.py                             558    380  31.90%
src/pudl/output/ferc714.py                           160    129  19.38%
src/pudl/output/pudltabl.py                          112     91  18.75%
src/pudl/resources.py                                 17      7  58.82%
src/pudl/settings.py                                 253     41  83.79%
src/pudl/transform/classes.py                        391    205  47.57%
src/pudl/transform/eia860.py                         205    175  14.63%
src/pudl/transform/eia861.py                         398    329  17.34%
src/pudl/transform/eia923.py                         231    198  14.29%
src/pudl/transform/eia.py                            295    243  17.63%
src/pudl/transform/eia_bulk_elec.py                   26     21  19.23%
src/pudl/transform/epacems.py                         34     22  35.29%
src/pudl/transform/ferc1.py                         1361    950  30.20%
src/pudl/transform/ferc714.py                         76     51  32.89%
src/pudl/transform/params/ferc1.py                    67      0 100.00%
src/pudl/validate.py                                 348    199  42.82%
src/pudl/workspace/datastore.py                      298    213  28.52%
src/pudl/workspace/resource_cache.py                 122     65  46.72%
src/pudl/workspace/setup.py                           73     34  53.42%
src/pudl/workspace/setup_cli.py                       29     21  27.59%
-----------------------------------------------------------------------
TOTAL                                              10494   7026  33.05%
bendnorman commented 10 months ago

I love the idea of making the FERC extraction and PUDL ETL pre-pytest tasks, rather than expensive fixtures. For some reason when a multi process job is launch within pytest, pudl logs are swallowed and we only get junk dagster logs. Moving the ETL out of pytest and using --live-dbs should resolve this issue.