zaneselvans opened 1 year ago
Maybe converting the FERC data and running the PUDL ETL outside of pytest while collecting coverage is as easy as doing what we already do inside `tox -e nuke`?

```shell
coverage run --append src/pudl/convert/ferc_to_sqlite.py src/pudl/package_data/settings/etl_fast.yml
coverage run --append src/pudl/cli/etl.py src/pudl/package_data/settings/etl_fast.yml
```
Running the ETL this way definitely pegs all my CPU cores, so it's parallelized. The coverage output file was updated, and coverage also has a parallel run mode that separates out the coverage reports by sub-process for later use with `coverage combine`. Running the ETL script only took 10min 15sec, compared to 16min 29sec with the Dagster UI, so maybe there's less overhead doing it this way?
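Coverage's parallel run mode is mostly configuration. A minimal sketch of the relevant settings, assuming our coverage config lives in `pyproject.toml` (that location is an assumption):

```toml
# Sketch of coverage.py settings for per-process data files.
[tool.coverage.run]
# Write one .coverage.<host>.<pid>.<random> data file per process,
# which `coverage combine` later merges into a single .coverage file.
parallel = true
# Also trace code running under the multiprocessing module.
concurrency = ["multiprocessing"]
source = ["src/pudl"]
```

With `parallel = true` set, `coverage run` behaves like `coverage run --parallel-mode`, and the workflow becomes: run, `coverage combine`, `coverage report`.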
The coverage report doesn't look like it's capturing everything that ran, maybe because the workers aren't subprocesses of the main process, just entirely different processes getting kicked off? This coverage looks like it's probably just the initial import of all the pudl modules.
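If the missing coverage really is from wholly separate Python processes, coverage.py does have a documented subprocess hook: point `COVERAGE_PROCESS_START` at the config file and arrange for `coverage.process_startup()` to run when each interpreter boots. A sketch, where the paths are assumptions:

```shell
# Tell every newly started Python process where the coverage config lives:
export COVERAGE_PROCESS_START="$(pwd)/pyproject.toml"
# Then arrange for measurement to start on interpreter boot, e.g. with a
# .pth file in site-packages containing this single line of Python:
#     import coverage; coverage.process_startup()
coverage run --parallel-mode src/pudl/cli/etl.py src/pudl/package_data/settings/etl_fast.yml
coverage combine
coverage report
```

Whether Dagster's spawned workers actually inherit that environment variable would need checking.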
```
coverage report --data-file .coverage.Urras.local.30314.786163

Name                                          Stmts    Miss     Cover
----------------------------------------------------------------------
src/pudl/analysis/allocate_gen_fuel.py          323     256    20.74%
src/pudl/analysis/classify_plants_ferc1.py      158     132    16.46%
src/pudl/analysis/epacamd_eia.py                 18      12    33.33%
src/pudl/analysis/ferc1_eia.py                  203     165    18.72%
src/pudl/analysis/mcoe.py                        78      48    38.46%
src/pudl/analysis/plant_parts_eia.py            257     191    25.68%
src/pudl/analysis/service_territory.py          117      86    26.50%
src/pudl/analysis/spatial.py                    112      96    14.29%
src/pudl/analysis/state_demand.py               241     196    18.67%
src/pudl/analysis/timeseries_cleaning.py        460     409    11.09%
src/pudl/cli/etl.py                              46       9    80.43%
src/pudl/convert/censusdp1tract_to_sqlite.py     30      19    36.67%
src/pudl/convert/datasette_metadata_to_yml.py    21      12    42.86%
src/pudl/convert/epacems_to_parquet.py           34      23    32.35%
src/pudl/convert/metadata_to_rst.py              25      16    36.00%
src/pudl/etl/eia_bulk_elec_assets.py              9       4    55.56%
src/pudl/etl/epacems_assets.py                   50      26    48.00%
src/pudl/etl/glue_assets.py                     112      85    24.11%
src/pudl/etl/static_assets.py                    24      10    58.33%
src/pudl/extract/dbf.py                         255     179    29.80%
src/pudl/extract/eia860.py                       63      35    44.44%
src/pudl/extract/eia860m.py                      30      18    40.00%
src/pudl/extract/eia861.py                       38      22    42.11%
src/pudl/extract/eia923.py                       51      33    35.29%
src/pudl/extract/eia_bulk_elec.py                43      34    20.93%
src/pudl/extract/epacems.py                      42      20    52.38%
src/pudl/extract/excel.py                       130      93    28.46%
src/pudl/extract/ferc1.py                       112      67    40.18%
src/pudl/extract/ferc714.py                      20      11    45.00%
src/pudl/extract/xbrl.py                         70      49    30.00%
src/pudl/glue/ferc1_eia.py                      135      91    32.59%
src/pudl/helpers.py                             389     314    19.28%
src/pudl/io_managers.py                         230     163    29.13%
src/pudl/logging_helpers.py                      13       0   100.00%
src/pudl/metadata/classes.py                    825     353    57.21%
src/pudl/metadata/codes.py                        4       0   100.00%
src/pudl/metadata/constants.py                   21       0   100.00%
src/pudl/metadata/dfs.py                         12       0   100.00%
src/pudl/metadata/enums.py                       21       0   100.00%
src/pudl/metadata/fields.py                      25       0   100.00%
src/pudl/metadata/helpers.py                    183     115    37.16%
src/pudl/metadata/labels.py                      12       0   100.00%
src/pudl/metadata/resources/allocate_gen_fuel.py  4       0   100.00%
src/pudl/metadata/resources/eia860.py             3       0   100.00%
src/pudl/metadata/resources/eia861.py             2       0   100.00%
src/pudl/metadata/resources/eia923.py             4       0   100.00%
src/pudl/metadata/resources/eia.py                4       0   100.00%
src/pudl/metadata/resources/eia_bulk_elec.py      2       0   100.00%
src/pudl/metadata/resources/epacems.py            3       0   100.00%
src/pudl/metadata/resources/ferc1.py              4       0   100.00%
src/pudl/metadata/resources/ferc714.py            3       0   100.00%
src/pudl/metadata/resources/glue.py               3       0   100.00%
src/pudl/metadata/resources/mcoe.py               4       0   100.00%
src/pudl/metadata/resources/pudl.py               3       0   100.00%
src/pudl/metadata/sources.py                      5       0   100.00%
src/pudl/output/censusdp1tract.py                28      17    39.29%
src/pudl/output/eia860.py                        17       9    47.06%
src/pudl/output/eia923.py                        88      56    36.36%
src/pudl/output/eia.py                          176     149    15.34%
src/pudl/output/eia_bulk_elec.py                 14       8    42.86%
src/pudl/output/epacems.py                       31      21    32.26%
src/pudl/output/ferc1.py                        558     380    31.90%
src/pudl/output/ferc714.py                      160     129    19.38%
src/pudl/output/pudltabl.py                     112      91    18.75%
src/pudl/resources.py                            17       7    58.82%
src/pudl/settings.py                            253      41    83.79%
src/pudl/transform/classes.py                   391     205    47.57%
src/pudl/transform/eia860.py                    205     175    14.63%
src/pudl/transform/eia861.py                    398     329    17.34%
src/pudl/transform/eia923.py                    231     198    14.29%
src/pudl/transform/eia.py                       295     243    17.63%
src/pudl/transform/eia_bulk_elec.py              26      21    19.23%
src/pudl/transform/epacems.py                    34      22    35.29%
src/pudl/transform/ferc1.py                    1361     950    30.20%
src/pudl/transform/ferc714.py                    76      51    32.89%
src/pudl/transform/params/ferc1.py               67       0   100.00%
src/pudl/validate.py                            348     199    42.82%
src/pudl/workspace/datastore.py                 298     213    28.52%
src/pudl/workspace/resource_cache.py            122      65    46.72%
src/pudl/workspace/setup.py                      73      34    53.42%
src/pudl/workspace/setup_cli.py                  29      21    27.59%
----------------------------------------------------------------------
TOTAL                                         10494    7026    33.05%
```
I love the idea of making the FERC extraction and PUDL ETL pre-pytest tasks rather than expensive fixtures. For some reason, when a multi-process job is launched within pytest, the pudl logs get swallowed and we only get junk Dagster logs. Moving the ETL out of pytest and using `--live-dbs` should resolve this issue.
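The pre-pytest sequence might end up looking something like this sketch (the coverage flags here are an assumption about how we'd wire it up, not anything settled):

```shell
# Build the FERC databases and run the ETL outside of pytest,
# then run the tests against the already-populated databases.
coverage run --parallel-mode src/pudl/convert/ferc_to_sqlite.py src/pudl/package_data/settings/etl_fast.yml
coverage run --parallel-mode src/pudl/cli/etl.py src/pudl/package_data/settings/etl_fast.yml
coverage run --parallel-mode -m pytest --live-dbs test/integration
coverage combine
coverage report
```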
Our integration tests are painfully slow in CI, and are also currently running out of memory after running for about an hour, but we're on the verge of being able to change all that pretty dramatically. In the past we have been unable to effectively split up the tests and run them in parallel because they all depended on the shared `PudlTabl` output object. Now we've got all but one of the output tables pulled into the DB, and will soon be able to deprecate `PudlTabl` and remove it from our tests entirely.

Because all of the ETL steps are now being done within Dagster, that computation can be run in parallel very effectively -- the fast tests saturate the 10 cores on my laptop for the full duration of the tests, so we might even be able to use a 16 or 32-CPU GitHub runner and get ~linear reductions in runtime.

On a 10-core MacBook Pro with 32GB of memory:
`pytest --live-dbs test/integration`: 3min 21sec (2min 10sec of which is reading from the DB)
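Once `PudlTabl` is out of the tests, the usual way to split a pytest run across those cores would be pytest-xdist (an assumption here, not something we've set up):

```shell
# Requires `pip install pytest-xdist`; -n auto starts one worker per CPU core.
pytest -n auto --live-dbs test/integration
```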