catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Why did our integration tests get so slow? #2494

Closed zaneselvans closed 1 year ago

zaneselvans commented 1 year ago

Working on like 5 PRs simultaneously today, I had to dig into why our integration tests have suddenly gotten sooo slooow.

Running the same command as the tests use:

tox -e integration -- --gcs-cache-path=gs://zenodo-cache.catalyst.coop --durations 0

the integration test took 45.5 minutes for me locally (including 85 seconds for environment setup). By contrast, the last time I ran this test in CI on a PR today, it took 2 hours and 11 minutes or 131 minutes -- nearly 3x as long!

There's no parallelism in the CI runners, so my having 10 CPUs doesn't seem like it should matter. I guess my M1 Pro chip is faster even on a single core, but I'm still wondering if something got messed up with input file caching, or we're accidentally building the whole DB multiple times now. The integration tests really shouldn't take this long!

The --durations 0 shows how long the various tests take at the end of the test run, and there are two tests that are taking wildly longer than they should, and the thing that's taking forever is the setup for the test, not actually running the tests. The setup is when either files would get downloaded or the database would be created as a fixture (running the whole ETL) if the tests couldn't find it, so something seems at least a little bit wrong. The DB obviously does have to get built at least once, and I imagine that the extremely long setup time for test_epacems_subset is probably because that's the first test that gets run which depends on the PUDL DB?

Slowest tests locally:

Slowest tests in CI on GitHub:

How long does it take to build the PUDL DB in CI?

Locally, running the test that just builds the DB:

pytest test/integration/etl_test.py::test_pudl_engine

only takes 5.5 minutes (327 seconds) to run, so it seems like something is wonky with the CI somehow.

Looking through the logs for the last integration test run, it looks like it is re-downloading all of the data from GCP, rather than using the cache of data that should be available on the GitHub runner. There are 45 instances of "not optimally cached" inputs:

Expand to see the files that got downloaded ``` 2023-04-04T20:44:46.2891951Z 2023-04-04 20:44:46 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(ferc1/10.5281/zenodo.7314437/ferc1-2020.zip) was not optimally cached yet, adding. 2023-04-04T20:45:51.6478108Z 2023-04-04 20:45:51 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(ferc1/10.5281/zenodo.7314437/ferc1-xbrl-2021.zip) was not optimally cached yet, adding. 2023-04-04T20:54:04.4130407Z 2023-04-04 20:54:04 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(ferc2/10.5281/zenodo.7130128/ferc2-xbrl-2021.zip) was not optimally cached yet, adding. 2023-04-04T20:57:53.0289333Z 2023-04-04 20:57:53 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(ferc6/10.5281/zenodo.7130141/ferc6-xbrl-2021.zip) was not optimally cached yet, adding. 2023-04-04T21:01:55.9620473Z 2023-04-04 21:01:55 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(ferc60/10.5281/zenodo.7130146/ferc60-xbrl-2021.zip) was not optimally cached yet, adding. 2023-04-04T21:03:04.1870912Z 2023-04-04 21:03:04 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(ferc714/10.5281/zenodo.7139875/ferc714-xbrl-2021.zip) was not optimally cached yet, adding. 2023-04-04T21:06:42.7682345Z 2023-04-04 21:06:42 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia860/10.5281/zenodo.7113854/eia860-2020.zip) was not optimally cached yet, adding. 2023-04-04T21:06:44.9281185Z 2023-04-04 21:06:44 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia860/10.5281/zenodo.7113854/eia860-2021.zip) was not optimally cached yet, adding. 2023-04-04T21:09:43.3576618Z 2023-04-04 21:09:43 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia860m/10.5281/zenodo.7320218/eia860m-2022-09.xlsx) was not optimally cached yet, adding. 2023-04-04T21:10:14.0325537Z 2023-04-04 21:10:14 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2007.zip) was not optimally cached yet, adding. 2023-04-04T21:10:14.4878340Z 2023-04-04 21:10:14 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2008.zip) was not optimally cached yet, adding. 2023-04-04T21:10:14.9420595Z 2023-04-04 21:10:14 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2009.zip) was not optimally cached yet, adding. 2023-04-04T21:10:15.3676401Z 2023-04-04 21:10:15 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2010.zip) was not optimally cached yet, adding. 2023-04-04T21:10:15.7763837Z 2023-04-04 21:10:15 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2011.zip) was not optimally cached yet, adding. 2023-04-04T21:10:16.2328866Z 2023-04-04 21:10:16 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2012.zip) was not optimally cached yet, adding. 2023-04-04T21:10:16.7181774Z 2023-04-04 21:10:16 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2013.zip) was not optimally cached yet, adding. 2023-04-04T21:10:17.4665692Z 2023-04-04 21:10:17 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2014.zip) was not optimally cached yet, adding. 2023-04-04T21:10:18.1424475Z 2023-04-04 21:10:18 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2015.zip) was not optimally cached yet, adding. 2023-04-04T21:10:20.9750294Z 2023-04-04 21:10:20 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2016.zip) was not optimally cached yet, adding. 2023-04-04T21:10:24.2453037Z 2023-04-04 21:10:24 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2017.zip) was not optimally cached yet, adding. 2023-04-04T21:10:27.1814696Z 2023-04-04 21:10:27 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2018.zip) was not optimally cached yet, adding. 2023-04-04T21:10:30.2165307Z 2023-04-04 21:10:30 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2019.zip) was not optimally cached yet, adding. 2023-04-04T21:10:34.0278310Z 2023-04-04 21:10:34 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2020.zip) was not optimally cached yet, adding. 2023-04-04T21:10:37.5927674Z 2023-04-04 21:10:37 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2021.zip) was not optimally cached yet, adding. 2023-04-04T21:10:40.7773898Z 2023-04-04 21:10:40 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2001.zip) was not optimally cached yet, adding. 2023-04-04T21:10:41.1191895Z 2023-04-04 21:10:41 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2002.zip) was not optimally cached yet, adding. 2023-04-04T21:10:41.5024609Z 2023-04-04 21:10:41 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2003.zip) was not optimally cached yet, adding. 2023-04-04T21:10:41.8946760Z 2023-04-04 21:10:41 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2004.zip) was not optimally cached yet, adding. 2023-04-04T21:10:42.2807816Z 2023-04-04 21:10:42 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2005.zip) was not optimally cached yet, adding. 2023-04-04T21:10:42.6814711Z 2023-04-04 21:10:42 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia861/10.5281/zenodo.7191809/eia861-2006.zip) was not optimally cached yet, adding. 2023-04-04T21:13:08.7756045Z 2023-04-04 21:13:08 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia923/10.5281/zenodo.7236677/eia923-2020.zip) was not optimally cached yet, adding. 2023-04-04T21:13:24.0398626Z 2023-04-04 21:13:24 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia923/10.5281/zenodo.7236677/eia923-2021.zip) was not optimally cached yet, adding. 2023-04-04T21:15:48.8645605Z 2023-04-04 21:15:48 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(ferc714/10.5281/zenodo.7139875/ferc714.zip) was not optimally cached yet, adding. 2023-04-04T21:16:01.3936177Z 2023-04-04 21:16:01 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia_bulk_elec/10.5281/zenodo.7067367/eia_bulk_elec.zip) was not optimally cached yet, adding. 2023-04-04T21:39:53.9709802Z 2023-04-04 21:39:53 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(epacamd_eia/10.5281/zenodo.7650939/epacamd_eia.zip) was not optimally cached yet, adding. 2023-04-04T21:39:55.5640294Z 2023-04-04 21:39:55 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(epacems/10.5281/zenodo.6910058/epacems-2019-id.zip) was not optimally cached yet, adding. 2023-04-04T21:39:58.2138417Z 2023-04-04 21:39:58 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(epacems/10.5281/zenodo.6910058/epacems-2019-me.zip) was not optimally cached yet, adding. 2023-04-04T21:40:00.6899285Z 2023-04-04 21:40:00 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(epacems/10.5281/zenodo.6910058/epacems-2020-id.zip) was not optimally cached yet, adding. 2023-04-04T21:40:02.7932351Z 2023-04-04 21:40:02 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(epacems/10.5281/zenodo.6910058/epacems-2020-me.zip) was not optimally cached yet, adding. 2023-04-04T21:40:05.2868402Z 2023-04-04 21:40:05 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(epacems/10.5281/zenodo.6910058/epacems-2021-id.zip) was not optimally cached yet, adding. 2023-04-04T21:40:07.3993695Z 2023-04-04 21:40:07 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(epacems/10.5281/zenodo.6910058/epacems-2021-me.zip) was not optimally cached yet, adding. 2023-04-04T21:40:38.4285618Z 2023-04-04 21:40:38 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(ferc1/10.5281/zenodo.7314437/ferc1-2018.zip) was not optimally cached yet, adding. 2023-04-04T21:40:40.4291036Z 2023-04-04 21:40:40 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia860/10.5281/zenodo.7113854/eia860-2018.zip) was not optimally cached yet, adding. 2023-04-04T21:40:41.4876971Z 2023-04-04 21:40:41 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(eia923/10.5281/zenodo.7236677/eia923-2018.zip) was not optimally cached yet, adding. 2023-04-04T22:18:21.6309463Z 2023-04-04 22:18:21 [ INFO] catalystcoop.pudl.workspace.datastore:366 Resource(censusdp1tract/10.5281/zenodo.4127049/censusdp1tract-2010.zip) was not optimally cached yet, adding. ```

And in the early setup part of the run there's:

2023-04-04T20:42:08.0923626Z Cache not found for input keys: zenodo-datastore-26acc897707a24cca6ee674904732c55cdb33dd456ba90198d78c400dfaad872

The test that appears to be kicking off the creation of the DB is Datasette metadata to YML, which is unexpected (to me at least). This metadata conversion requires access to some of the datapackage.json files from the Zenodo archive, but I don't think it needs the DB. But maybe I'm wrong:

test/integration/datasette_metadata_test.py::test_datasette_metadata_to_yml 
-------------------------------- live log setup --------------------------------
2023-04-04 20:44:43 [ WARNING] catalystcoop.pudl.workspace.setup:172 pudl_settings is being deprecated in favor of environment variables PUDL_OUTPUT and PUDL_INPUT. For more info see: https://catalystcoop-pudl.readthedocs.io/en/dev/dev/dev_setup.html
2023-04-04 20:44:43 [ WARNING] catalystcoop.pudl.workspace.setup:193 sqlite and parquet directories are no longer being used. Make sure there is a single directory named 'output' at the root of your workspace. For more info see: https://catalystcoop-pudl.readthedocs.io/en/dev/dev/dev_setup.html
2023-04-04 20:44:43 [    INFO] test.conftest:93 PUDL_OUTPUT path: /tmp/pytest-of-runner/pytest-0/pudl0/output
2023-04-04 20:44:43 [    INFO] test.conftest:94 PUDL_INPUT path: /home/runner/pudl-work/data

What happens if I run

tox -e integration -- --durations 0

without telling it to look at the GCS cache locally? No substantial difference in time, and it still finds all of the cached inputs in the local filesystem without resorting to GCS or Zenodo.

So maybe we just have a problem with the GitHub cache? Is the path for PUDL_INPUT different from our old cache location? Looking at the log output, it seems to be looking for the cache in the right place:

2023-04-04T20:42:07.7930088Z ##[group]Run actions/cache@v3
2023-04-04T20:42:07.7930380Z with:
2023-04-04T20:42:07.7930617Z   path: /home/runner/pudl-work/data/
2023-04-04T20:42:07.7931016Z   key: zenodo-datastore-26acc897707a24cca6ee674904732c55cdb33dd456ba90198d78c400dfaad872

And then after this particular job at least, it seems to be trying to save a new cache with the same hash:

2023-04-04T22:53:18.5076888Z Post job cleanup.
2023-04-04T22:53:18.7139599Z [command]/usr/bin/tar --posix -cf cache.tzst --exclude cache.tzst -P -C /home/runner/work/pudl/pudl --files-from manifest.txt --use-compress-program zstdmt
2023-04-04T22:53:52.2949515Z Cache Size: ~1079 MB (1131914152 B)
2023-04-04T22:53:52.4693569Z Cache saved successfully
2023-04-04T22:53:52.6583134Z Cache saved with key: zenodo-datastore-26acc897707a24cca6ee674904732c55cdb33dd456ba90198d78c400dfaad872

I still don't understand why the Datasette metadata conversion would be taking so long though. Ah, running just that test results in a full run of ferc_to_sqlite for some reason! That seems like it's probably incorrect, and it takes 2.5 minutes locally, so probably 5-10 minutes on GitHub.

Slowness Conclusions:

zaneselvans commented 1 year ago

@jdangerx this was driving me kind of nuts today. I'll keep an eye on the caching in future runs and see if they keep missing.

It looks like test_datasette_metadata_to_yml does actually depend on the XBRL extraction, so I guess that's just the test that has the honor of kicking of the XBRL extraction.

zaneselvans commented 1 year ago

Going to close this because it looks like our CI really is this slow!