zaneselvans closed this pull request 1 year ago
@katie-lamb I updated the CI and package requirements to only ever use Python 3.10 (and to use the new bot automerge PR workflows) so hopefully your 3.8/3.9 dependency resolution issues can be ignored.
What's the max memory usage of the tests at this point? Is it close to 7GB? I imagine the runner uses some memory on its own just to exist and have an OS, and even if the tests themselves were a bit under 7GB, the sum of the overhead and the tests could still be too much.
When I run it locally, the max memory usage is 5.7 GB (when the plant parts list is generated in `test_ppl_out`). The fact that it gets through `test_ppl_out` and fails during `test_ferc1_to_eia` maybe means it's not cleaning up memory as quickly as when I run it locally? For example, maybe the plant parts list should be explicitly cleaned up after `test_deprish_to_eia_out`.
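A sketch of what that explicit cleanup could look like, assuming the cached outputs live in a dict-like attribute such as `pudl_out._dfs` (the helper name and cache contents here are hypothetical):

```python
import gc


def release_cached_dfs(cache: dict) -> None:
    """Hypothetical helper: drop cached outputs (e.g. pudl_out._dfs)
    and force a collection pass so memory is returned promptly."""
    cache.clear()
    gc.collect()


# Stand-in for a cache of large intermediate outputs built during a test.
cache = {"plant_parts_eia": list(range(1_000_000))}
release_cached_dfs(cache)
assert not cache  # the next test starts from an empty cache
```

A `pytest` fixture teardown (after the `yield`) would be a natural place to call something like this between tests.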
Maybe this already came up, but I'm sure there's some overhead for the test runner itself beyond just our Python code, and I suspect that the 7GB limit probably covers everything that's running on the runner? Not just our tests?
> Maybe this already came up, but I'm sure there's some overhead for the test runner itself beyond just our Python code, and I suspect that the 7GB limit probably covers everything that's running on the runner? Not just our tests?
Yep, it seems like there's at least 2GB of overhead for the test runner and the stored outputs.
Current update: all the integration tests pass; the validation tests don't, but they'll be fixed at some point (and the failures aren't memory related!).
Summary of changes:

**Plant Parts List**

- `pudl_out` inputs used in the creation of the plant parts list are deleted after they're used in the PUDL plant parts list module, and `pudl_out._dfs` is cleared to free memory.
- These changes live on the `rmi-ci-fixes` branch of the PUDL repo. This branch should be merged first?

**FERC1 to EIA connection**

- Changes in `coordinate.py` so this happens before the FERC to EIA connection takes place. Additionally, I added an option to prep this connection in `rmi_out.plant_parts_eia`, which is called in `test_plant_parts_eia`, so this is done while the plant parts list is already loaded there.
- The same happens in `rmi_out.plant_parts_eia` when the plant parts list is loaded in `test_plant_parts_eia` and is then pickled. Now the distinct plant parts list can be read in directly and passed to `connect_ferc1_to_eia.execute()`.
- One alternative change I thought of was to save the PPL as a CSV or Parquet file so that specific columns can be read in. The only place where this is especially helpful is when connecting the training data to EIA true gran records. It's kind of nice to have all the outputs saved as pickled dataframes.
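That column-subset alternative could look something like this (a sketch using CSV with `usecols`; the same idea works with Parquet via `read_parquet`'s `columns` argument, and the column names here are just illustrative):

```python
import io

import pandas as pd

# Hypothetical slice of the plant parts list, with more columns than any
# one consumer needs.
ppl = pd.DataFrame(
    {
        "record_id_eia": ["a_1", "b_1"],
        "plant_name_ppe": ["alpha", "beta"],
        "capacity_mw": [100.0, 50.0],
    }
)

# Round-trip through CSV (a file path would work the same as this buffer).
buf = io.StringIO()
ppl.to_csv(buf, index=False)
buf.seek(0)

# Unlike a pickled DataFrame, a CSV/Parquet file lets you load only the
# columns a given step actually needs, e.g. for matching training data.
subset = pd.read_csv(buf, usecols=["record_id_eia", "capacity_mw"])
```

The trade-off is exactly the one noted above: pickles preserve dtypes and indexes for free, while CSV/Parquet buy partial reads.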
FYI a lot of the little changes have nothing to do with memory issues and are changes from the `bot-auto-merge` branch being merged in.
So glad to see just the real breakage!
@cmgosnell do you want to address the validation errors in this PR? Or do that separately?
@katie-lamb what changes do you still need to make or get merged into `dev` in PUDL for this PR to work?
> @katie-lamb what all changes do you still need to make or get merged into `dev` in the PUDL for this PR to work?
@zaneselvans @cmgosnell I made a PUDL PR that handles the data type and cache clearing during the plant parts list creation. This should get merged into `dev` before this RMI PR.
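Pinning an environment to a repo branch rather than a release is usually a one-line change; a hedged sketch (the package spec below is how PUDL's repo is typically referenced, but check the project's own install docs):

```shell
# Hypothetical: point the pudl-rmi environment at PUDL's dev branch
# instead of a PyPI release. In a requirements/environment file this
# would be the same git+https line without "pip install".
pip install "git+https://github.com/catalyst-cooperative/pudl.git@dev"
```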
:exclamation: No coverage uploaded for pull request base (`main@d9b708d`). Patch has no changes to coverable lines.
Update update update (for @cmgosnell):

The CI now passes wowww! Major changes:

- Running `tox` or `pytest` without the `--five-year-coverage` argument will still run the CI on all the years of data.
- There are new `expected_errors` CSV files for the data validation tests for the five year coverage. These roughly match the original `expected_errors` CSV files for all the data, so I just went with it. But maybe there's a better way to check these expected errors files and make sure there's nothing wrong. The `expected_errors` for all years of data probably also need to be updated so that the validation tests pass.
- The `pudl` installation in the `pudl-rmi` environment now pulls from the `dev` branch instead of the released version of PUDL. This might lead to more/faster maintenance down the line, but it's also way nicer for modifying PUDL and having changes appear in this repo.
- The `plant_name_new` and `ownership` columns in the plant parts list were renamed to `plant_name_ppe` and `ownership_record_type` respectively when it's created in PUDL. These name changes are reflected in this PR. Additionally, `pudl.analysis.plant_parts_eia.PLANT_PARTS_ORDERED` was taken out and now the `PLANT_PARTS` global dictionary does all the work.

I'm not entirely sure why one of the data validation tests isn't passing. It seems like a weird type error because I'm pretty sure the expected and actual data are the same. Need to look into this.
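For the record, the intended semantics of a `--five-year-coverage` flag (run everything unless the flag is passed) could be registered in a `conftest.py` along these lines; this is a hypothetical sketch, not the repo's actual implementation, and `ALL_YEARS`/`years_to_test` are made-up names:

```python
# Hypothetical conftest.py sketch for an opt-in coverage-limiting flag.
ALL_YEARS = list(range(2001, 2021))


def pytest_addoption(parser):
    parser.addoption(
        "--five-year-coverage",
        action="store_true",
        default=False,
        help="Only validate the five most recent years of data.",
    )


def years_to_test(five_year_coverage: bool) -> list[int]:
    """Without the flag, the CI still covers all the years of data."""
    return ALL_YEARS[-5:] if five_year_coverage else ALL_YEARS
```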
I'm confused about why the validation tests are failing when checking whether the indexes are equal. It seems to be an issue with the type of the `report_year` index level (the actual index is an `Index` while the index read in from the CSV is an `Int64Index`). I set the argument `exact=False` and the assert still fails, even though it's supposed to ignore type differences.
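One workaround that doesn't depend on how strictly `exact=False` treats Index subclasses is to normalize the dtype on both sides before comparing. A minimal sketch (the index values here are made up; only the cast-then-compare pattern is the point):

```python
import pandas as pd

# The index built in memory came through as a generic object-dtype Index...
actual = pd.Index([2019, 2020], dtype="object", name="report_year")
# ...while the one parsed from the expected-errors CSV is integer-typed.
expected = pd.Index([2019, 2020], dtype="int64", name="report_year")

# Casting both sides to a common dtype before comparing sidesteps any
# class/dtype strictness in the equality check.
pd.testing.assert_index_equal(
    actual.astype("int64"),
    expected.astype("int64"),
    exact=False,
)
```

If the mismatch only shows up for indexes read back from CSV, passing an explicit `dtype=` to `read_csv` for the index column is another way to attack it at the source.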
Okay @cmgosnell whenever you have a chance to look, this seems to work as well as it can given the current state of the `main` branch. The tests fail locally with Index mismatch errors that I think we've already talked about on the other long-running development branch, but the CI and other infrastructure stuff seems to work fine.