zaneselvans closed this pull request 1 year ago
@katie-lamb I updated the CI and package requirements to only ever use Python 3.10 (and to use the new bot automerge PR workflows) so hopefully your 3.8/3.9 dependency resolution issues can be ignored.
What's the max memory usage of the tests at this point? Is it close to 7GB? I imagine the runner uses some memory on its own just to exist and have an OS, and even if the tests themselves were a bit under 7GB, the sum of the overhead and the tests could still be too much.
When I run it locally, the max memory usage is 5.7 GB (when the plant parts list is generated in `test_ppl_out`). The fact that it gets through `test_ppl_out` and fails during `test_ferc1_to_eia` maybe means it's not cleaning up memory as quickly as when I run it locally? For example, maybe the plant parts list should be explicitly cleaned up after `test_deprish_to_eia_out`.
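A sketch of what that explicit cleanup could look like, assuming the cached outputs live in a dict-like attribute such as `pudl_out._dfs` (the helper name and cache contents here are hypothetical):

```python
import gc


def release_cached_dfs(cache: dict) -> None:
    """Hypothetical helper: drop cached outputs (e.g. pudl_out._dfs)
    and force a collection pass so memory is returned promptly."""
    cache.clear()
    gc.collect()


# Stand-in for a cache of large intermediate outputs built during a test.
cache = {"plant_parts_eia": list(range(1_000_000))}
release_cached_dfs(cache)
assert not cache  # the next test starts from an empty cache
```

A `pytest` fixture teardown (after the `yield`) would be a natural place to call something like this between tests.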
Maybe this already came up, but I'm sure there's some overhead for the test runner itself beyond just our Python code, and I suspect that the 7GB limit probably covers everything that's running on the runner? Not just our tests?
> Maybe this already came up, but I'm sure there's some overhead for the test runner itself beyond just our Python code, and I suspect that the 7GB limit probably covers everything that's running on the runner? Not just our tests?
Yep, it seems like there's at least 2GB of overhead for the test runner and the stored outputs.
Current update: all the integration tests pass; the validation tests don't, but they'll be fixed at some point (and the failures aren't memory related!).
Summary of changes:

**Plant Parts List**

- `pudl_out` inputs used in the creation of the plant parts list are deleted after they're used in the PUDL plant parts list module, and `pudl_out._dfs` is cleared to free memory.
- These changes live on the `rmi-ci-fixes` branch of the PUDL repo. This branch should be merged first?

**FERC1 to EIA connection**

- Changes in `coordinate.py` so this happens before the FERC to EIA connection takes place. Additionally, I added an option to prep this connection in `rmi_out.plant_parts_eia`, which is called in `test_plant_parts_eia`, so this is done while the plant parts list is already loaded there.
- The same happens in `rmi_out.plant_parts_eia` when the plant parts list is loaded in `test_plant_parts_eia` and is then pickled. Now the distinct plant parts list can be read in directly and passed to `connect_ferc1_to_eia.execute()`.
- One alternative change I thought of was to save the PPL as a CSV or Parquet file so that specific columns can be read in. The only place where this is especially helpful is when connecting the training data to EIA true gran records. It's kind of nice to have all the outputs saved as pickled dataframes.
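That column-subset alternative could look something like this (a sketch using CSV with `usecols`; the same idea works with Parquet via `read_parquet`'s `columns` argument, and the column names here are just illustrative):

```python
import io

import pandas as pd

# Hypothetical slice of the plant parts list, with more columns than any
# one consumer needs.
ppl = pd.DataFrame(
    {
        "record_id_eia": ["a_1", "b_1"],
        "plant_name_ppe": ["alpha", "beta"],
        "capacity_mw": [100.0, 50.0],
    }
)

# Round-trip through CSV (a file path would work the same as this buffer).
buf = io.StringIO()
ppl.to_csv(buf, index=False)
buf.seek(0)

# Unlike a pickled DataFrame, a CSV/Parquet file lets you load only the
# columns a given step actually needs, e.g. for matching training data.
subset = pd.read_csv(buf, usecols=["record_id_eia", "capacity_mw"])
```

The trade-off is exactly the one noted above: pickles preserve dtypes and indexes for free, while CSV/Parquet buy partial reads.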
FYI a lot of the little changes have nothing to do with memory issues and are changes from the `bot-auto-merge` branch being merged in.
So glad to see just the real breakage!
@cmgosnell do you want to address the validation errors in this PR? Or do that separately?
@katie-lamb what changes do you still need to make or get merged into `dev` in PUDL for this PR to work?
> @katie-lamb what all changes do you still need to make or get merged into `dev` in the PUDL for this PR to work?
@zaneselvans @cmgosnell I made a PUDL PR that handles the data type and cache clearing during the plant parts list creation. This should get merged into `dev` before this RMI PR.
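Pinning an environment to a repo branch rather than a release is usually a one-line change; a hedged sketch (the package spec below is how PUDL's repo is typically referenced, but check the project's own install docs):

```shell
# Hypothetical: point the pudl-rmi environment at PUDL's dev branch
# instead of a PyPI release. In a requirements/environment file this
# would be the same git+https line without "pip install".
pip install "git+https://github.com/catalyst-cooperative/pudl.git@dev"
```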
:exclamation: No coverage uploaded for pull request base (`main@d9b708d`). Patch has no changes to coverable lines.
Update update update (for @cmgosnell):

The CI now passes wowww! Major changes:

- Running `tox` or `pytest` without the `--five-year-coverage` argument will still run the CI on all the years of data.
- There are new `expected_errors` CSV files for the data validation tests for the five year coverage. These roughly match the original `expected_errors` CSV files for all the data, so I just went with it. But maybe there's a better way to check these expected errors files and make sure there's nothing wrong. The `expected_errors` for all years of data probably also need to be updated so that the validation tests pass.
- The `pudl` installation in the `pudl-rmi` environment now pulls from the `dev` branch instead of the released version of PUDL. This might lead to more/faster maintenance down the line, but it's also way nicer for modifying PUDL and having changes appear in this repo.
- The `plant_name_new` and `ownership` columns in the plant parts list were renamed to `plant_name_ppe` and `ownership_record_type` respectively when it's created in PUDL. These name changes are reflected in this PR. Additionally, `pudl.analysis.plant_parts_eia.PLANT_PARTS_ORDERED` was taken out and now the `PLANT_PARTS` global dictionary does all the work.

I'm not entirely sure why one of the data validation tests isn't passing. It seems like a weird type error because I'm pretty sure the expected and actual data are the same. Need to look into this.
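For the record, the intended semantics of a `--five-year-coverage` flag (run everything unless the flag is passed) could be registered in a `conftest.py` along these lines; this is a hypothetical sketch, not the repo's actual implementation, and `ALL_YEARS`/`years_to_test` are made-up names:

```python
# Hypothetical conftest.py sketch for an opt-in coverage-limiting flag.
ALL_YEARS = list(range(2001, 2021))


def pytest_addoption(parser):
    parser.addoption(
        "--five-year-coverage",
        action="store_true",
        default=False,
        help="Only validate the five most recent years of data.",
    )


def years_to_test(five_year_coverage: bool) -> list[int]:
    """Without the flag, the CI still covers all the years of data."""
    return ALL_YEARS[-5:] if five_year_coverage else ALL_YEARS
```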
I'm confused about why the validation tests are failing when checking whether the indexes are equal. It seems to be an issue with the type of the `report_year` index level (the actual index is an `Index` while the index read in from the CSV is an `Int64Index`). I set the argument `exact=False` and the assert still fails, even though it's supposed to ignore type differences.
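One workaround that doesn't depend on how strictly `exact=False` treats Index subclasses is to normalize the dtype on both sides before comparing. A minimal sketch (the index values here are made up; only the cast-then-compare pattern is the point):

```python
import pandas as pd

# The index built in memory came through as a generic object-dtype Index...
actual = pd.Index([2019, 2020], dtype="object", name="report_year")
# ...while the one parsed from the expected-errors CSV is integer-typed.
expected = pd.Index([2019, 2020], dtype="int64", name="report_year")

# Casting both sides to a common dtype before comparing sidesteps any
# class/dtype strictness in the equality check.
pd.testing.assert_index_equal(
    actual.astype("int64"),
    expected.astype("int64"),
    exact=False,
)
```

If the mismatch only shows up for indexes read back from CSV, passing an explicit `dtype=` to `read_csv` for the index column is another way to attack it at the source.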
Okay @cmgosnell whenever you have a chance to look, this seems to work as well as it can given the current state of the `main` branch. The tests fail locally with Index mismatch errors that I think we've already talked about on the other long-running development branch, but the CI and other infrastructure stuff seems to work fine.