catalyst-cooperative / rmi-ferc1-eia

A collaboration with RMI to integrate FERC Form 1 and EIA CapEx and OpEx reporting
MIT License

Improve CI workflows, add autoformatters / linters #236

Closed zaneselvans closed 1 year ago

zaneselvans commented 2 years ago

Okay @cmgosnell whenever you have a chance to look, this seems to work as well as it can given the current state of the main branch. The tests fail locally with Index mismatch errors that I think we've already talked about on the other long-running development branch, but the CI and other infrastructure stuff seems to work fine.

zaneselvans commented 2 years ago

@katie-lamb I updated the CI and package requirements to only ever use Python 3.10 (and to use the new bot automerge PR workflows) so hopefully your 3.8/3.9 dependency resolution issues can be ignored.

review-notebook-app[bot] commented 2 years ago

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

zaneselvans commented 2 years ago

What's the max memory usage of the tests at this point? Is it close to 7GB? I imagine the runner uses some memory on its own just to exist and have an OS, and even if the tests themselves were a bit under 7GB, the sum of the overhead and the tests could still be too much.
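For reference, one quick way to eyeball peak memory from a local run is the standard library's `resource` module. This is only a sketch (and not necessarily how the numbers in this thread were measured); note that the units of `ru_maxrss` differ by platform:

```python
import resource
import sys

# Peak resident set size of this process so far (resource is a Unix-only module).
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
scale = 1 if sys.platform == "darwin" else 1024
print(f"peak memory: {peak * scale / 1e9:.2f} GB")
```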

katie-lamb commented 2 years ago

When I run it locally the max memory usage is 5.7 GB (when the plant parts list is generated in test_ppl_out). The fact that it gets through test_ppl_out and fails during test_ferc1_to_eia maybe means it's not cleaning up memory as quickly as when I run it locally? For example maybe the plant part list should be explicitly cleaned up after test_deprish_to_eia_out.
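One way to make that cleanup explicit is a module-scoped pytest fixture whose teardown drops the PPL and forces a garbage collection once the tests that need it have run. This is just a sketch: `build_plant_parts_list` is a hypothetical stand-in for however the plant parts list is actually constructed.

```python
import gc

import pandas as pd
import pytest


def build_plant_parts_list() -> pd.DataFrame:
    """Hypothetical stand-in for the real (memory-hungry) PPL construction."""
    return pd.DataFrame({"plant_id_eia": [1, 2], "capacity_mw": [100.0, 50.0]})


@pytest.fixture(scope="module")
def plant_parts_list():
    ppl = build_plant_parts_list()
    yield ppl
    # Teardown: drop the last reference and collect immediately, so the
    # memory is returned before later test modules start allocating.
    del ppl
    gc.collect()
```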

zaneselvans commented 2 years ago

Maybe this already came up, but I'm sure there's some overhead for the test runner itself beyond just our Python code, and I suspect that the 7GB limit probably covers everything that's running on the runner? Not just our tests?

katie-lamb commented 2 years ago

> Maybe this already came up, but I'm sure there's some overhead for the test runner itself beyond just our Python code, and I suspect that the 7GB limit probably covers everything that's running on the runner? Not just our tests?

Yep, it seems like there's at least 2GB of overhead for the test runner and the stored outputs.

katie-lamb commented 2 years ago

Current update: all the integration tests pass. The validation tests don't, but those failures will be fixed at some point (and aren't memory related!).

Summary of changes:

- Plant Parts List
- FERC1 to EIA connection

One alternative I considered was saving the PPL as a CSV or Parquet file so that specific columns can be read in. The only place where that's especially helpful is when connecting the training data to EIA true gran records. It's kind of nice to have all the outputs saved as pickled dataframes.
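For illustration, reading back only the needed columns looks something like this (a sketch with a made-up miniature PPL and made-up column names; Parquet via `to_parquet`/`read_parquet` with a `columns=` argument works the same way and preserves dtypes better than CSV):

```python
import io

import pandas as pd

# Miniature stand-in for the plant parts list (PPL).
ppl = pd.DataFrame(
    {
        "record_id_eia": ["a_1", "b_1"],
        "plant_name_eia": ["Alpha", "Beta"],
        "true_gran": [True, False],
    }
)

# Round-trip through CSV text in memory, standing in for a file on disk.
buf = io.StringIO(ppl.to_csv(index=False))

# Read back only the columns needed to connect training data to
# true-granularity records, instead of unpickling the whole table.
subset = pd.read_csv(buf, usecols=["record_id_eia", "true_gran"])
print(list(subset.columns))  # → ['record_id_eia', 'true_gran']
```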

FYI a lot of the little changes have nothing to do with memory issues and are changes from the bot-auto-merge branch being merged in.

zaneselvans commented 2 years ago

So glad to see just the real breakage!

@cmgosnell do you want to address the validation errors in this PR? Or do that separately?

@katie-lamb what all changes do you still need to make or get merged into dev in the PUDL for this PR to work?

katie-lamb commented 2 years ago

> @katie-lamb what all changes do you still need to make or get merged into dev in the PUDL for this PR to work?

@zaneselvans @cmgosnell I made a PUDL PR that handles the data type and cache clearing during the plant parts list creation. This should get merged into dev before this RMI PR.

codecov[bot] commented 1 year ago

Codecov Report

:exclamation: No coverage uploaded for pull request base (main@d9b708d). Patch has no changes to coverable lines.

Additional details and impacted files:

```diff
@@           Coverage Diff            @@
##             main     #236   +/-   ##
=======================================
  Coverage        ?   74.39%
=======================================
  Files           ?       10
  Lines           ?     1183
  Branches        ?        0
=======================================
  Hits            ?      880
  Misses          ?      303
  Partials        ?        0
```

:umbrella: View full report at Codecov.

katie-lamb commented 1 year ago

Update update update (for @cmgosnell):

The CI now passes wowww! Major changes:

I'm not entirely sure why one of the data validation tests isn't passing. It seems like a weird type error, because I'm pretty sure the expected and actual data are the same. Need to look into this.

katie-lamb commented 1 year ago

I'm confused about why the validation tests fail when checking whether the indexes are equal. It seems to be an issue with the type of the report_year index level (the actual index is an Index while the index read in from the CSV is an Int64Index). I set the argument exact=False and the assert still fails, even though that's supposed to ignore type differences.
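A toy reproduction of the symptom (made-up years, and assuming the CSV round-trip is what changes the level's dtype): `exact=False` relaxes the Index class/dtype check in `assert_index_equal`, but the element-wise value comparison still runs, so integer years never match string years. Casting both sides to a common dtype before comparing is one way out.

```python
import pandas as pd

# Index built in memory: integer years.
actual = pd.Index([2019, 2020], name="report_year")
# Index read back from CSV: same years, but as strings.
from_csv = pd.Index(["2019", "2020"], name="report_year")

mismatch = False
try:
    # exact=False skips the Index class/dtype check, but 2019 != "2019",
    # so the value comparison still raises.
    pd.testing.assert_index_equal(actual, from_csv, exact=False)
except AssertionError:
    mismatch = True

# Casting to a common dtype first makes the comparison pass.
pd.testing.assert_index_equal(actual, from_csv.astype("int64"), exact=False)
```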