catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Map new EIA plants and utilities with PUDL IDs for 2024Q1 update #3636

Closed cmgosnell closed 2 weeks ago

cmgosnell commented 2 weeks ago

Overview

Closes #3618 and closes #3617.

What problem does this address? We skipped mapping the new EIA plants and utilities, so the nightly builds failed.

What did you change? I followed our docs here. I regenerated the full DB and ran: `pytest test/integration/glue_test.py --live-dbs --save-unmapped-ids`

and then added the utilities and plants to the manually compiled glue sheets. There were a lot of new plants over 5 MW to map. Most of those were new solar plants that didn't have matches, but many of them did! The sheet was taking forever to filter, so I used this little notebook cell to quickly check plants:

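# Assumed import: `defs` is taken to be the Dagster Definitions object from pudl.etl.
from pudl.etl import defs

plants = defs.load_asset_value("out_eia__yearly_plants")
(
    plants[
        # na=False treats plants with a missing name as non-matches.
        plants.plant_name_eia.str.lower().str.contains("waco", na=False)
        & (plants.state == "TX")
    ]
    .drop_duplicates(subset=["plant_id_eia"])
    .sort_values(["latitude"])
    .set_index(["plant_id_pudl"])
)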
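
For anyone repeating this lookup for many plants, the same filter can be wrapped in a small helper. This is only a sketch: `find_plants` and the `toy_plants` rows are hypothetical and not part of PUDL, and the only column names assumed are the ones used in the cell above.

```python
import pandas as pd


def find_plants(plants: pd.DataFrame, name_fragment: str, state: str) -> pd.DataFrame:
    """Return one row per EIA plant in `state` whose name contains `name_fragment`."""
    # Case-insensitive literal substring match; missing names count as non-matches.
    name_matches = plants["plant_name_eia"].str.lower().str.contains(
        name_fragment.lower(), na=False, regex=False
    )
    return (
        plants[name_matches & (plants["state"] == state)]
        .drop_duplicates(subset=["plant_id_eia"])
        .sort_values("latitude")
        .set_index("plant_id_pudl")
    )


# Made-up rows for illustration only; these are not real PUDL records.
toy_plants = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2, 3, 4],
        "plant_id_pudl": [10, 10, 20, 30, 40],
        "plant_name_eia": ["Waco Solar", "Waco Solar", "Lake Waco Hydro", None, "Waco Wind"],
        "state": ["TX", "TX", "TX", "TX", "OK"],
        "latitude": [31.6, 31.6, 31.5, 30.0, 35.0],
    }
)
matches = find_plants(toy_plants, "waco", "TX")
```

Sorting by latitude puts plants that sit close together in adjacent rows, which makes likely duplicates and neighbouring sites easier to spot when deciding on a PUDL ID.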

Testing

How did you make sure this worked? How can a reviewer verify this?

# To-do list
- [ ] If updating analyses or data processing functions: make sure to update or write data validation tests (e.g., `test_minmax_rows()`)
- [ ] Update the [release notes](../docs/release_notes.rst): reference the PR and related issues.
- [ ] Ensure docs build, unit & integration tests, and test coverage pass locally with `make pytest-coverage` (otherwise the merge queue may reject your PR)
- [ ] Review the PR yourself and call out any questions or issues you have
- [ ] For minor ETL changes or data additions, once `make pytest-coverage` passes, make sure you have a fresh full PUDL DB downloaded locally, materialize new/changed assets and all their downstream assets and [run relevant data validation tests](https://catalystcoop-pudl.readthedocs.io/en/latest/dev/testing.html#data-validation) using `pytest` and `--live-dbs`.
- [ ] For significant ETL, data coverage or analysis changes, once `make pytest-coverage` passes, ensure the full ETL runs locally and [run data validation tests](https://catalystcoop-pudl.readthedocs.io/en/latest/dev/testing.html#data-validation) using `make pytest-validate` (a ~10 hour run). If you can't run this locally, run the `build-deploy-pudl` GitHub Action (or ask someone with permissions to). Then, check the logs on the `#pudl-deployments` Slack channel or `gs://builds.catalyst.coop`.

aesharpe commented 2 weeks ago

Why did the contents of pudl_id_mapping.xlsx go down?

cmgosnell commented 2 weeks ago

> Why did the contents of pudl_id_mapping.xlsx go down?

@aesharpe the size probably decreased because I saved it using LibreOffice, not Excel. There certainly are more lines in there now. I also removed some of the highlighting that was in there.

zaneselvans commented 2 weeks ago

The file shrinkage does seem a little weird, but internally Excel (and Word, and PowerPoint...) files contain ZIP-compressed XML documents, so my guess would be that either the new spreadsheet was more compressible, or LibreOffice had a higher compression setting, or the removal of the highlighting removed a bunch of markup and made the document smaller. A 5% size reduction doesn't seem too wacked.
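
One way to check these guesses is to open the old and new workbooks as ZIP archives and compare the per-member sizes. The sketch below uses only Python's standard `zipfile` module. `build_archive` and `member_sizes` are hypothetical helpers, and the payload is synthetic XML-like text rather than a real workbook, so it only illustrates how the DEFLATE level alone can change the archive size for identical content.

```python
import io
import zipfile


def build_archive(payload: bytes, compresslevel: int) -> bytes:
    """Build a one-member ZIP archive in memory at the given DEFLATE level."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(
        buffer, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=compresslevel
    ) as archive:
        # The member name mimics an .xlsx layout; the payload is not a valid sheet.
        archive.writestr("xl/worksheets/sheet1.xml", payload)
    return buffer.getvalue()


def member_sizes(path_or_file) -> dict:
    """Map each archive member to its (uncompressed, compressed) size in bytes."""
    with zipfile.ZipFile(path_or_file) as archive:
        return {
            info.filename: (info.file_size, info.compress_size)
            for info in archive.infolist()
        }


# Synthetic cell markup for illustration only.
payload = b"".join(
    b'<c r="A%d"><v>%d</v></c>' % (row, (row * 7919) % 100003)
    for row in range(5000)
)
fast = build_archive(payload, compresslevel=1)
best = build_archive(payload, compresslevel=9)
sizes = member_sizes(io.BytesIO(best))
```

Calling `member_sizes` on the two versions of the real spreadsheet would show whether the uncompressed XML shrank (less markup) or only the compressed size did (different compression settings).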