catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Clean up PUDL CLI tools; use Click framework #3107

Closed zaneselvans closed 10 months ago

zaneselvans commented 10 months ago

I've been messing around with the Click CLI framework and it's pretty great. It's much more ergonomic and compact, and we already depend on it indirectly, so I thought I would (for fun) migrate our argparse-based CLI modules and do a little related cleanup because it feels good.
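To give a feel for what a migrated command might look like, here is a minimal sketch. The command name `example_etl`, its argument, and its options are hypothetical stand-ins and not the actual `pudl_etl` interface; only the Click calls themselves (`click.command`, `click.argument`, `click.option`, `click.Choice`, `CliRunner`) are real.

```python
import click
from click.testing import CliRunner


# Hypothetical command: the names and options are illustrative only and
# are not taken from the real PUDL CLI modules.
@click.command(context_settings={"help_option_names": ["-h", "--help"]})
@click.argument("settings_file", type=click.Path())
@click.option(
    "--loglevel",
    type=click.Choice(["DEBUG", "INFO", "WARNING", "ERROR"]),
    default="INFO",
    show_default=True,
    help="Verbosity of log messages.",
)
@click.option(
    "--clobber/--no-clobber",
    default=False,
    help="Overwrite existing outputs.",
)
def example_etl(settings_file: str, loglevel: str, clobber: bool) -> None:
    """Run a pretend ETL using SETTINGS_FILE."""
    click.echo(f"settings={settings_file} loglevel={loglevel} clobber={clobber}")


# CliRunner invokes the command in-process, which is also a convenient
# way to write integration tests for a CLI.
runner = CliRunner()
result = runner.invoke(
    example_etl, ["etl_fast.yml", "--loglevel", "DEBUG", "--clobber"]
)
```

The decorators take the place of the separate parser-building function and the manual unpacking of parsed arguments that an argparse module typically needs, and the docstring doubles as the `--help` text.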

Some nice things about Click:

- [x] Migrate `pudl_etl`
- [x] Migrate `ferc_to_sqlite`
- [x] Migrate `pudl_check_fks`
- [x] Migrate `metadata_to_rst`
- [x] Retire `epacems_to_parquet` (leave it to dagster)
- [x] Retire `state_demand` (analysis converted into [an example notebook](https://www.kaggle.com/code/catalystcooperative/02-state-hourly-electricity-demand) on Kaggle)
- [x] Migrate `pudl_datastore`
- [x] Update `pudl-examples` README to refer to Kaggle for running the notebooks.
- [x] Retire `pudl_setup`. See #2293
- [x] Update docs to reflect any changed or retired CLIs
- [x] Attempt a branch deployment to ensure removal of `pudl_setup` hasn't broken anything
- [x] Figure out why `pudl_territories` is sad and either fix & migrate, or retire it. See #1174
- [x] Add integration test for `pudl_service_territories` CLI
- [x] Add integration test for `pudl_datastore` CLI
- [x] Use smaller row-groups in `pudl_service_territories` GeoParquet outputs

Follow-on tasks were moved to #3121

zaneselvans commented 10 months ago

Note that while the `pudl_service_territories` script now has no problem writing the GeoParquet output files, three of the outputs (the undissolved utility and balancing authority outputs, and the dissolved utility outputs) contain too much data in the geometry column for Arrow to handle (more than 2GB). There's also no facility built into GeoPandas for creating row-groups based on a column or a set of columns. See this geopandas issue.

Ideally I think we would want to partition the undissolved outputs by `report_date` and `utility_id_eia` / `balancing_authority_id_eia`, and the dissolved outputs by `report_date`.

Edit: I was able to fix this by setting `row_group_size=512`, which is very small. But these tables don't have that many rows; it's the geometry column that has huge data structures in it. It seems like the Parquet format is calculating useful row-group stats without our needing to specifically designate the columns.