Closed zaneselvans closed 10 months ago
Note that while the `pudl_service_territories` script now has no problem writing the GeoParquet output files, the undissolved utility and balancing authority outputs and the dissolved utility outputs contain too much data in the geometry column for Arrow to handle (more than 2GB), and there's no facility built into GeoPandas for creating row groups based on a column or a set of columns. See this geopandas issue.
Ideally I think we would want to partition the undissolved outputs by `report_date` and `utility_id_eia` / `balancing_authority_id_eia`, and the dissolved outputs by `report_date`.
Edit: I was able to fix this by setting `row_group_size=512`, which is very small. But these tables don't have that many rows, and the geometry column has huge data structures in it. It seems like the Parquet format is calculating useful row-group stats without our needing to specifically designate the columns.
I've been messing around with the Click CLI framework and it's pretty great. It's much more ergonomic and compact than argparse, and we already depend on it indirectly, so I thought I would (for fun) migrate our argparse-based CLI modules and do a little related cleanup because it feels good.
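To give a rough idea of what a migrated module looks like, here is a small sketch of a Click command. The command name, argument, and options are hypothetical and not taken from the actual PUDL CLIs; it only uses standard Click decorators and the `CliRunner` test helper.

```python
import click
from click.testing import CliRunner


@click.command()
@click.argument("settings_file", type=click.Path())
@click.option(
    "--loglevel",
    type=click.Choice(["DEBUG", "INFO", "WARNING"]),
    default="INFO",
    show_default=True,
    help="Logging verbosity.",
)
@click.option(
    "--clobber/--no-clobber",
    default=False,
    help="Overwrite existing outputs.",
)
def example_cli(settings_file: str, loglevel: str, clobber: bool) -> None:
    """Hypothetical command: echo the parsed arguments."""
    click.echo(f"settings={settings_file} loglevel={loglevel} clobber={clobber}")


# CliRunner invokes the command in-process, which makes CLI tests cheap.
runner = CliRunner()
result = runner.invoke(example_cli, ["etl.yml", "--clobber", "--loglevel", "DEBUG"])
print(result.output)
```

The decorators replace the separate parser-construction and argument-unpacking steps that an argparse module needs, and the parsed values arrive as ordinary function parameters.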
Some nice things about Click:
pudl
Follow-on tasks were moved to #3121