catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Convert `pu_eia860` to SQL view asset #2339

Closed bendnorman closed 1 year ago

bendnorman commented 1 year ago

New scope: This issue now refers to the transition of assets previously moved into Dagster using Python wrappers into using SQL Views. The task of adding this asset and others to the database will be addressed in #2490 and the epic #1973. The new scope of this issue only refers to the transition from Python wrappers to SQL View where possible.

We can use this task list as a template for each table conversion:

- [ ] Convert to SQL view
- [ ] Compare `PudlTabl.pu_eia860()` with the new SQL view to make sure the computation produces the same output.
- [ ] Replace the `pu_eia860` code in the `PudlTabl` method with a call to the db that reads the new view as a dataframe. We could either just use sqlalchemy and pandas or we might be able to use the existing code in the `pudl_sqlite_io_manager`.
zaneselvans commented 1 year ago

For the denormalized tables it seems like we probably want to bring in all the attributes from plants_entity_eia and utilites_entity_eia too. I was messing around this this a little bit in DBeaver over the weekend. Does it look something like:

SELECT *
FROM (
    plants_entity_eia
    JOIN plants_eia860
    ON plants_entity_eia.plant_id_eia = plants_eia860.plant_id_eia
) plants
JOIN (
    utilities_entity_eia
    JOIN utilities_eia860
    ON utilities_entity_eia.utility_id_eia = utilities_eia860.utility_id_eia
) utils
ON utils.utility_id_eia = plants.utility_id_eia
AND utils.report_date = plants.report_date
ORDER BY plant_id_eia, report_date;

(though I think this results in duplicate columns for the merge keys, and there might be collisions between location column names for the plants and the utilities, like state)

jdangerx commented 1 year ago

We could either just use sqlalchemy and pandas or we might be able to use the existing code in the pudl_sqlite_io_manager.

Good thought! I think it probably makes sense to do a couple of these ad-hoc and then have a ticket for extracting that code into pudl_sqlite_io_manager once we figure out what we need that code to do.

zaneselvans commented 1 year ago

I guess it doesn't make sense to duplicate the "Grab a DB table for me code" both in the IO Manager and in the PudlTabl object.

We'll need some logic (on some tables) in PudlTabl that grabs the right aggregated table based PudlTabl.freq (raw, monthly, annual)

e-belfer commented 1 year ago

This is a blocker for #2433 so I've begun to work on it in #2490, but I'm approaching it using Python for now instead of SQL. I've updated this issue to only refer to the SQL View conversion portion.

bendnorman commented 1 year ago

This issue is being tracked in issue #2433 and PR #2490.