catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars, 106 forks

Integrate fuel proportions into FERC Plant ID assignment #266

Closed by zaneselvans 5 years ago

zaneselvans commented 5 years ago

FERC Plant ID assignment (#144) could be greatly improved by adding the relative proportions (and potentially the absolute amounts) of fuel heat content, and possibly fuel costs, to the set of features used to link plant records together. Now that there is an easy way to generate those proportions on a per-plant-year basis, they should be integrated into the ID generation.
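For illustration, here is a minimal sketch of what per-plant-year fuel heat proportions could look like. The column names (`plant_name`, `report_year`, `fuel_type`, `fuel_mmbtu`), the sample values, and the helper `fuel_heat_fractions()` are all hypothetical and are not taken from the PUDL code base.

```python
import pandas as pd

# Hypothetical fuel records: one row per plant, year, and fuel type.
fuel = pd.DataFrame({
    "plant_name": ["alpha", "alpha", "beta", "beta"],
    "report_year": [2015, 2015, 2015, 2015],
    "fuel_type": ["coal", "gas", "gas", "oil"],
    "fuel_mmbtu": [900.0, 100.0, 300.0, 100.0],
})


def fuel_heat_fractions(fuel_df):
    """Return one row per plant-year with a heat-content fraction per fuel.

    Illustrative helper only; not the project's fuel_by_plant().
    """
    keys = ["plant_name", "report_year"]
    # Sum heat content by plant-year and fuel, then spread fuels into columns.
    wide = fuel_df.pivot_table(
        index=keys,
        columns="fuel_type",
        values="fuel_mmbtu",
        aggfunc="sum",
        fill_value=0.0,
    )
    # Divide each row by its total so the fractions in a row sum to 1.
    fractions = wide.div(wide.sum(axis=1), axis=0)
    return fractions.add_suffix("_fraction_mmbtu").reset_index()


fractions = fuel_heat_fractions(fuel)
```

A plant-year whose total heat content is zero would come out as NaN in this sketch, so such rows would need separate handling before being used as features.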

zaneselvans commented 5 years ago

This can be done in the pudl.ferc1.transform process by merging the dataframe created by fuel_by_plant() with the steam_ferc1 dataframe before it goes into the FERCPlantClassifier, and by modifying the FERCPlantClassifier to pay attention to the fuel proportions as well. However, looking into this has brought up some questions for me:

Initial fiddling with the setup didn't seem to improve, or even noticeably change, the ID generation process, so I'm wondering if I'm doing something stupid/wrong here. The assigned IDs still have the same issue: about 1200-1300 plant records (10% of all the plant records) are left out and assigned orphan IDs, even though they appear to be part of very well-defined plant time series when I inspect them by hand. So this should be addressed in conjunction with #221 and #144.
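One thing that might be worth ruling out is whether the merge described above actually attaches fuel features to most plant records. If the merge keys rarely match, the new columns would be mostly NaN and the classifier output would barely change. The sketch below uses small stand-in dataframes. The column names, the merge keys, and the values are assumptions for illustration, not the real steam_ferc1 or fuel_by_plant() schemas.

```python
import pandas as pd

# Hypothetical stand-in for the steam_ferc1 plant records.
steam = pd.DataFrame({
    "utility_id": [1, 1, 2],
    "plant_name": ["alpha", "alpha", "beta"],
    "report_year": [2014, 2015, 2015],
    "capacity_mw": [500.0, 500.0, 120.0],
})

# Hypothetical stand-in for the per-plant-year output of fuel_by_plant().
fuel_fractions = pd.DataFrame({
    "utility_id": [1, 2],
    "plant_name": ["alpha", "beta"],
    "report_year": [2015, 2015],
    "coal_fraction_mmbtu": [0.9, 0.0],
    "gas_fraction_mmbtu": [0.1, 1.0],
})

# Assumed merge keys; the real keys may differ.
keys = ["utility_id", "plant_name", "report_year"]

# A left merge keeps every plant record; indicator=True adds a "_merge"
# column recording whether each record found a fuel row.
merged = steam.merge(fuel_fractions, on=keys, how="left", indicator=True)

# Share of plant records that picked up fuel features.
matched_share = (merged["_merge"] == "both").mean()

# Plant records with no fuel data; their fuel columns are NaN.
unmatched = merged.loc[merged["_merge"] == "left_only", keys]

print(f"{matched_share:.0%} of plant records have fuel features")
print(unmatched)
```

A low matched share would point at the merge rather than the classifier. A high one would suggest the fuel columns are present but are not being weighted enough, or used at all, in the record comparison.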