catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Address fuel_ferc1 record_id duplicates #267

Closed zaneselvans closed 5 years ago

zaneselvans commented 5 years ago

The fuel_ferc1 table contains duplicate records, which differ only by their assigned plant_id_pudl. This should never happen.

Trace the transformation process for this table and figure out where the duplicates are being introduced.

See if it's related to the cosine similarity ties that are messing up #144.

Once it's fixed remove the drop_duplicates() calls that are being made in the fuel_by_plant_ferc1() summary generation.

zaneselvans commented 5 years ago

The fuel_ferc1 duplicates are being introduced by the merge with plants_steam_ferc1 at the end of the pudl.transform.ferc1.fuel() where the plant_id_ferc1 values are merged in from the transformed steam plants dataframe... which contains the duplicates. However, it appears that these are real duplicates not errors resulting from a previous merge, in that there are entries in the original f1_steam table in the FERC database that have the same respondent_id, plant_name, and report_year but which refer to different FERC "plants" For example: respondent_id=3 and plant_name="Lemon Creek" gives two plants for every year, one of which has plant_kind=GT and the other plant_kind=IC

What does this mean? Is there really no way for us to connect the steam and fuel tables together correctly? That seems bonkers. But then again this is FERC.

zaneselvans commented 5 years ago

What to do:

Clearly we could aggregate on the basis of report_year, plant_name, and respondent_id but that would mean losing a lot of categorical information in the steam table. The FERC Plant ID assignment appears to be doing a good job of picking out which records with the same plant name and respondent ID belong to different "plants" on the basis of other values.

The record_id values are still distinct because the plant data is being reported on different lines in the FERC Form 1, so we can still check for "real" duplicates (i.e. any case where the same record_id shows up more than once in a given table). But that kind of duplicate will appear when we merge together the fuel and steam tables as we are now, on matching but non-unique "key" fields.

There does not appear to be sufficient information in the fuel table to identify which plant records the fuel records pertain to when there are duplicates, but we can figure out which plants have this problem, and treat them differently... maybe we just aggregate all of the fuel costs and consumption across all of the different non-unique plants and bring that across? @cmgosnell what do you think?

zaneselvans commented 5 years ago

It turns out that the fuel_ferc1 duplicates were getting introduced by the merge that was meant to bring in the plant_id_ferc1 values from the post-transform plants_steam_ferc1 table. It happens because there's not a unique mapping between the fuel and steam tables on plant_name report_year and utility_id_ferc1 For now I have removed the plant_id_ferc1 column (and the merge that would create it) from the fuel table, since there's no way to populate the whole thing uniquely. Another option would be to go ahead and merge it in but only on the records for which there's a unique mapping (which is about 90% of the records). However, given that it seems like we're going to have to combine the fuel and the steam tables in a more aggregated way generally, I'm not sure that would be very useful, and it might be confusing, since 10% of the records would have no plant id (even though in reality... the fuel use is associated with some plant id)