Closed zaneselvans closed 5 years ago
The fuel_ferc1
duplicates are being introduced by the merge with plants_steam_ferc1
at the end of the pudl.transform.ferc1.fuel()
where the plant_id_ferc1
values are merged in from the transformed steam plants dataframe... which contains the duplicates. However, it appears that these are real duplicates not errors resulting from a previous merge, in that there are entries in the original f1_steam
table in the FERC database that have the same respondent_id
, plant_name
, and report_year
but which refer to different FERC "plants" For example: respondent_id=3
and plant_name="Lemon Creek"
gives two plants for every year, one of which has plant_kind=GT
and the other plant_kind=IC
What does this mean? Is there really no way for us to connect the steam
and fuel
tables together correctly? That seems bonkers. But then again this is FERC.
What to do:
Clearly we could aggregate on the basis of report_year
, plant_name
, and respondent_id
but that would mean losing a lot of categorical information in the steam table. The FERC Plant ID assignment appears to be doing a good job of picking out which records with the same plant name and respondent ID belong to different "plants" on the basis of other values.
The record_id
values are still distinct because the plant data is being reported on different lines in the FERC Form 1, so we can still check for "real" duplicates (i.e. any case where the same record_id
shows up more than once in a given table). But that kind of duplicate will appear when we merge together the fuel and steam tables as we are now, on matching but non-unique "key" fields.
There does not appear to be sufficient information in the fuel table to identify which plant records the fuel records pertain to when there are duplicates, but we can figure out which plants have this problem, and treat them differently... maybe we just aggregate all of the fuel costs and consumption across all of the different non-unique plants and bring that across? @cmgosnell what do you think?
It turns out that the fuel_ferc1
duplicates were getting introduced by the merge that was meant to bring in the plant_id_ferc1
values from the post-transform plants_steam_ferc1
table. It happens because there's not a unique mapping between the fuel and steam tables on plant_name
report_year
and utility_id_ferc1
For now I have removed the plant_id_ferc1
column (and the merge that would create it) from the fuel table, since there's no way to populate the whole thing uniquely. Another option would be to go ahead and merge it in but only on the records for which there's a unique mapping (which is about 90% of the records). However, given that it seems like we're going to have to combine the fuel and the steam tables in a more aggregated way generally, I'm not sure that would be very useful, and it might be confusing, since 10% of the records would have no plant id (even though in reality... the fuel use is associated with some plant id)
The
fuel_ferc1
table contains duplicate records, which differ only by their assignedplant_id_pudl
. This should never happen.Trace the transformation process for this table and figure out where the duplicates are being introduced.
See if it's related to the cosine similarity ties that are messing up #144.
Once it's fixed remove the
drop_duplicates()
calls that are being made in thefuel_by_plant_ferc1()
summary generation.