cmgosnell closed this 4 years ago
Turns out this was not a problem at all (or previous changes to the record id generator fixed it). When I generated the df with `eia_known` (the training data ids with the master unit list data), I was getting a handful of duplicate records. This is because some training records become duplicates once they are relabeled to their corresponding true plant part in the master unit list.
Right now, I've generated what should be unique record ids for FERC, EIA and RMI's test data. The EIA records have ~500 non-unique records. The EIA ids are generated in `plant_parts_agg_eia.add_record_id()` with the following columns: `plant_id_eia`, `_[list of id_cols from plant_parts]`, `year`, `plant_part`, `ownership`, `utility_id_eia`. For some reason I'm getting multiple versions of the same record with different `fraction_owned`. I think this means there is some error in the way I set up `slice_by_ownership`.
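A quick way to see which record ids are affected is to group rows by their generated id and look for ids that carry more than one `fraction_owned` value. The sketch below is only an illustration under stated assumptions: the sample rows are made up, `make_record_id` and `find_conflicting_ids` are hypothetical helpers (not the actual `add_record_id()` implementation), and the plant-part-specific id columns are left out for brevity.

```python
from collections import defaultdict

# Columns joined into the record id. Assumption: the plant-part-specific
# id_cols mentioned in the issue are omitted here to keep the sketch short.
ID_COLS = ["plant_id_eia", "year", "plant_part", "ownership", "utility_id_eia"]

# Made-up sample rows; only the column names come from the issue.
rows = [
    {"plant_id_eia": 1, "year": 2018, "plant_part": "plant_gen",
     "ownership": "owned", "utility_id_eia": 10, "fraction_owned": 0.5},
    {"plant_id_eia": 1, "year": 2018, "plant_part": "plant_gen",
     "ownership": "owned", "utility_id_eia": 10, "fraction_owned": 1.0},
    {"plant_id_eia": 2, "year": 2018, "plant_part": "plant_gen",
     "ownership": "owned", "utility_id_eia": 10, "fraction_owned": 1.0},
]


def make_record_id(row):
    """Hypothetical stand-in for the id generator: join the id columns with '_'."""
    return "_".join(str(row[col]) for col in ID_COLS)


def find_conflicting_ids(rows):
    """Return record ids that appear with more than one fraction_owned value."""
    fractions_by_id = defaultdict(set)
    for row in rows:
        fractions_by_id[make_record_id(row)].add(row["fraction_owned"])
    return {
        record_id: sorted(fractions)
        for record_id, fractions in fractions_by_id.items()
        if len(fractions) > 1
    }


conflicts = find_conflicting_ids(rows)
print(conflicts)
```

On a real DataFrame, `df[df.duplicated(subset="record_id", keep=False)]` would pull out the same rows, assuming the id is stored in a column named `record_id`.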