catalyst-cooperative / rmi-ferc1-eia

A collaboration with RMI to integrate FERC Form 1 and EIA CapEx and OpEx reporting
MIT License

Ensure unique EIA record ID generation #20

cmgosnell closed 4 years ago

cmgosnell commented 4 years ago

Right now, I've generated what should be unique record IDs for FERC, EIA, and RMI's test data. About 500 of the EIA records have non-unique IDs. The EIA IDs are generated in plant_parts_agg_eia.add_record_id() from the following columns: plant_id_eia, [list of id_cols from plant_parts], year, plant_part, ownership, utility_id_eia. For some reason I'm getting multiple versions of the same record with different fraction_owned values. I think this means there is an error in the way I set up slice_by_ownership.
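For anyone wanting to reproduce the check, here is a minimal sketch of how the non-unique IDs could be surfaced with pandas. It is not the actual add_record_id() implementation: the helper names, the underscore separator, the generator_id column (standing in for the id_cols), and the sample rows are all assumptions made for illustration.

```python
import pandas as pd


def add_record_id_sketch(df, id_cols):
    """Build a hypothetical record_id_eia by joining the ID columns with underscores."""
    out = df.copy()
    out["record_id_eia"] = out[id_cols].astype(str).apply("_".join, axis=1)
    return out


def find_duplicate_ids(df, id_col="record_id_eia"):
    """Return every row whose ID appears more than once, grouped by ID."""
    return df[df.duplicated(subset=[id_col], keep=False)].sort_values(id_col)


# Made-up sample data: the first two rows differ only in fraction_owned.
plant_parts = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2],
        "generator_id": ["a", "a", "b"],  # assumed stand-in for the id_cols
        "year": [2018, 2018, 2018],
        "plant_part": ["plant_gen", "plant_gen", "plant_gen"],
        "ownership": ["owned", "owned", "total"],
        "utility_id_eia": [10, 10, 20],
        "fraction_owned": [0.5, 1.0, 1.0],
    }
)

id_cols = [
    "plant_id_eia",
    "generator_id",
    "year",
    "plant_part",
    "ownership",
    "utility_id_eia",
]
with_ids = add_record_id_sketch(plant_parts, id_cols)
dupes = find_duplicate_ids(with_ids)
print(dupes[["record_id_eia", "fraction_owned"]])
```

Keeping every copy of a duplicated ID (keep=False) makes it easy to see which non-ID columns, such as fraction_owned, differ between rows that share an ID.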

cmgosnell commented 4 years ago

It turns out this was not a problem at all (or previous changes to the record ID generator fixed it). When I generated the df with eia_known (the training data IDs with the master unit list data), I was getting a handful of duplicate records. This is because some training data records become duplicates of each other once they are relabeled to their corresponding true plant part in the master unit list.
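A rough illustration of how relabeling can produce these duplicates, assuming a simple lookup from each labeled ID to its true plant-part ID. The true_record_id_eia column, the sample IDs, and the merge itself are hypothetical and not taken from the repo.

```python
import pandas as pd

# Made-up training data: each FERC record is labeled with an EIA record ID.
training = pd.DataFrame(
    {
        "record_id_ferc": ["f1", "f2", "f3"],
        "record_id_eia": [
            "1_2018_plant_owned_10",
            "1_a_2018_plant_gen_owned_10",
            "2_b_2018_plant_gen_total_20",
        ],
    }
)

# Hypothetical slice of the master unit list: maps each record to its "true"
# plant part. Here the single-generator plant and its generator are the same
# physical thing, so both map to the plant-level ID.
master_unit_list = pd.DataFrame(
    {
        "record_id_eia": [
            "1_2018_plant_owned_10",
            "1_a_2018_plant_gen_owned_10",
            "2_b_2018_plant_gen_total_20",
        ],
        "true_record_id_eia": [
            "1_2018_plant_owned_10",
            "1_2018_plant_owned_10",
            "2_b_2018_plant_gen_total_20",
        ],
    }
)

# validate="m:1" raises if the lookup side itself contains repeated IDs.
eia_known = training.merge(
    master_unit_list, on="record_id_eia", how="left", validate="m:1"
)

# The labeled IDs are unique, but the relabeled IDs are not.
dupe_mask = eia_known.duplicated(subset=["true_record_id_eia"], keep=False)
print(eia_known[dupe_mask])
```

In this toy case the ID generator is behaving correctly. The repeats only appear after relabeling, which matches the explanation above.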