catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

plants_utils_eia860 introduces NA values which get dropped #1700

Open zaneselvans opened 2 years ago

zaneselvans commented 2 years ago

In some cases, plants or utilities have missing attributes in their entity tables. For example, there are several thousand plants that have no state. This can cause problems when we construct the denormalized output tables, and can result in data rows getting dropped, maybe unnecessarily.

For example, in fuel_receipts_costs_eia923(), after merging in the results of plants_utils_eia860 there are 11,040 records that lack a utility_id_eia, most of which also happen to lack a state value. These records are dropped to ensure that the output table is usable and has all the IDs that downstream data products expect. As a result, 11,040 FRC records with fuel delivery data are missing from the output table, even though they do have the plant, date, and fuel type information that's more fundamental to this table.
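A minimal sketch of the failure mode described above, using toy data rather than the real PUDL tables. The column values and the three-row frames are made up for illustration, and this is not the actual body of fuel_receipts_costs_eia923():

```python
import pandas as pd

# Toy FRC-like records: every row has plant, date, and fuel type.
frc = pd.DataFrame({
    "plant_id_eia": [1, 2, 3],
    "report_date": pd.to_datetime(["2020-01-01"] * 3),
    "fuel_type_code": ["NG", "BIT", "NG"],
    "fuel_received_units": [100.0, 250.0, 75.0],
})

# Toy plant/utility association: plant 3 has no entry at all.
plants_utils = pd.DataFrame({
    "plant_id_eia": [1, 2],
    "utility_id_eia": pd.array([10, 20], dtype="Int64"),
    "state": ["CO", "TX"],
})

# The left merge keeps all FRC rows but introduces NA for plant 3.
merged = frc.merge(plants_utils, on="plant_id_eia", how="left")

# Dropping rows without a utility ID then discards a valid delivery record.
dropped = merged.dropna(subset=["utility_id_eia"])
n_lost = len(merged) - len(dropped)
print(f"{n_lost} record(s) lost to the dropna")
```

Here the lost row still has a plant ID, date, and fuel type; only the merged-in utility attributes are missing.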

In creating the database views which replace the output tables, we should be more careful with these kinds of merges, and ensure that we aren't introducing null values we don't need to introduce, and are keeping as many of these data records as we can.
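One way to be more careful with these merges, sketched in pandas under assumptions: the helper name `merge_keep_records` and the `entity_match` flag column are hypothetical, and the choice of which columns count as required is left to the caller. The idea is to only drop rows that are missing the columns fundamental to the table, and to flag rather than drop rows whose entity attributes didn't match:

```python
import pandas as pd

def merge_keep_records(left, right, on, required_cols):
    """Left-merge entity attributes without dropping unmatched records.

    Hypothetical helper: rows are only dropped if they lack one of
    ``required_cols``; unmatched rows are kept and flagged instead.
    """
    out = left.merge(
        right,
        on=on,
        how="left",
        validate="m:1",   # raise if the right side would duplicate left rows
        indicator=True,   # adds a "_merge" column: "both" or "left_only"
    )
    out["entity_match"] = out.pop("_merge") == "both"
    return out.dropna(subset=required_cols)

frc = pd.DataFrame({
    "plant_id_eia": [1, 2, 3],
    "report_date": pd.to_datetime(["2020-01-01"] * 3),
    "fuel_type_code": ["NG", "BIT", "NG"],
})
plants_utils = pd.DataFrame({
    "plant_id_eia": [1, 2],
    "utility_id_eia": pd.array([10, 20], dtype="Int64"),
    "state": ["CO", "TX"],
})

out = merge_keep_records(
    frc,
    plants_utils,
    on="plant_id_eia",
    required_cols=["plant_id_eia", "report_date", "fuel_type_code"],
)
```

Downstream users who do need utility_id_eia can filter on the flag themselves, while the table keeps every record that has the core plant, date, and fuel type fields.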

This caused problems and confusion in issue #1343.

katie-lamb commented 2 years ago

Is it possible to try to fill in the state values? I've seen this problem crop up a few times where date_merge (or a normal merge) is performed on the output tables and the subsequent dropna removes a lot of previously valid records. It seems like we have a few of these imputation problems that we could work on in the future.
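A rough sketch of what one such fill could look like, assuming (and this is only an assumption) that a plant's state is reported in some years but missing in others, so it can be propagated within each plant. The toy frame and the `state_filled` column name are made up for illustration:

```python
import pandas as pd

# Toy annual plant records: state is missing in some years.
plants = pd.DataFrame({
    "plant_id_eia": [1, 1, 2, 2, 3],
    "report_date": pd.to_datetime(
        ["2019-01-01", "2020-01-01", "2019-01-01", "2020-01-01", "2020-01-01"]
    ),
    "state": ["CO", None, None, "TX", None],
})

# Within each plant, carry known states forward and then backward in time.
plants = plants.sort_values(["plant_id_eia", "report_date"])
plants["state_filled"] = plants.groupby("plant_id_eia")["state"].transform(
    lambda s: s.ffill().bfill()
)
```

Plant 3 never reports a state, so it stays null and would need some other source. This also doesn't handle a plant that reports conflicting states in different years.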