catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Clean data within fuel_ferc1 table #58

zaneselvans closed 5 years ago

zaneselvans commented 7 years ago

Now that the data is importing into the PUDL DB, with at least the right types and structure, we need to start looking at whether the values are meaningful, and either fix the things that are broken in an obvious way or drop the data that's bad.

This data cleaning should probably be done in a step before the final import into the PUDL DB happens, while the data is still in a DataFrame.

To get started, to figure out how this can all be structured, and to learn how to visualize the data so we can spot weirdness we need to fix or discard, let's just look for one kind of mistake: FERC respondents using the wrong units in their reports, like reporting kWh instead of MWh.

The easiest way to catch this kind of mistake isn't to look at the absolute numbers being reported (e.g. MWh generated or MW installed, or total cost of fuel) but to look at ratios of numbers -- either reported or calculated by us -- to see whether they're waaaaay off.

This would be quantities like:

We should set some threshold of difference beyond which a record is flagged as suspicious (e.g. 100x larger or smaller than the median value) and then go look at some of those individual records to see if we can figure out which values don't make sense. Plotting histograms or scatter plots of these values should make it easy to see if there are weird little outlier populations -- each one of those populations might have its own way of getting fixed, depending on which values result in the screw-up.
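As a rough sketch of the flagging step described above (not the actual PUDL code), something like the following would mark records whose ratio is more than 100x above or below the median. The function name `flag_ratio_outliers`, the column names `fuel_cost` and `fuel_qty`, and the sample numbers are all made up for illustration.

```python
import numpy as np
import pandas as pd


def flag_ratio_outliers(df, numerator, denominator, factor=100.0):
    """Flag rows whose numerator/denominator ratio is far from the median ratio."""
    out = df.copy()
    # Treat a zero denominator as missing so we never divide by zero.
    ratio = out[numerator] / out[denominator].replace(0, np.nan)
    median = ratio.median()  # missing values are skipped by default
    out["ratio"] = ratio
    # Size of each ratio relative to the median ratio.
    out["ratio_vs_median"] = ratio / median
    # Suspicious if more than `factor` times larger or smaller than the median.
    out["suspicious"] = (out["ratio_vs_median"] > factor) | (
        out["ratio_vs_median"] < 1.0 / factor
    )
    return out


# Hypothetical records: total fuel cost and quantity of fuel burned.
fuel = pd.DataFrame({
    "fuel_cost": [4000.0, 4200.0, 3900.0, 4100.0, 4000.0],
    "fuel_qty": [100.0, 100.0, 100.0, 0.1, 100000.0],
})
flagged = flag_ratio_outliers(fuel, "fuel_cost", "fuel_qty")
print(flagged[flagged["suspicious"]])
```

Here the last two rows get flagged, since their cost per unit is roughly 1000x above and 1000x below the median, which is what a units mix-up in the quantity column would look like. A histogram of `np.log10(flagged["ratio"])` would presumably show the same thing as separate little clusters a few orders of magnitude away from the main population.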

Once we understand better the ways in which the data are broken, we can look at different ways we might fix them, or drop records. It might be that there's no standardized way we can fix these mistakes, and thus we either need to fix them one at a time, or just drop them or replace with NA values.
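One possible shape for the "fix it or replace with NA" step, sketched under assumptions: values inside a plausible range are kept, values that land in the range after multiplying by a candidate unit factor are rescaled, and everything else becomes NA. The helper `fix_or_null`, the 10 to 30 range, and the factors of 1000 and 0.001 are illustrative placeholders, not values taken from the FERC data.

```python
import numpy as np
import pandas as pd


def fix_or_null(values, lo, hi, factors=(1000.0, 0.001)):
    """Rescale out-of-range values by a candidate unit factor, else set them to NaN."""
    values = values.astype(float)
    fixed = values.copy()
    in_range = values.between(lo, hi)
    # Anything outside the plausible range starts out as missing...
    fixed[~in_range] = np.nan
    # ...and is filled in only if a candidate factor brings it into range.
    for factor in factors:
        candidate = values * factor
        rescued = ~in_range & fixed.isna() & candidate.between(lo, hi)
        fixed[rescued] = candidate[rescued]
    return fixed


# Hypothetical heat content values, with 10 to 30 assumed to be plausible.
reported = pd.Series([20.0, 0.02, 24000.0, 500.0])
cleaned = fix_or_null(reported, lo=10.0, hi=30.0)
print(cleaned)
```

In this example the second and third values are rescaled back into range, while 500 cannot be explained by either factor and ends up as NA. If more than one factor could rescue the same value, the first one in `factors` wins, so the order would need some thought for real data.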

zaneselvans commented 5 years ago

This issue has been largely dealt with, but there are residual obvious errors in the fuel reporting that we should be able to fix (or avoid), some of which are derived from the broad unit adjustments.