catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Validate EIA Bulk data vs original API source #1896

Closed TrentonBush closed 1 year ago

TrentonBush commented 2 years ago

Does the new data source cover the expected areas at the expected granularity? If it is different, is it still workable?

zaneselvans commented 2 years ago

Did this get done? Is this applicable to the current (state-fuel only) version of the aggregated bulk fuel price data? How serious is the per-row vs. total aggregate MMBTU per unit issue that you mentioned in comments on #1765?

TrentonBush commented 1 year ago

The API data has additional aggregates not present in the bulk data and has slightly different coverage. The advantages of the API are likely either small or would take a large amount of additional work to exploit.

The additional aggregates are of two types: 1) finer-grained fuel type aggregates (such as breaking "petroleum liquids" into DFO, RFO, waste oil, etc.) and 2) alternative groupings (such as "all fossil fuels", "natural gas plus other gas", or "Electric power non-CHP").

  1. The advantage in precision of the fine-grained fuel aggregates is small. This is because many of these smaller categories don't exist in the fuel receipts and costs data -- only DFO, RFO, and waste coal contribute any meaningful MMBTU of fuel receipts, and even they account for only 0.9% of total MMBTU combined since 2013.
  2. The additional aggregates (like "all fossil fuels" or "nat gas plus other gas") could be useful in error checking or possibly for deducing more precise aggregates for redacted items. But that would probably be an involved process of setting up a big linear algebra system, debugging it, and managing tradeoffs between tractable solvers and noisy data.
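
To make the "linear algebra system" idea in point 2 concrete, here is a minimal sketch of how overlapping aggregates could pin down redacted values. Everything in it is an assumption for illustration: the fuel list, the constraint rows, the numbers (invented, in arbitrary MMBTU units), and the helper name `solve_redacted` are hypothetical and are not part of PUDL or the EIA API. It uses only `numpy.linalg.lstsq`.

```python
import numpy as np

# Hypothetical fuel categories; coal and other_gas play the role of redacted values.
FUELS = ["coal", "natural_gas", "other_gas", "petroleum"]

# Each constraint is (coefficients over FUELS, published value).
# All values are invented for illustration only.
CONSTRAINTS = [
    ([0, 1, 0, 0], 50.0),   # natural gas, published directly
    ([0, 0, 0, 1], 10.0),   # petroleum, published directly
    ([0, 1, 1, 0], 55.0),   # "natural gas plus other gas" aggregate
    ([1, 1, 1, 1], 165.0),  # "all fossil fuels" aggregate
]


def solve_redacted(constraints, fuels):
    """Least-squares estimate of per-fuel values from overlapping aggregates.

    Returns the estimates and a flag saying whether the constraints
    determine every fuel uniquely (i.e. the system has full column rank).
    """
    A = np.array([row for row, _ in constraints], dtype=float)
    b = np.array([value for _, value in constraints], dtype=float)
    x, _residuals, rank, _singular_values = np.linalg.lstsq(A, b, rcond=None)
    fully_determined = bool(rank == len(fuels))
    return dict(zip(fuels, x)), fully_determined


estimates, fully_determined = solve_redacted(CONSTRAINTS, FUELS)
print(estimates, fully_determined)
```

In this toy case the aggregates agree exactly, so the redacted values are recovered exactly. With real data the published aggregates would be noisy and may not determine every value; the rank flag is one cheap way to detect that. Keeping estimates non-negative would need a constrained solver (for example `scipy.optimize.nnls`), which is part of the tradeoff between tractable solvers and noisy data mentioned above.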

A few other notes:

zaneselvans commented 1 year ago

We're well beyond the EIA API at this point, so this validation will not happen. Closing.