catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Catalog and categorize PUDL data processing tasks #1401

Closed · zaneselvans closed this issue 2 months ago

zaneselvans commented 2 years ago

We started building PUDL before we understood much about ETL tools or software engineering. As a result, the data transformations PUDL performs aren't particularly well organized in terms of when and where they happen, and they don't have a standard API. This is affecting our ability to refactor the code and integrate new data and functionality.

To better understand what all the moving parts are and how we might assemble them more appropriately, this issue attempts to catalog and then categorize them. This should feed into the Prefect refactoring discussion #840.

This is a work in progress.

Extraction

Inputs

Outputs

Issues

Existing Operations:

Potential Changes / Improvements:

Transformation & Tidying

Basic cleaning and reshaping of the data, bringing columns to the point of having uniform data types, physical units, codes, etc.
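As a toy illustration of what this stage covers (the column names, unit conversion, and code mapping below are hypothetical, not PUDL's actual schema):

```python
import pandas as pd

# Hypothetical mapping for collapsing free-form fuel strings to a vocabulary.
FUEL_CODE_MAP = {"BIT": "coal", "COAL": "coal", "NG": "gas", "GAS": "gas"}

def tidy_fuel_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Sketch of a tidying transform: uniform dtypes, units, and codes."""
    df = raw.copy()
    # Uniform data types: report_year arrives as a string in some partitions.
    df["report_year"] = pd.to_numeric(df["report_year"], errors="coerce").astype("Int64")
    # Uniform physical units: some partitions report kWh, others MWh.
    kwh = df["unit"] == "kWh"
    df.loc[kwh, "net_generation"] /= 1000
    df.loc[kwh, "unit"] = "MWh"
    # Standardized codes: unmapped values become NA for later repair.
    df["fuel_type"] = df["fuel_type"].str.strip().str.upper().map(FUEL_CODE_MAP)
    return df
```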

Applying transformations at different stages / scopes

Examples:

Normalization / Entity Resolution / Harvesting

Reconciliation of data that is reported in multiple places inconsistently. Removal of duplicated information so that the output DB has a single source of truth.
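A minimal sketch of the harvesting idea (not PUDL's actual harvesting code): pick one value per entity from many inconsistent reports, and only trust it when most reports agree.

```python
import pandas as pd

def harvest_attribute(df: pd.DataFrame, entity_key: str, attr: str,
                      min_agreement: float = 0.7) -> pd.Series:
    """Resolve one value of `attr` per entity from inconsistent reports."""
    def consensus(values: pd.Series):
        counts = values.dropna().value_counts()
        if counts.empty:
            return pd.NA
        # Keep the most common value only if enough reports agree on it.
        if counts.iloc[0] / counts.sum() >= min_agreement:
            return counts.index[0]
        return pd.NA
    return df.groupby(entity_key)[attr].agg(consensus)
```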

Data Repair

Replacement of obvious outliers and missing values with best estimates to provide as complete a dataset as possible.
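For example, one such repair might look like the sketch below (the rolling-median outlier test and linear interpolation are illustrative choices, not PUDL's actual method):

```python
import pandas as pd

def repair_series(s: pd.Series, window: int = 12, n_sigmas: float = 4.0) -> pd.Series:
    """Null out gross outliers, then fill all gaps with interpolated estimates."""
    # Flag values that deviate wildly from a local (rolling) median.
    local_median = s.rolling(window, center=True, min_periods=1).median()
    deviation = (s - local_median).abs()
    cleaned = s.mask(deviation > n_sigmas * deviation.std())
    # Fill both the original and newly created gaps with best estimates.
    return cleaned.interpolate(limit_direction="both")
```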

Integration

Integration tasks link cleaned datasets together so that they can be used in combination more effectively. This can also involve linking records within a single dataset.
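A sketch of the simple case, assuming the hard record-linkage work has already produced a crosswalk of matched IDs (column names are illustrative):

```python
import pandas as pd

def integrate_plants(ferc_plants: pd.DataFrame, eia_plants: pd.DataFrame,
                     crosswalk: pd.DataFrame) -> pd.DataFrame:
    """Link cleaned FERC and EIA plant records through a crosswalk table."""
    return (
        ferc_plants
        .merge(crosswalk, on="plant_id_ferc1", how="inner", validate="m:1")
        .merge(eia_plants, on="plant_id_eia", how="left", validate="m:1")
    )
```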

Output / Display

Assembly of existing data for easy use or interpretation by humans. This type of operation can typically be accomplished using straightforward SQL.
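For instance, a denormalized, human-friendly view can come out of a single query (the table and column names below are made up, not the real PUDL schema):

```python
import sqlite3
import pandas as pd

# Re-attach human-readable names to a normalized generation table.
QUERY = """
SELECT g.report_year,
       p.plant_name,
       u.utility_name,
       SUM(g.net_generation_mwh) AS net_generation_mwh
FROM generation AS g
JOIN plants AS p ON p.plant_id = g.plant_id
JOIN utilities AS u ON u.utility_id = p.utility_id
GROUP BY g.report_year, p.plant_name, u.utility_name
"""

with sqlite3.connect("pudl.sqlite") as conn:
    wide_output = pd.read_sql(QUERY, conn)
```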

Analysis

More complex analysis that generates new knowledge not directly available elsewhere. These operations benefit from inputs that are as clean and complete as possible.
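One hypothetical example of such derived knowledge: a capacity factor, which isn't reported directly anywhere but falls out of cleaned generation and capacity data (column names assumed):

```python
import pandas as pd

def capacity_factor(gen: pd.DataFrame, cap: pd.DataFrame) -> pd.DataFrame:
    """Derive annual capacity factors from generation and capacity tables."""
    df = gen.merge(cap, on=["plant_id", "report_year"], how="inner")
    hours_per_year = 8760  # ignoring leap years in this sketch
    df["capacity_factor"] = df["net_generation_mwh"] / (df["capacity_mw"] * hours_per_year)
    return df
```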

bendnorman commented 2 years ago

This is super helpful, thank you! Identifying the parts of our process that aren't uniform seems like a good place to start.

Questions

Potential Changes

Based on your comments above, this is what I’m currently envisioning at a high level:

Make our extraction steps more uniform by extracting our data from Zenodo and loading each partition into a database (possibly using Airbyte).
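A minimal sketch of that extract-and-load step, assuming the raw partitions have already been downloaded from the Zenodo archive as CSVs (file and table names here are hypothetical; Airbyte would replace this hand-rolled loop):

```python
import sqlite3
import pandas as pd

# One raw table per partition, loaded verbatim (all strings, no cleaning yet).
PARTITIONS = {
    "raw_eia923_2019": "eia923-2019.csv",
    "raw_eia923_2020": "eia923-2020.csv",
}

with sqlite3.connect("pudl_raw.sqlite") as conn:
    for table, path in PARTITIONS.items():
        pd.read_csv(path, dtype=str).to_sql(table, conn, if_exists="replace", index=False)
```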

Our transformation steps will then read the raw data from the db into Python and perform some transformations:

Each transformation step could load interim tables with appropriate names back to the db. This database will contain:

Limitations:

These thoughts might be out of scope for this issue but I wanted to jot them down somewhere.

bendnorman commented 2 months ago

This issue was fodder for our transition to an orchestration tool. Most of these issues/improvements were addressed by our migration to Dagster.
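For anyone landing here later, the post-migration shape looks roughly like this: each extract/transform step becomes a Dagster software-defined asset, and Dagster infers the dependency graph from function signatures. (Asset names and logic below are illustrative, not PUDL's actual asset definitions.)

```python
import pandas as pd
from dagster import Definitions, asset

@asset
def raw_fuel() -> pd.DataFrame:
    # In PUDL this would read a raw partition archived on Zenodo.
    return pd.DataFrame({"fuel_type": ["BIT", "NG"], "cost": ["1.0", "2.0"]})

@asset
def clean_fuel(raw_fuel: pd.DataFrame) -> pd.DataFrame:
    # Declaring raw_fuel as a parameter makes it an upstream dependency.
    out = raw_fuel.copy()
    out["cost"] = pd.to_numeric(out["cost"])
    return out

defs = Definitions(assets=[raw_fuel, clean_fuel])
```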