Background

After creating tools to translate the FERC XBRL filings into SQLite databases, we decided that the old Visual FoxPro DBF and new XBRL data will need their own independent extract + transform processes. The new data has much more structure and metadata, and will be quite a bit easier to work with than the historical data, so it doesn't make sense to convert the new data into the old structure just so we can run it through the same old transformations (this is discussed in catalyst-cooperative/pudl#1579).

This means a lot of new code and transformations, and has precipitated a major refactor of the FERC Form 1 transformations -- especially since we are going to be going after many additional Form 1 tables beyond the ones we've already cleaned up and integrated into the PUDL database.

Now that we have access to the raw XBRL data, we've been working on areas in parallel:

Making the XBRL data acquisition & extraction process more robust and automated, and
Re-writing the transform functions to accommodate both XBRL and DBF data, and to be more modular and re-usable as we expand our coverage to additional tables.

FERC XBRL Data Acquisition (@zschira)

This includes the data scraping and archiving process, up to the point of having versioned Zenodo depositions available through the PUDL Datastore, and regularly updated with minimal human intervention.
This work will take place in the pudl-scrapers & pudl-zenodo-storage repositories.

Issues

[x] catalyst-cooperative/pudl#1593
[x] Package ferc-xbrl-extractor so it's installable by PUDL
[x] https://github.com/catalyst-cooperative/pudl-scrapers/issues/41
[x] catalyst-cooperative/pudl-scrapers#45
[x] https://github.com/catalyst-cooperative/pudl-scrapers/issues/42
[x] https://github.com/catalyst-cooperative/ferc-xbrl-extractor/issues/19
[x] https://github.com/catalyst-cooperative/pudl-scrapers/issues/39
[x] https://github.com/catalyst-cooperative/pudl-scrapers/issues/26
[ ] #1418 (important since XBRL RSS posts individual filings)

Pre-extract (@zschira)

This includes everything that takes the raw inputs archived on Zenodo and turns them into coherent SQLite databases which we can archive, publish and use as standalone resources.
We will also produce detailed machine-readable metadata comparable in detail to what is available in the XBRL.
This work will primarily take place in the ferc-xbrl-extractor repository.
Updates that impact the main PUDL repo will be reflected on the xbrl_integration branch.

Release Issues

[x] catalyst-cooperative/pudl#1668
[x] catalyst-cooperative/pudl#1861
- [x] FERC 1: Both DBF and XBRL now
- [x] FERC 714: XBRL for now, CSV later, see catalyst-cooperative/pudl#1859
- [x] FERC 2: XBRL now, DBF later, see catalyst-cooperative/pudl#1859
- [x] FERC 6 & 60: XBRL now, possibly DBF later, see catalyst-cooperative/pudl#1859
[x] catalyst-cooperative/pudl#1667
[x] Release Catalyst packages on conda-forge or isolate pre-extract steps from PUDL repo
- [x] ferc-xbrl-extractor. conda-forge feedstock here
- [x] Arelle. conda-forge feedstock here
[x] Archive & Publish XBRL derived FERC SQlite DBs catalyst-cooperative/pudl#1830
- [x] https://github.com/catalyst-cooperative/ferc-xbrl-extractor/issues/17
- [x] catalyst-cooperative/pudl#1860
- [x] Integrate XBRL to SQLite conversion of all FERC forms into the nightly builds
- [x] Automate Datasette redeployment as part of the nightly builds
- [x] #2080

Post-release Issues

[x] catalyst-cooperative/ferc-xbrl-extractor#33
[ ] #2081
[x] #1420
[x] catalyst-cooperative/pudl#1863

Update Existing ETL

Updating our transformations is a mix of software engineering and data wrangling tasks. We want to get the software somewhat stable and documented before involving lots of people working in parallel on unfamiliar data, and so we've broken this into 3 phases of work:

Phase 1: Software Design and Alpha Testing (@zaneselvans & @cmgosnell)

Get to the point where the fuel_ferc1 and plants_steam_ferc1 are loading into the DB successfully, and provide functionality comparable to the old DBF, extending data coverage through 2021.
Work on this phase will branch off of and be merged into xbrl_steam.
Unit and Integration tests should exist and pass, at which point we can merge the xbrl_steam branch into xbrl_integration.
After this phase, documentation should be good enough, and the design stable enough that we can bring in other people to work on additional tables that we have existing transforms for, and familiarity with.

Issues

[x] catalyst-cooperative/pudl#1739
[x] catalyst-cooperative/pudl#1706
[x] catalyst-cooperative/pudl#1738
[x] catalyst-cooperative/pudl#1722
[x] catalyst-cooperative/pudl#1707
[x] catalyst-cooperative/pudl#1876
[x] catalyst-cooperative/pudl#1853
[x] catalyst-cooperative/pudl#1878
[x] catalyst-cooperative/pudl#1877
[x] catalyst-cooperative/pudl#1705
[x] Merge catalyst-cooperative/pudl#1721
[x] catalyst-cooperative/pudl#1924
[x] Merge catalyst-cooperative/pudl#1962

Phase 2: Beta Testing w/ Familiar Tables (#1801)

Refactor all of our our existing FERC Form 1 transform functions to use the new framework, extending coverage to the 2021 XBRL data.
This will include creating additional transform functions and parameterizations as needed to deal with more kinds of tables and data problems not encountered in the fuel_ferc1 and plants_steam_ferc1 tables.
Based on feedback from this experience we may also make some changes to the transform framework.
Work in this phase will branch off of and merge back into the xbrl_integration branch.
When this phase is complete, and all unit & integration tests are passing, we will merge xbrl_integration into dev in the PUDL repository, and can make an initial release of this data publicly.

Issues

[x] #1981 @cmgosnell
[x] #1801
- [x] catalyst-cooperative/pudl#1802 @cmgosnell
- [x] catalyst-cooperative/pudl#1803 @cmgosnell
- [x] catalyst-cooperative/pudl#1820 @cmgosnell
- [x] catalyst-cooperative/pudl#1735 @aesharpe
[x] catalyst-cooperative/pudl#1807 @zaneselvans & @cmgosnell
[x] Merge catalyst-cooperative/pudl#1665

Phase 3: Integrate New FERC 1 Data & Methods

Now that we've got the 2021 data integrated with and working as well as the 2020 data, we'll move on to expanding coverage to other tables, in both the earlier DBF data and new XBRL data.
At this point the software design will hopefully be stable and able to deal with whatever new problems we encounter.
We will also start integrating new data cleaning methods that we haven't previously employed in our published data.
Work in this phase will branch off of and merge into dev.
Which tables and data cleaning to prioritize will be guided by @arengel & @jrea-rmi, along the lines of #1568

key

⭐ = DBF-XBRL mapping is simple & worth delegating, table is ready to be taken on
🟧 = table is ready to be taken on, but may require new reshaping transforms

Issues

[x] catalyst-cooperative/pudl#2040 @zaneselvans & @aesharpe
[x] #2110 @zaneselvans
[x] catalyst-cooperative/pudl#2012 @zaneselvans
- [x] catalyst-cooperative/pudl#2021 @zaneselvans
- [x] #2075
- [x] catalyst-cooperative/pudl#2014 @zaneselvans
[x] #1804
- [x] catalyst-cooperative/pudl#1807 @zaneselvans & @cmgosnell
- [x] #1805 ⭐ @cmgosnell
- [x] #1806 @cmgosnell
- [x] #1808 (one-to-many, 4 XBRL tables) @zaneselvans
[x] #1809
- [x] #1810 ⭐ @aesharpe
- [x] #1811 (one-to-many, 8 XBRL tables)
[x] #1812
- [x] #1816 (one-to-one) 🟧 @zschira
- [x] #1817 (one-to-one) ⭐ @aesharpe
- [x] #1813 (many-to-one, 2 DBF tables) ⭐ 🟧 @cmgosnell
[x] #1818
- [x] catalyst-cooperative/pudl#1820 @cmgosnell
- [x] #1819 @cmgosnell
[x] #1822

2023 Q1 Issues

[x] Transform other, lower priority RMI tables
- [x] #1815 (one-to-many, 2 XBRL tables)
- [x] #1821 (one-to-many, 8 XBRL tables)
- [x] #1823 (one-to-many, 9 XBRL tables)
- [x] #1824 (one-to-one) 🟧
[ ] catalyst-cooperative/pudl#2016 @zaneselvans & @cmgosnell
- [ ] catalyst-cooperative/pudl#2015 @zaneselvans & @zschira
- [ ] #2076
- [ ] #2179
[ ] Infrastructure Cleanup
- [ ] #2165
- [x] #2160
[ ] #2180
- [ ] #471
- [ ] #2066
- [x] #1980
- [ ] #1825
[ ] Metadata Documentation/Annotation
- [ ] #2074
- [ ] #2173

Icebox

[ ] Integrate total labeler into small/hydro/pumped transform step @aesharpe
[ ] catalyst-cooperative/pudl#1968 @zaneselvans
[ ] Validation
- [ ] #2164

Given the need to integrate a whole second stream of data extraction and transformation into the FERC 1 process here, and the fact that we have wanted to disentangle some of the extraction vs. transformation code (deferring some complicated steps until later in the process) it might be worth thinking about those tasks together. Rather than shoehorning all of the XBRL data into the existing pipeline, and then subsequently disassembling the now even more complicated pipeline.

It seems like after the data is in the PUDL DB, there will be another layer of work that has to happen to integrate it with the FERC-EIA record linkage (and any other downstream analysis), since some of the old columns (record_id) won't exist any more.

There are 4 combinations of data source / existing code. In order we want to tackle them:

PUDL tables from XBRL
Hub tables from XBRL
Hub tables from FoxPro
PUDL tables from FoxPro (into our new transform architecture)

Transform step design thoughts:

we make a generic table transformer w/ some execute() or coordinating method that has standard cleaning/renaming/reshaping/encode/etc steps.
Make clean, nice lil table-specific transformers inheriting from the generic transformer for each of the new XBRL tables
Make a lil shell of the generic transformer for the old foxpro tables that is mostly a wrapper for the existing transform functions
Make a concatenater

Comments on Dagster:

I don't it is possible for part of our ETL to be implemented using Dagster. If the new FERC data was going to be an entirely separate pipeline then I'd say we should use dagster but given it will be intertwined with our existing ETL I don't think we should

However, we should keep some dagster concepts in mind while designing these FERC transformations. I would highly recommend reading the dagster concepts docs and our rough plan for applying these concepts to PUDL.

Each transform function will eventually be turned into a dagster "op" which is a node in the DAG. An op takes some upstream inputs, does some processing then passes the result to the next op in the DAG. Dagster is strict about what can be fed into ops:

An op only starts to execute once all of its inputs have been resolved. Inputs can be resolved in two ways:

The upstream output that the input depends on has been successfully emitted and stored.

The input was stubbed through config.

In other words, the input of an op has to be the output of another op OR configuration. Configuration can come in the form of Resources and Config which allow you to configure DAG wide resources in a yaml file. ETL wide abstractions like the Datastore, pudl_settings and dataset settings will likely become Resources or Config. This means ops won't accept python constants passed into functions. For example:

@op
def extract():
      return pd.read_csv("poopy_data.csv")

@op
def fillnans(df, fill_value):
      return df.fillna(fill_value)

@graph
def run_etl()
      df = extract()
      fillnans(df, "Gas")

won't work because the fill_value parameter is a python string, not the output of another op or a Dagster Resource or Config. You could fix this by configuring the fill_value parameter using Dagster Configs.

Also, these rules input rules only apply to functions you think should be included in the DAG. For example, helper functions probably won't be included in the DAG.

def fillnans(df, fill_value):
      return df.fillna(fill_value)

@op
def transform(df):
     df = fillnans(df, "Gas")
     return df

The dagster refactor and xbrl stuff will probably be happening in parallel so Zane and I will probably have more guidance as we refactor!

Some more thoughts on Dagster:

You can create nested graphs in Dagster to allow for some logic hierarchy. @zaneselvans and I envision incrementally applying dagster graphs to our ETL. There are multiple levels we've identified:

Turn our _etl_{dataset}() functions into ops and construct a graph. This is a very simple DAG but would enable us to run the ETLs in separate processes.
Turn our ETL functions into graphs where the ops are the extract and transform steps.
ETL functions are graphs, E and T steps are graphs and individual table transforms are ops.
ETL functions are graphs, E and T steps are graphs, individual table transforms are graphs and individual reusable cleaning functions are ops.

When we're writing XBRL functions we don't need to think much about the first two levels because we are only making changes within the ferc1.transform() function.

Option 3

I wrote up a little prototype for option 3. It's ok but it seems awkward to have to define all of the transformed tables in the transform graph's outs param:

https://github.com/catalyst-cooperative/pudl/blob/aa36c91766e70eb3a178232655f7c79fb60b6a87/notebooks/work-in-progress/ferc_dagster_prototype.py#L25-L33

This is the recommended method for returning multiple outputs from graphs. I'm curious if it is possible for graphs to treat dictionaries as a single output instead of multiple.

Satisfying option three shouldn't be that difficult because we can use the existing structure in transform.ferc1. I do have a couple of questions:

Do the table transform functions depend on previously cleaned tables? It looks like the plants_steam() depends on the fuel_ferc1() table for plant_id assignment.
Do the transform functions depend on multiple tables in ferc1_raw_dfs? This might not be the case in Form 1 land but I'm pretty sure final EIA transform functions rely on multiple raw tables to produce a single cleaned table.

Option 4

Option 4 is a bit trickier because we want generic cleaning functions that can be parameterized for each table but do not have to use dagster config. Dagster recommends using Op factories for this situation. They work but feel a little kludgy. Here is an example of a generic transform op:

def rename_columns_factory(
    name="default_name",
    ins=None,
    column_mapping=None,
    **kwargs,
):
    """
    Args:
        name (str): The name of the new op.
        ins (Dict[str, In]): Any Ins for the new op. Default: None.

    Returns:
        function: The new op.
    """

    @op(name=name, ins=ins, **kwargs)
    def rename_df(context, df):
        context.log.info(f"\n The DataFrame: {df}\n")
        context.log.info(f"\n The Op Ins: {context.op_def.ins}\n")
        t_df = df.rename(columns=column_mapping)
        context.log.info(f"\n The Transformed DataFrame: {t_df}\n")
        return t_df

    return rename_df

@op
def extract():
    return pd.DataFrame([1,2], columns=["col"])

@job()
def etl():
    df = extract()
    column_mapping = {"col": "column"}
    transformed_df = rename_columns_factory(column_mapping=column_mapping)(df)

etl.execute_in_process()

rename_columns_factory() parametrizes the inner function rename_df() which is an op. It's kind of mind-bending because there is a lot of function wrapping / decorating going on here. If we like this pattern, this is what a dagster friendly version without dagster abstractions would look like:

def rename_columns_factory(
    column_mapping=None,
):
    """
    Args:
        column_mapping: Dict of column rename mappings. 

    Returns:
        function: the rename_df function.
    """
    def rename_df(df):
        print(f"\n The DataFrame: {df}\n")
        t_df = df.rename(columns=column_mapping)
        print(f"\n The Transformed DataFrame: {t_df}\n")
        return t_df

    return rename_df

def extract():
    return pd.DataFrame([1,2], columns=["col"])

def etl():
    df = extract()
    column_mapping = {"col": "column"}
    transformed_df = rename_columns_factory(column_mapping=column_mapping)(df)

etl()

An open question here is where we want to store the transform parameters.

This is kind of an out there idea, but could we develop our own op factory decorator that makes creating op factories a little less verbose. Something like:

from functools import partial

def op_factory(name="default_name", ins=None, **kwargs):
    def decorator(op_func):
        @wraps(op_func)
        def wrapper(**op_kwargs):
            return op(name=name, ins=ins, **kwargs)(partial(op_func, **op_kwargs))

I'm not 100% sure that this would work as is (decorators are confusing), but the idea here is that it could be applied like:

@op_factory
def rename_df_factory(context, df, column_mapping=None):
    context.log.info(f"\n The DataFrame: {df}\n")
    context.log.info(f"\n The Op Ins: {context.op_def.ins}\n")
    t_df = df.rename(columns=column_mapping)
    context.log.info(f"\n The Transformed DataFrame: {t_df}\n")
    return t_df

@op
def extract():
    return pd.DataFrame([1,2], columns=["col"])

@job()
def etl():
    df = extract()
    column_mapping = {"col": "column"}
    transformed_df = rename_df_factory(column_mapping=column_mapping)(df)

Here, when you call rename_df_factory with column_mapping as a keyword arg, the decorator will create a partially specified method, then apply the op decorator to that method, and return the created op.

I updated the FERC cleaning dagster prototype so we don't have to pass the transformed_dfs dictionary between table cleaning functions:

https://github.com/catalyst-cooperative/pudl/blob/52370c2b329acd06c6f74ca04ec8f35c970ccf13/notebooks/work-in-progress/ferc_dagster_prototype.py#L59-L76

Passing individual table depencies makes it easy to understand what tables depend on one another instead of passing around the entire transformed_dfs dictionary.@cmgosnell let me know if this answers some of your questions in the xbrl_steam PR.

It creates this fun dagster graph:

Vocab:

Table transform functions are responsible for cleaning one table form a datasource. The functions typically accept some raw data and return a cleaned dataframe.
Generic cleaning functions are specific steps within table transform functions. For example, renaming columns, filling na values, backfills... These functions typically accept a dataframe to clean and some parameters to specify how the data should be cleaned. For example, a column rename dictionary, fillna value...

Background

We want to develop a way to convert generic cleaning functions to dagster ops so they are documented in the DAG and could be validated using dagster-pandera.

There are two ways to configure ops, ConfigSchema and op factories. Here are a couple of options for Dagster-friendly generic transform functions:

Option 1: Use Dagster `ConfigSchema`s for each generic transform function

Description of ConfigSchema from dagster docs:

Various Dagster entities (ops, assets, resources) can be individually configured. When launching a job that executes (ops), materializes (assets), or instantiates (resources) a configurable entity, you can provide run configuration for each entity. Within the function that defines the entity, you can access the passed-in configuration off of the context. Typically, the provided run configuration values correspond to a configuration schema attached to the op/asset/resource definition. Dagster validates the run configuration against the schema and proceeds only if validation is successful.

A common use of configuration is for a schedule or sensor to provide configuration to the job run it is launching. For example, a daily schedule might provide the day it's running on to one of the ops as a config value, and that op might use that config value to decide what day's data to read.

Configs are mostly used for parameters that are changed between runs. For example, a pudl_out path or dataset processing settings. We initially did not entertain this option because these generic cleaning functions will be used dozens of times and the parameters will not change for different runs. Given we need to store all of the transformation parameters somewhere, we realized it might make sense to set a default dagster config with all of the static generic transform function metadata.

You can view the pseudo code here.

Pros

Can overwrite transform parameters at runtime. (Is this desirable?)
Don't have to use dagster factories, which might confuse some developers.

Cons

Dagster doesn't allow for duplicate op names. For example, the transform_steam graph has an op named rename_columns and an op named rename_columns_2 because rename_columns() is called for the dbf and xbrl dataframes. This prevents us from understanding what config is associated with xbrl and dbf.
Converting our transform metadata to a dagster config might be tricky.
The generic cleaning functions might be hard to test because the parameters come from the dagster context keyword.

Option 2: Pass parameters from transform metadata classes into op factories

This option grabs the transform metadata for a given table in a table transform function and passes the relevant data to an op factory of a generic cleaning function.

You can view the pseudo-code here.

Pros

Easier to test the generic cleaning functions because the context keyword isn't needed.
Can specify the op name instead of having the auto-incrementing suffix for duplicate ops.

Cons

Need to specify the name of the transform op. The pseudo-code manually specifies the name, but it could probably be programmatically generated, given the metadata.
Need to create a factory for every cleaning function. We could probably create a decorator that turns a python function into a parameterizable op.

Option 3: Lookup relevant transform metadata within op factory

This option grabs the relevant transform parameters from the transform metadata structure given the table name.

You can view the pseudo-code here.

Pros

Only need to pass the name of the table name to the op factory.

Cons

Increases coupling. The op factories need to know what the metadata classes look like.

Option 4: Use transformer class instead of dagster ops for generic cleaning functions

We might get diminishing returns applying dagster deeper into our call stack. Are the UI documentation and potential pandera validation worth the complexity of converting generic cleaning functions into ops? Most generic cleaning functions will be applied sequentially so there won't be complex dependencies to manage.

@bendnorman I think I understand these options and it seems like option 2 is our best case at least to move things forward before we fully integrate dagster as our orchestrator. Does this sound right to you?

Yeah, I think option 2 or 4 are the front runners. All options rely on a non-dagster data structure to store cleaning metadata. If we went with option 2, y'all could just create a library of generic cleaning functions that can be converted to ops. If we go with option 4, the cleaning functions would likely be methods in a transformer class that couldn't be easily converted to ops.

I'm kind of cooling off on converting cleaning functions to dagster ops:

One reason for converting cleaning functions to dagster ops is to validate data after each cleaning operation using dagster-pandera. How important is this to us? I think unit testing cleaning functions and using dagster-pandera to validate the outputs of table transform ops should be adequate. We also haven't talked about how to store and access the hundreds of pandera schemas for each cleaning step of every PUDL table.
Cleaning functions could be documented in the dagit UI and parallelized if they are ops. A majority of cleaning functions are applied sequentially, so there aren't many complex dependencies or opportunities to parallelize. It would be nice to view all of the cleaning functions applied to a table in Dagit, but users could also just visit the source code. I posted about this design question in the dagster community, and one engineer at dagster thought it might make more sense to keep our cleaning functions as pure python functions or methods.
It seems like many contributors want to tweak how tables are transformed. If we kept the cleaning functions as pure python, these contributors wouldn't need to learn dagster concepts. However, people contributing new datasets or tables would need to wrap functions in dagster abstractions.
Storing transform metadata and logic in classes could reduce coupling. This way, table transform functions, and cleaning functions don't need to know about the structure of the metadata.

I'm excited to use the integrated dataframe schemas provided by the Pandera/Dagster integration, but there are going to be a whole lot of table-level transformations and storing schemas for every one of them is probably overkill. I think we're going to be persisting more intermediate steps with the new setup than we were previously, and probably it makes sense to just check the schemas of the dataframes that we're persisting, since we'll need to think about those schemas for the database anyway.

It seems like something analogous to Option 4 could also be implemented by composing functions rather than using classes to store the transformation logic. The parameterizations of the transformations would be stored in the Pydantic classes, and then they would be consumed as arguments by individual transform functions. Only the top level table-specific transform functions (the functions containing all of the operations to be applied between persisted versions of a given table, like plants_steam_ferc1) would be Dagster ops, and they would know what table they pertain to, and be able to call both generic cleaning functions (looking up their parameters in the Pydantic data structure using the table ID) and table-specific functions which could be defined within the table-level cleaning function, or declared as _private_functions() elsewhere in the module.

I'm not sure which of these would feel cleaner, and will write some pseudo-code trying them both out.

catalyst-cooperative / pudl

Integrate FERC XBRL data into PUDL #1574

Background

FERC XBRL Data Acquisition (@zschira)

Issues

Pre-extract (@zschira)

Release Issues

Post-release Issues

Update Existing ETL

Phase 1: Software Design and Alpha Testing (@zaneselvans & @cmgosnell)

Issues

Phase 2: Beta Testing w/ Familiar Tables (#1801)

Issues

Phase 3: Integrate New FERC 1 Data & Methods

key

Issues

2023 Q1 Issues

Icebox

Option 3

Option 4

Vocab:

Background

Option 1: Use Dagster `ConfigSchema`s for each generic transform function

Pros

Cons

Option 2: Pass parameters from transform metadata classes into op factories

Pros

Cons

Option 3: Lookup relevant transform metadata within op factory

Pros

Cons

Option 4: Use transformer class instead of dagster ops for generic cleaning functions

catalyst-cooperative / pudl

Integrate FERC XBRL data into PUDL #1574

Background

FERC XBRL Data Acquisition (@zschira)

Issues

Pre-extract (@zschira)

Release Issues

Post-release Issues

Update Existing ETL

Phase 1: Software Design and Alpha Testing (@zaneselvans & @cmgosnell)

Issues

Phase 2: Beta Testing w/ Familiar Tables (#1801)

Issues

Phase 3: Integrate New FERC 1 Data & Methods

key

Issues

2023 Q1 Issues

Icebox

Option 3

Option 4

Vocab:

Background

Option 1: Use Dagster ConfigSchemas for each generic transform function

Pros

Cons

Option 2: Pass parameters from transform metadata classes into op factories

Pros

Cons

Option 3: Lookup relevant transform metadata within op factory

Pros

Cons

Option 4: Use transformer class instead of dagster ops for generic cleaning functions

Option 1: Use Dagster `ConfigSchema`s for each generic transform function