catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
471 stars 108 forks

Use new metadata instead of PUDL_TABLES #1409

Closed zaneselvans closed 2 years ago

zaneselvans commented 2 years ago

The pudl.constants.PUDL_TABLES dictionary defines what database table names are valid arguments for the ETL process, but at this point that information should be stored elsewhere, either in the Pydantic models that define the database schema, or the models that are used for ETL settings validation. Remove pudl.constants.PUDL_TABLES and derive the same information from these other metadata sources instead. This might require some rejiggering of how these values are used, or how a subset of all the resource definitions can be associated directly with the data source (eia923, eia860, ferc1, etc.).

The requested tables are kind of like partitions, but they're the outputs, not the inputs. I'm thinking that the list of valid tables which can be requested should be dynamically generated from the metadata classes, but I think I may need to add another tag to each of the Resource definitions, to make it clear which dataset each of them belongs to -- there's group now, but all the EIA data are under one umbrella. And there are sources, but some tables come from more than one source. It seems like these attributes maybe need a better organizing principle.
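
A minimal sketch of what that tagging could look like. The etl_group key, the tables_for_dataset helper, and the contents of the dictionary are all hypothetical stand-ins for illustration, not the actual PUDL metadata:

```python
from typing import Any

# Illustrative stand-in for the resource metadata. The "etl_group" key is a
# hypothetical tag naming the one dataset each table belongs to, which stays
# unambiguous even when "sources" lists more than one data source.
RESOURCE_METADATA: dict[str, dict[str, Any]] = {
    "generators_eia860": {"sources": ["eia860"], "etl_group": "eia860"},
    "generation_fuel_eia923": {"sources": ["eia923"], "etl_group": "eia923"},
    "generators_entity_eia": {
        "sources": ["eia860", "eia923"],
        "etl_group": "entity_eia",
    },
    "fuel_ferc1": {"sources": ["ferc1"], "etl_group": "ferc1"},
}


def tables_for_dataset(dataset: str) -> list[str]:
    """Return the sorted names of all tables tagged with the given dataset."""
    return sorted(
        name
        for name, meta in RESOURCE_METADATA.items()
        if meta["etl_group"] == dataset
    )
```

With a single-valued tag like this, a multi-source table such as generators_entity_eia lands in exactly one group instead of showing up under every source it draws from.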

bendnorman commented 2 years ago

I tried to infer WORKING_TABLES from sources, but some tables have multiple sources, and I wasn't sure how to parse out supplemental tables like fuel_transportation_modes_eia and generators_entity_eia. Have these types of tables been moved to resources/eia.py?

Maybe instead of using sources, each settings class could use RESOURCE_METADATA from the dataset's resources/{dataset}.py file.
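
Roughly, and only as a sketch: the two dictionaries below stand in for what each per-dataset module would export, and the DatasetSettings / EiaSettings classes are made-up names using plain Python rather than the real Pydantic settings models:

```python
# Stand-ins for the RESOURCE_METADATA dictionaries exported by each
# per-dataset module; the contents are illustrative only.
EIA_RESOURCE_METADATA = {
    "fuel_transportation_modes_eia": {},
    "generators_entity_eia": {},
}
FERC1_RESOURCE_METADATA = {"fuel_ferc1": {}}


class DatasetSettings:
    """Hypothetical base class whose valid tables come from dataset metadata."""

    resource_metadata: dict = {}

    def __init__(self, tables=None):
        valid = set(self.resource_metadata)
        # Requesting no tables explicitly means "all tables for this dataset".
        requested = valid if tables is None else set(tables)
        unknown = requested - valid
        if unknown:
            raise ValueError(f"Unknown tables: {sorted(unknown)}")
        self.tables = sorted(requested)


class EiaSettings(DatasetSettings):
    resource_metadata = EIA_RESOURCE_METADATA


class Ferc1Settings(DatasetSettings):
    resource_metadata = FERC1_RESOURCE_METADATA
```

That way the supplemental tables are valid for a dataset simply because they are defined in that dataset's module, without needing to look at sources at all.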

zaneselvans commented 2 years ago

@cmgosnell brought up the issue of whether we even want users to be able to modify which tables the ETL produces based on the input settings. We only test that processing all of them together works, and because entity harvesting / resolution draws data from many different un-normalized data tables to construct the authoritative entity tables, the outputs get wonky quickly if you don't have all of the data tables.

So maybe the settings should only allow a user to specify the partitions (years/states/etc.) of data and the data sources, and all data tables associated with those sources will be integrated.
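
Something like the following, as a sketch only. It uses a stdlib dataclass in place of the Pydantic settings models, and TABLES_BY_SOURCE, WORKING_YEARS, SourceSettings, and the year ranges are all placeholder names and values, not real PUDL constants:

```python
from dataclasses import dataclass

# Hypothetical mapping from each data source to every table its ETL produces.
TABLES_BY_SOURCE = {
    "eia860": ("generators_eia860",),
    "eia923": ("generation_fuel_eia923",),
}

# Placeholder partitions; the real set of working years may differ.
WORKING_YEARS = {
    "eia860": range(2001, 2021),
    "eia923": range(2001, 2021),
}


@dataclass(frozen=True)
class SourceSettings:
    """User-facing settings: a data source and its partitions, no table list."""

    source: str
    years: tuple[int, ...]

    def __post_init__(self):
        if self.source not in TABLES_BY_SOURCE:
            raise ValueError(f"Unknown data source: {self.source}")
        bad_years = [y for y in self.years if y not in WORKING_YEARS[self.source]]
        if bad_years:
            raise ValueError(f"Invalid years for {self.source}: {bad_years}")

    @property
    def tables(self) -> tuple[str, ...]:
        """All tables for the source are always integrated together."""
        return TABLES_BY_SOURCE[self.source]
```

Here tables is read-only and derived from the source, so a user can shrink the partitions but can never end up with a partial set of tables.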

In development / testing / debugging we'll need to be able to limit / add tables, but that can be done via internal changes.