Closed zaneselvans closed 2 years ago
I tried to infer `WORKING_TABLES` from `sources`, but some tables have multiple sources and I wasn't sure how to parse out supplemental tables like `fuel_transportation_modes_eia` and `generators_entity_eia`. Have these types of tables been moved to `resources/eia.py`?
Maybe instead of using `source`, each settings class could use `RESOURCE_METADATA` from the dataset's `resource/{dataset}.py` file.
@cmgosnell brought up the issue of whether we even want users to be able to modify which tables the ETL produces, based on the input settings. We only test that processing all of them together works. And because entity harvesting / resolution draws data from many different un-normalized data tables to construct the authoritative entity tables, the outputs get wonky quickly if you don't have all of the data tables.
So maybe the settings should only allow a user to specify the partitions (years/states/etc.) of data and the data sources, and all data tables associated with those sources will be integrated.
In development / testing / debugging we'll need to be able to limit / add tables, but that can be done via internal changes.
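A minimal sketch of what partition-only settings could look like. This uses a plain dataclass rather than the real Pydantic settings models, and `DatasetSettings`, `WORKING_YEARS`, and the year ranges are all hypothetical placeholders, not PUDL's actual names or values:

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical working partitions, for illustration only.
WORKING_YEARS = {
    "eia860": range(2001, 2021),
    "eia923": range(2001, 2021),
}


@dataclass(frozen=True)
class DatasetSettings:
    """Settings for one data source: partitions only, no table list."""

    dataset: str
    years: Tuple[int, ...]

    def __post_init__(self):
        # Reject data sources we have no partitions for.
        if self.dataset not in WORKING_YEARS:
            raise ValueError(f"Unknown dataset: {self.dataset}")
        # Reject any requested year outside the working partitions.
        bad_years = sorted(set(self.years) - set(WORKING_YEARS[self.dataset]))
        if bad_years:
            raise ValueError(
                f"Years not available for {self.dataset}: {bad_years}"
            )


settings = DatasetSettings(dataset="eia860", years=(2019, 2020))
```

There is deliberately no `tables` field here, so every table associated with the dataset would always be processed.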
The `pudl.constants.PUDL_TABLES` dictionary defines what database table names are valid arguments for the ETL process, but at this point that information should be stored elsewhere, either in the Pydantic models that define the database schema, or the models that are used for ETL settings validation. Remove `pudl.constants.PUDL_TABLES` and derive the same information from these other metadata sources instead. This might require some rejiggering of how these values are used, or how a subset of all the resource definitions can be associated directly with the data source (eia923, eia860, ferc1, etc.).
The requested tables are kind of like partitions, but they're the outputs, not the inputs. I'm thinking that the list of valid tables which can be requested should be dynamically generated from the metadata classes, but I think I may need to add another tag to each of the Resource definitions, to make it clear which dataset each of them belongs to. There's `group` now, but all the EIA data are under one umbrella. And there are `sources`, but some tables come from more than one source. It seems like these attributes maybe need a better organizing principle.
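A rough sketch of generating the valid tables dynamically, assuming each resource gets one extra tag. The `dataset` key, the helper `tables_by_dataset`, and the metadata values below are made up for illustration and do not reflect the real resource definitions:

```python
from collections import defaultdict

# Hypothetical stand-in for the resource metadata. "group" and "sources"
# exist today; "dataset" is the proposed extra tag. Values are invented.
RESOURCE_METADATA = {
    "generators_entity_eia": {
        "group": "eia",
        "sources": ["eia860", "eia923"],
        "dataset": "eia",
    },
    "fuel_transportation_modes_eia": {
        "group": "eia",
        "sources": ["eia923"],
        "dataset": "eia923",
    },
}


def tables_by_dataset(metadata):
    """Map each dataset tag to the sorted list of tables that carry it."""
    tables = defaultdict(list)
    for name, resource in metadata.items():
        tables[resource["dataset"]].append(name)
    return {dataset: sorted(names) for dataset, names in tables.items()}


VALID_TABLES = tables_by_dataset(RESOURCE_METADATA)
```

Because each table carries exactly one `dataset` tag, a table with several `sources` still maps to a single dataset, which is the ambiguity that `sources` alone can't resolve.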