catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

compartmentalize ETL & glue based on data sets #195

Closed cmgosnell closed 5 years ago

cmgosnell commented 6 years ago

We want users to be able to ingest only the data sets they care about. To do that, the ETL process should be partitioned so that anyone can ingest just the data sets they are working on, plus only the glue connecting those data sets. Right now that only means FERC, CEMS, and EIA, but it should be possible, for example, to ingest 860 and CEMS and nothing else.

We can use the years passed to init_db to determine whether a data set is being ingested, and then populate a small table with one record per data source and a boolean indicating whether it has been ingested, for future reference in the outputs and analysis.
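
A rough sketch of what that status table could look like, built from the years arguments. The source names, the `ingestion_status` helper, and the shape of the years mapping are all assumptions for illustration, not the actual init_db interface:

```python
# Hypothetical list of data sources; the real names used by init_db may differ.
ALL_SOURCES = ("eia860", "eia923", "ferc1", "epacems")


def ingestion_status(years_by_source):
    """Return one record per data source with a boolean 'ingested' flag.

    A source counts as ingested only if at least one year was requested
    for it. A missing key and an empty list are both treated as "skip".
    """
    return [
        {"source": source, "ingested": bool(years_by_source.get(source))}
        for source in ALL_SOURCES
    ]


# Example: ingest 860 and CEMS but nothing else.
status = ingestion_status({"eia860": [2016, 2017], "epacems": [2017], "ferc1": []})
for record in status:
    print(record)
```

The resulting records could then be written to a small table in the database, so that the output and analysis code can check which sources are available before querying them.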

cmgosnell commented 6 years ago

One can now change the default arguments in the init script to skip any data set, but there is more to do: changing the default arguments to ingest nothing unless otherwise stated, creating a configuration file for the init_pudl script, and creating dependencies in the output and analysis modules. We could also require paired data sets to be ingested together, i.e. you can't pull in 923 without 860, and you can't pull in CEMS without 860. This would make debugging any ingest process more arduous, but it would necessarily make for a more complete ingested database.
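
As a rough illustration of the pairing idea, here is a minimal sketch of a dependency check that could run before any ingest starts. The `REQUIRES` mapping, the source names, and the `missing_dependencies` helper are hypothetical and not part of the existing init script:

```python
# Hypothetical pairing rules taken from the examples above:
# 923 needs 860, and CEMS needs 860.
REQUIRES = {
    "eia923": {"eia860"},
    "epacems": {"eia860"},
}


def missing_dependencies(selected):
    """Map each selected source to the required sources that were not selected."""
    selected = set(selected)
    missing = {}
    for source in sorted(selected):
        needed = REQUIRES.get(source, set()) - selected
        if needed:
            missing[source] = sorted(needed)
    return missing


# 923 on its own is rejected; 860 plus CEMS is fine.
print(missing_dependencies(["eia923"]))
print(missing_dependencies(["eia860", "epacems"]))
```

An init script could refuse to run, or just warn, whenever this returns a non-empty mapping. Which of the two is preferable depends on how strict we want the pairing to be.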