catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Separate extraction process from ingest #131

Closed: cmgosnell closed this issue 6 years ago

cmgosnell commented 6 years ago

The extraction portion of the process will include a module for each datasource. These modules will incorporate the current datasource.py modules, which pull from the datasource to create the raw dataframes. The extraction modules will now also include the first steps of the current ingest functions for each table (i.e., selecting the relevant section of the applicable dataframe and discarding the columns that aren't needed). Each table-specific extraction function should output a dataframe.
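As a rough illustration of the proposal above, a table-specific extraction function might look something like the sketch below. Everything here is hypothetical: `load_raw_example`, `extract_plants`, and the column names are made up for illustration and are not taken from the PUDL codebase. The sketch assumes the raw dataframe carries a column identifying which table each row belongs to.

```python
import pandas as pd


def load_raw_example() -> pd.DataFrame:
    """Hypothetical stand-in for what a datasource module would return."""
    return pd.DataFrame(
        {
            "table": ["plants", "plants", "generators"],
            "plant_id": [1, 2, 1],
            "name": ["A", "B", "G1"],
            "notes": ["x", "y", "z"],
        }
    )


def extract_plants(raw: pd.DataFrame) -> pd.DataFrame:
    """Select the plants section of the raw dataframe and keep needed columns.

    The column list is an assumption made for this example only.
    """
    # Take the section of the raw dataframe that belongs to this table.
    section = raw.loc[raw["table"] == "plants"]
    # Discard the columns that this table does not need.
    return section[["plant_id", "name"]].reset_index(drop=True)


plants_df = extract_plants(load_raw_example())
```

Each extraction function takes the raw dataframe and returns a new dataframe for one table, so the functions can be called and tested independently of one another.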

End result:

cmgosnell commented 6 years ago

We've decided to include only the first two bullets from the end result above in this section. Any dropping of columns will be moved into the transform section.
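A minimal sketch of that revised split, assuming the same kind of raw dataframe as the proposal describes: the extract step only selects a table's rows and keeps every column, while the transform step is where columns get dropped. The function names (`extract_generators`, `transform_generators`) and the column names are hypothetical and not taken from the PUDL codebase.

```python
import pandas as pd


def extract_generators(raw: pd.DataFrame) -> pd.DataFrame:
    """Extract step: select this table's rows, keeping all columns."""
    return raw.loc[raw["table"] == "generators"].reset_index(drop=True)


def transform_generators(extracted: pd.DataFrame) -> pd.DataFrame:
    """Transform step: drop columns that are not needed (assumed list)."""
    return extracted.drop(columns=["notes"])


# Hypothetical raw data used only to exercise the two steps.
raw_df = pd.DataFrame(
    {
        "table": ["plants", "generators", "generators"],
        "plant_id": [1, 1, 2],
        "name": ["A", "G1", "G2"],
        "notes": ["x", "y", "z"],
    }
)

extracted_df = extract_generators(raw_df)
transformed_df = transform_generators(extracted_df)
```

Keeping the column dropping out of extraction means the extracted dataframe still mirrors the source data, which can make it easier to inspect what came in before any cleanup happens.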