catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Transition load step to load datapackages #294

Closed · cmgosnell closed this issue 5 years ago

cmgosnell commented 5 years ago

Once the data package generation process is complete (issue #293), we'll need to transition the load step to take:

cmgosnell commented 5 years ago

Do we want to keep the ETL process constrained for each data set? We had previously done this to reduce the need to hold all of the data sets in memory at once, but from my understanding we'll need to prepare all of the tables we want for each package and load them into the data package all together.

Or can we load some tables into a package and add more later? If that's the case, we'll probably want to continue keeping the ETL process for each data set contained.

zaneselvans commented 5 years ago

I don't understand the distinction you're making here. Do you mean within a particular run of the ETL process? Like, should all of the dataframes containing all of the data be generated before we output any of them? I don't think that would work because of memory constraints (at least not if datasets like CEMS or the ISO/RTO or FERC EQR are involved).

The datapackage.json file isn't really "connected" to the CSV files in any way -- the metadata is only constrained to match the data itself by virtue of the validation process. So we can output whatever JSON we want, and then output the corresponding CSV files one at a time however we like, so long as we're ensuring that the CSVs match what we said in the JSON (otherwise validation will fail).
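
To make that concrete, here's a minimal sketch of the decoupling, assuming the frictionless `goodtables` library is what does the validation; the table name, schema, and file paths are made up for illustration:

```python
# Sketch only: the descriptor and the CSVs are written independently, and only
# validation ties them together. Assumes the `goodtables` library; the table
# name, schema, and paths below are hypothetical.
import json
import os

import pandas as pd
from goodtables import validate

os.makedirs("data", exist_ok=True)

# 1. Write the full datapackage.json up front, describing every table we intend to ship.
descriptor = {
    "name": "pudl-example",
    "resources": [
        {
            "name": "plants_eia860",
            "path": "data/plants_eia860.csv",
            "schema": {
                "fields": [
                    {"name": "plant_id_eia", "type": "integer"},
                    {"name": "plant_name", "type": "string"},
                ]
            },
        },
        # ...one resource entry per table...
    ],
}
with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)

# 2. Write the CSVs one at a time, whenever each dataframe is ready.
df = pd.DataFrame({"plant_id_eia": [1, 2], "plant_name": ["Plant A", "Plant B"]})
df.to_csv("data/plants_eia860.csv", index=False)

# 3. Validation is what enforces that the CSVs match what the JSON promised.
report = validate("datapackage.json", preset="datapackage")
print(report["valid"])
```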

We could even build up a library of individual CSV files that we use as feedstocks for the data packages, and only re-generate the new CSVs when something changes in the code or data. Then putting together a data package would just be outputting the correct JSON, and copying the corresponding CSVs into the data directory of the datapackage.
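
A rough sketch of what that assembly step could look like, assuming a local cache directory of already-generated CSVs (the paths and table names here are hypothetical):

```python
# Sketch only: assemble a data package's data directory from a pre-built CSV
# cache, without re-running any ETL. Paths and `wanted_tables` are hypothetical.
import shutil
from pathlib import Path

csv_cache = Path("csv_cache")            # CSVs regenerated only when code/data change
pkg_dir = Path("pudl-datapackage/data")  # data directory of the new package
pkg_dir.mkdir(parents=True, exist_ok=True)

wanted_tables = ["plants_eia860", "fuel_ferc1"]
for table in wanted_tables:
    # Copy the already-generated CSV into the package directory.
    shutil.copy(csv_cache / f"{table}.csv", pkg_dir / f"{table}.csv")

# The matching datapackage.json (one resource entry per copied table) would be
# written separately, e.g. as in the earlier sketch.
```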

cmgosnell commented 5 years ago

Oh no, I'm not talking about the metadata... I'm wondering if we can generate a datapackage with one data set and then add another data set to it later.

I do love the idea of having one distinct process that generates CSVs for any/all tables and stores them somewhere, with a separate process that generates the packages. This might resolve my weird desire to create the datapackage in a piecemeal fashion.

cmgosnell commented 5 years ago

@zaneselvans and I just had a convo about how this should all look. My notes are below:

- Need to know: is there an id field or another good way to tell that various dps came from the same version/release?
- Need to know: how would we go about taking an existing dp and chopping it into pieces? Are there existing dp tools to easily pull subsets of larger dps and generate their metadata? (One possible approach is sketched below.)
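
A hedged sketch of one way the subsetting might work with the `datapackage` Python library (the resource names are hypothetical, and this doesn't answer whether a ready-made tool already exists for it):

```python
# Sketch only: carve a smaller data package out of a larger one by keeping a
# subset of its resources. Assumes the `datapackage` Python library; the
# resource names in `wanted` are hypothetical.
import json

from datapackage import Package

big = Package("datapackage.json")

wanted = {"plants_eia860", "generators_eia860"}
subset = dict(big.descriptor)
subset["resources"] = [
    res.descriptor for res in big.resources if res.name in wanted
]

# Write the descriptor for the smaller package; the referenced CSVs would be
# copied into its data/ directory separately (e.g. as in the cache sketch above).
with open("eia860-datapackage.json", "w") as f:
    json.dump(subset, f, indent=2)
```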