catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Transition load step to load datapackages #294

Closed cmgosnell closed 5 years ago

cmgosnell commented 5 years ago

Once the data package generation process is complete (issue #293), we'll need to transition the load step to take:

cmgosnell commented 5 years ago

Do we want to keep the ETL process contained for each dataset? We had previously done this to avoid holding all of the data in memory at once, but from my understanding we'll need to prepare all of the tables we want for each package and load them as a data package all together.

Or can we load some tables into a package and add more later? If that is the case, we'll probably still want to keep the ETL process for each dataset contained.

zaneselvans commented 5 years ago

I don't understand the distinction you're making here. Do you mean within a particular run of the ETL process? Like, should all of the dataframes containing all of the data be generated before we output any of them? I don't think that would work because of memory constraints (at least not if datasets like CEMS or the ISO/RTO or FERC EQR are involved).

The datapackage.json file isn't really "connected" to the CSV files in any way -- the metadata is only constrained to match the data itself by virtue of the validation process. So we can output whatever JSON we want, and then output the corresponding CSV files one at a time however we like, so long as we're ensuring that the CSVs match what we said in the JSON (otherwise validation will fail).
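For concreteness, a minimal sketch of that decoupling (not PUDL code; the package name, table, file layout, and the use of the frictionless `validate` function are illustrative assumptions): the descriptor gets written first, the CSVs get written afterwards one at a time, and only the final validation step checks that the two agree.

```python
import csv
import json
from pathlib import Path

from frictionless import validate  # successor to the 2019-era goodtables tooling

# Write the descriptor first; nothing checks it against the data yet.
descriptor = {
    "name": "pudl-example",  # hypothetical package name
    "resources": [
        {
            "name": "plants_eia860",  # hypothetical table
            "path": "data/plants_eia860.csv",
            "schema": {
                "fields": [
                    {"name": "plant_id_eia", "type": "integer"},
                    {"name": "plant_name", "type": "string"},
                ]
            },
        },
    ],
}
Path("data").mkdir(exist_ok=True)
Path("datapackage.json").write_text(json.dumps(descriptor, indent=2))

# Later, and one table at a time, write the CSVs the descriptor promised.
with open("data/plants_eia860.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["plant_id_eia", "plant_name"])
    writer.writerow([3, "Barry"])

# Only this step ties the metadata to the data: if a CSV doesn't match the
# schema declared in datapackage.json, the report comes back invalid.
report = validate("datapackage.json")
print(report.valid)
```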

We could even build up a library of individual CSV files that we use as feedstocks for the data packages, and only re-generate the new CSVs when something changes in the code or data. Then putting together a data package would just be outputting the correct JSON, and copying the corresponding CSVs into the data directory of the datapackage.
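A rough sketch of what that assembly step might look like, assuming a hypothetical `csv_library/` directory that holds one pre-generated `<table>.csv` and `<table>.schema.json` per table (the layout, names, and helper are made up for illustration, not an existing PUDL interface):

```python
import json
import shutil
from pathlib import Path

def assemble_datapackage(tables, csv_library, out_dir, name="pudl-example"):
    """Build a data package by copying cached CSVs and writing a descriptor.

    Assumes csv_library/ holds one <table>.csv and one <table>.schema.json
    per table -- a hypothetical layout, not PUDL's actual one.
    """
    out_dir = Path(out_dir)
    data_dir = out_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    resources = []
    for table in tables:
        src = Path(csv_library) / f"{table}.csv"
        shutil.copy(src, data_dir / src.name)  # reuse the cached CSV as-is
        schema = json.loads((Path(csv_library) / f"{table}.schema.json").read_text())
        resources.append({
            "name": table,
            "path": f"data/{table}.csv",
            "schema": schema,
        })

    descriptor = {"name": name, "resources": resources}
    (out_dir / "datapackage.json").write_text(json.dumps(descriptor, indent=2))
    return out_dir / "datapackage.json"

# e.g. assemble_datapackage(["plants_eia860", "fuel_ferc1"],
#                           csv_library="csv_library", out_dir="pudl-pkg")
```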

cmgosnell commented 5 years ago

Oh no, I'm not talking about the metadata... I'm wondering if we can generate a datapackage with one dataset and then add another dataset to it later.

I do love the idea of having one distinct process that generates CSVs for any/all tables and stores them somewhere, with a separate process that generates the packages. This maybe resolves my weird desire to create the datapackage in a piecemeal fashion.

cmgosnell commented 5 years ago

@zaneselvans and I just had a convo about how this should all look. My notes are below:

- Need to know: is there an id field, or another good way to tell that various data packages came from the same version/release?
- Need to know: how would we go about taking an existing data package and chopping it into pieces? Are there existing data package tools to easily pull subsets of larger packages and generate their metadata? A sketch of both ideas follows.
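One way both questions might be handled with plain descriptor manipulation (a sketch under assumptions: the shared-`id` convention, helper names, and file layout below are not existing PUDL or Frictionless tooling): the Data Package spec allows an `id` property, so every package produced by one ETL run could carry the same release identifier, and a subset package is just a new descriptor whose `resources` list has been filtered down, plus copies of the matching CSVs.

```python
import json
import shutil
import uuid
from pathlib import Path

def tag_release(descriptor, release_id=None):
    """Stamp a descriptor with a shared release identifier.

    Every package produced by the same ETL run would get the same value,
    so they can later be recognized as belonging together.
    """
    descriptor["id"] = release_id or str(uuid.uuid4())
    return descriptor

def subset_datapackage(src_dir, out_dir, keep_tables):
    """Carve a smaller data package out of an existing one.

    Copies only the requested resources' CSVs and writes a new descriptor
    whose resources list is filtered to match.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    descriptor = json.loads((src_dir / "datapackage.json").read_text())

    kept = [r for r in descriptor["resources"] if r["name"] in keep_tables]
    (out_dir / "data").mkdir(parents=True, exist_ok=True)
    for resource in kept:
        shutil.copy(src_dir / resource["path"], out_dir / resource["path"])

    new_descriptor = dict(descriptor, resources=kept)
    (out_dir / "datapackage.json").write_text(json.dumps(new_descriptor, indent=2))
```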