Do we want to keep the ETL process constrained to each data set? We had previously done this to avoid holding all of the data in memory at once, but from my understanding we'll need to prepare all of the tables we want for each package and load them together as data packages.
Or can we load some tables into a package and add more later? If that is the case, we'll probably continue to want to keep the ETL process for each data set contained.
I don't understand the distinction you're making here. Do you mean within a particular run of the ETL process? Like, should all of the dataframes containing all of the data be generated before we output any of them? I don't think that would work because of memory constraints (at least not if datasets like CEMS or the ISO/RTO or FERC EQR are involved).
The `datapackage.json` file isn't really "connected" to the CSV files in any way -- the metadata is only constrained to match the data itself by virtue of the validation process. So we can output whatever JSON we want, and then output the corresponding CSV files one at a time however we like, so long as we're ensuring that the CSVs match what we said in the JSON (otherwise validation will fail).
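As a rough sketch of that decoupling (assuming we use the `frictionless` Python package for validation -- any Table Schema validator would work the same way -- with made-up table and field names):

```python
import json
from pathlib import Path

import pandas as pd
from frictionless import validate  # assumption: this is the validation tool we'd use

pkg_dir = Path("pudl-datapackage")
(pkg_dir / "data").mkdir(parents=True, exist_ok=True)

# Write the metadata first: it's just a JSON file describing resources
# that don't need to exist on disk yet.
descriptor = {
    "name": "pudl-example",
    "resources": [{
        "name": "plants_eia860",
        "path": "data/plants_eia860.csv",
        "schema": {"fields": [
            {"name": "plant_id_eia", "type": "integer"},
            {"name": "plant_name", "type": "string"},
        ]},
    }],
}
(pkg_dir / "datapackage.json").write_text(json.dumps(descriptor, indent=2))

# Then write the CSVs one table at a time, whenever each one is ready.
df = pd.DataFrame({"plant_id_eia": [1, 2], "plant_name": ["Plant A", "Plant B"]})
df.to_csv(pkg_dir / "data" / "plants_eia860.csv", index=False)

# Validation is the only thing tying the two together.
report = validate(str(pkg_dir / "datapackage.json"))
print(report.valid)
```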
We could even build up a library of individual CSV files that we use as feedstocks for the data packages, and only re-generate the new CSVs when something changes in the code or data. Then putting together a data package would just be outputting the correct JSON, and copying the corresponding CSVs into the `data/` directory of the datapackage.
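Assembling a package from that CSV library might look something like this (the `csv-library` layout, the stored `schemas.json`, and the `build_datapackage` helper are all hypothetical):

```python
from __future__ import annotations

import json
import shutil
from pathlib import Path

# Hypothetical layout: a library of pre-generated CSVs plus a schemas.json
# holding the Table Schema for each table, saved when the CSVs were generated.
CSV_LIBRARY = Path("csv-library")
SCHEMAS = json.loads((CSV_LIBRARY / "schemas.json").read_text())

def build_datapackage(name: str, tables: list[str], out_dir: Path) -> None:
    """Assemble a datapackage by copying CSVs and writing the matching JSON."""
    data_dir = out_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    resources = []
    for table in tables:
        shutil.copy(CSV_LIBRARY / f"{table}.csv", data_dir / f"{table}.csv")
        resources.append({
            "name": table,
            "path": f"data/{table}.csv",
            "schema": SCHEMAS[table],
        })
    descriptor = {"name": name, "resources": resources}
    (out_dir / "datapackage.json").write_text(json.dumps(descriptor, indent=2))

build_datapackage(
    name="pudl-eia860",
    tables=["plants_eia860", "generators_eia860"],
    out_dir=Path("out/pudl-eia860"),
)
```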
Oh no, I'm not talking about the metadata... I'm wondering if we can generate a datapackage with one data set and then add another data set to it later.
I do love the idea of having one distinct process that generates CSVs for any/all tables and stores them somewhere, with a separate process that generates the packages. That might resolve my weird desire to create the datapackage in a piecemeal fashion.
@zaneselvans and I just had a convo about how this should all look. My notes are below:
- Need to know: is there an id field or another good way to tell that various dps came from the same version/release?
- Need to know: how would we go about taking an existing dp and chopping it into pieces? Are there existing dp tools to easily pull subsets of larger dps and generate their metadata? (One rough possibility is sketched below.)
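If no existing dp tool handles the chopping-up, a subset could probably be carved out by filtering the resources in the descriptor and copying the matching CSVs. A minimal sketch, with hypothetical paths and resource names:

```python
from __future__ import annotations

import json
import shutil
from pathlib import Path

def subset_datapackage(src: Path, dst: Path, keep: set[str]) -> None:
    """Carve a smaller datapackage out of a larger one by keeping only the
    named resources and copying their CSVs into the new package."""
    descriptor = json.loads((src / "datapackage.json").read_text())
    kept = [r for r in descriptor["resources"] if r["name"] in keep]
    descriptor["resources"] = kept
    (dst / "data").mkdir(parents=True, exist_ok=True)
    for resource in kept:
        shutil.copy(src / resource["path"], dst / resource["path"])
    (dst / "datapackage.json").write_text(json.dumps(descriptor, indent=2))

# e.g. pull just one table's resource out of a bigger package:
subset_datapackage(Path("big-package"), Path("small-package"),
                   keep={"plants_eia860"})
```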
Once the data package generation process is complete (issue #293), we'll need to transition the load step to take: