catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

One big package or many small packages? Or Both?! #319

Closed. cmgosnell closed this issue 5 years ago

cmgosnell commented 5 years ago

We have a design decision to make regarding how we bundle up the data packages. For speed and usability, it would be nice to be able to publish data-source-specific packages.

We were thinking it might be good to generate one mega datapackage with everything, and then pull from that package to generate the data-source-specific packages that we actually publish. The main reasoning for this was to ensure that the small packages are generated at the same time (if the cleaning process or underlying data change, we don't want a newly published package talking to an old package).

If we go the many-packages route:

Any thoughts on any of this @lwinfree @roll @zaneselvans??

lwinfree commented 5 years ago

Thanks for the write-up @cmgosnell! From our end, it sounds like creating a mega-datapackage is technically feasible. It is probably also the best option for keeping the datasets 'linked', since our foreign key functionality is currently only supported inside a datapackage (i.e., it isn't supported outside/across multiple datapackages). @roll - any other thoughts? As we said on the call today, this decision can (and should) be made while the work is being done, and is not necessarily blocking at this point.
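
To make that limitation concrete, here is a minimal sketch of an internal foreign key. The resource and field names (plants, generators, plant_id) are illustrative assumptions, not real PUDL tables; the point is that the reference can only name a resource inside the same package.

```python
# Minimal sketch with assumed, illustrative resource/field names (not real PUDL tables).
# The foreign key's "reference" can only point at a resource in the SAME data package,
# which is the limitation being discussed here.
from datapackage import Package

descriptor = {
    "name": "pudl-example",
    "resources": [
        {
            "name": "plants",
            "path": "data/plants.csv",
            "schema": {
                "fields": [
                    {"name": "plant_id", "type": "integer"},
                    {"name": "plant_name", "type": "string"},
                ],
                "primaryKey": "plant_id",
            },
        },
        {
            "name": "generators",
            "path": "data/generators.csv",
            "schema": {
                "fields": [
                    {"name": "generator_id", "type": "string"},
                    {"name": "plant_id", "type": "integer"},
                ],
                "foreignKeys": [
                    {
                        "fields": "plant_id",
                        "reference": {"resource": "plants", "fields": "plant_id"},
                    }
                ],
            },
        },
    ],
}

package = Package(descriptor)
print(package.valid)  # validates the descriptor metadata, not the CSV contents
```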

zaneselvans commented 5 years ago

From the point of view of the datapackage specification, putting all of these related tables in the same package is clearly the "right" thing to do, but practically I think it would end up being suboptimal in a few ways.

roll commented 5 years ago

I would say that it's not really important how it's stored internally:

The real decision is how to publish it.

zaneselvans commented 5 years ago

Yes... I think the structure of the published information is what we're discussing -- using the big data package internally should totally work, and be convenient given that we're already working with all the data and data packages as a structure. But what is the best way for us to publish many GB of data that's not tightly interrelated, but which users should be able to recombine reliably at will for their own purposes?

roll commented 5 years ago

We've been having a long-running discussion about external foreign keys in the libraries. At the specs level it's a controversial topic, because it somewhat breaks the idea of self-contained data packages. But at the library level, I think it's OK to implement cross-package integrity checks and field resolution (dereferencing). Actually, it should be relatively easy to implement on top of the current implementation - https://github.com/frictionlessdata/datapackage-py/blob/29a9e34a924a4187d9587549123a7d262cacdbf7/datapackage/resource.py#L385 (here we just need to add support for external data packages).

This feature could probably be a really good thing to implement as part of the pilot (cc @lwinfree). Other projects have already asked for it too.
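
For illustration only, here is a hypothetical shape such a cross-package reference might take. The "package" key inside "reference", the URL, and the resource/field names are all assumptions made to sketch the idea; nothing like this is implemented in datapackage-py at the time of this thread.

```python
# Hypothetical sketch only: what an external foreign key MIGHT look like.
# The "package" key, the URL, and the table/field names are assumptions made
# for illustration; this is not currently supported syntax.
generators_schema = {
    "fields": [
        {"name": "generator_id", "type": "string"},
        {"name": "plant_id", "type": "integer"},
    ],
    "foreignKeys": [
        {
            "fields": "plant_id",
            "reference": {
                # Assumed: point at another published data package by URL (or name),
                # then at a resource and field within it.
                "package": "https://example.org/pudl-plants/datapackage.json",
                "resource": "plants",
                "fields": "plant_id",
            },
        }
    ],
}
```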

roll commented 5 years ago

So a data package with an external foreign key would work (if we implement it) exactly the same, checks- and resolution-wise, as internal foreign keys do - https://github.com/frictionlessdata/datapackage-py#foreign-keys

And based on this thread, I think it would fit really well with your plan of publishing small data packages that users can recombine reliably at will for their own purposes.
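
As a usage sketch of what "the same checks/resolution-wise" means, based on the foreign-keys section of the linked README (the descriptor path and resource name are placeholders, and exact behavior may differ by library version):

```python
# Usage sketch based on the datapackage-py README's foreign-keys section.
# "datapackage.json" and the resource name "generators" are placeholders.
from datapackage import Package

package = Package("datapackage.json")
generators = package.get_resource("generators")

# Integrity check: raises a relation error if any plant_id in generators
# has no matching row in the referenced resource.
generators.check_relations()

# Dereferencing: with relations=True, foreign key fields come back resolved
# to the referenced rows.
rows = generators.read(keyed=True, relations=True)
```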