catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

One big package or many small packages? Or Both?! #319

Closed. cmgosnell closed this issue 5 years ago

cmgosnell commented 5 years ago

We have a design decision to make regarding how we bundle up the data packages. For speed and usability, it would be nice to be able to publish data-source-specific packages.

We were thinking it might be good to generate one mega datapackage with everything, and then pull from that package to generate the data-source-specific packages that we actually publish. The main reasoning for this was to ensure that the small packages are generated at the same time (if the cleaning process or underlying data change, we don't want a newly published package talking to an old package).

If we go the many-packages route:

Any thoughts on any of this @lwinfree @roll @zaneselvans??

lwinfree commented 5 years ago

Thanks for the write-up @cmgosnell! From our end, it sounds like creating a mega-datapackage is technically feasible. It is probably also the best option for keeping the datasets 'linked', since our foreign key functionality is currently only supported inside a datapackage (i.e., it isn't supported outside/across multiple datapackages). @roll - any other thoughts? As we said on the call today, this decision can (and should) be made while the work is being done, and is not necessarily blocking at this point.
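
To make that limitation concrete, here is a minimal sketch of an internal foreign key. The resource and field names (plants, generators, plant_id) are illustrative assumptions, not real PUDL tables; the point is that the reference can only name a resource inside the same package.

```python
# Minimal sketch with assumed, illustrative resource/field names (not real PUDL tables).
# The foreign key's "reference" can only point at a resource in the SAME data package,
# which is the limitation being discussed here.
from datapackage import Package

descriptor = {
    "name": "pudl-example",
    "resources": [
        {
            "name": "plants",
            "path": "data/plants.csv",
            "schema": {
                "fields": [
                    {"name": "plant_id", "type": "integer"},
                    {"name": "plant_name", "type": "string"},
                ],
                "primaryKey": "plant_id",
            },
        },
        {
            "name": "generators",
            "path": "data/generators.csv",
            "schema": {
                "fields": [
                    {"name": "generator_id", "type": "string"},
                    {"name": "plant_id", "type": "integer"},
                ],
                "foreignKeys": [
                    {
                        "fields": "plant_id",
                        "reference": {"resource": "plants", "fields": "plant_id"},
                    }
                ],
            },
        },
    ],
}

package = Package(descriptor)
print(package.valid)  # validates the descriptor metadata, not the CSV contents
```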

zaneselvans commented 5 years ago

From the point of view of the datapackage specification, putting all of these related tables in the same package is clearly the "right" thing to do, but practically I think it would end up being suboptimal in a few ways.

roll commented 5 years ago

I would say that it's not really important how it's stored internally:

The real decision is how to publish it.

zaneselvans commented 5 years ago

Yes... I think the structure of the published information is what we're discussing -- using the big data package internally should totally work, and be convenient given that we're already working with all the data and data packages as a structure. But what is the best way for us to publish many GB of data that's not tightly interrelated, but which users should be able to recombine reliably at will for their own purposes?

roll commented 5 years ago

We've been having a long-running discussion about external foreign keys in the libraries. At the specs level it's a controversial topic, because it somewhat breaks the idea of self-contained data packages. But at the library level, I think it's OK to implement cross-package integrity checks and field resolution (dereferencing). Actually, it should be relatively easy to implement on top of the current implementation - https://github.com/frictionlessdata/datapackage-py/blob/29a9e34a924a4187d9587549123a7d262cacdbf7/datapackage/resource.py#L385 (here we just need to add support for external data packages).

This feature could probably be a really good thing to implement as part of the pilot (cc @lwinfree). Other projects have already asked for it too.
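
For illustration only, here is a hypothetical shape such a cross-package reference might take. The "package" key inside "reference", the URL, and the resource/field names are all assumptions made to sketch the idea; nothing like this is implemented in datapackage-py at the time of this thread.

```python
# Hypothetical sketch only: what an external foreign key MIGHT look like.
# The "package" key, the URL, and the table/field names are assumptions made
# for illustration; this is not currently supported syntax.
generators_schema = {
    "fields": [
        {"name": "generator_id", "type": "string"},
        {"name": "plant_id", "type": "integer"},
    ],
    "foreignKeys": [
        {
            "fields": "plant_id",
            "reference": {
                # Assumed: point at another published data package by URL (or name),
                # then at a resource and field within it.
                "package": "https://example.org/pudl-plants/datapackage.json",
                "resource": "plants",
                "fields": "plant_id",
            },
        }
    ],
}
```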

roll commented 5 years ago

So a data package with an external foreign key would work (if we implement it) exactly the same, checks- and resolution-wise, as internal foreign keys do - https://github.com/frictionlessdata/datapackage-py#foreign-keys

And based on this thread, I think it would fit really well with your plan of publishing small data packages that users can recombine reliably at will for their own purposes.
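
As a usage sketch of what "the same checks/resolution-wise" means, based on the foreign-keys section of the linked README (the descriptor path and resource name are placeholders, and exact behavior may differ by library version):

```python
# Usage sketch based on the datapackage-py README's foreign-keys section.
# "datapackage.json" and the resource name "generators" are placeholders.
from datapackage import Package

package = Package("datapackage.json")
generators = package.get_resource("generators")

# Integrity check: raises a relation error if any plant_id in generators
# has no matching row in the referenced resource.
generators.check_relations()

# Dereferencing: with relations=True, foreign key fields come back resolved
# to the referenced rows.
rows = generators.read(keyed=True, relations=True)
```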