HumanCellAtlas / data-store

Design specs and prototypes for the HCA Data Storage System (DSS, "blue box")
https://dss.staging.data.humancellatlas.org/

Support bundles of bundles #575

Open kozbo opened 6 years ago

kozbo commented 6 years ago

Supporting bundles of bundles will require solving a few different problems.

Use case: Analysis bundles produced from multiple input bundles.

ttung commented 6 years ago

Why does this need to be explicit?

mikebaumann commented 6 years ago

I think an analysis bundle produced from multiple input bundles could/would contain multiple files of the same name (e.g. multiple sample.json files, etc.). I don't see how this could be handled by our currently specified index document structure, as the names would collide in the "files" section of the index document (or whatever the common namespace is if "files" is removed). So, at a minimum, I think the overall index document structure we are currently using needs to change. One possibility that has been proposed is to make "files.sample_json" (for example) an array, with the various instances of sample.json becoming elements of this array. If we are using Elasticsearch, then to search an array of objects correctly, these may need to be mapped as nested documents and queried with a nested query. I say "may" because I haven't tried it, but that seems to be suggested by: https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html and https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-nested-query.html.
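
To make the collision and the array alternative concrete, here is a minimal Python sketch. It is an illustration only, not the DSS indexer: the file contents, the `donor` and `organ` fields, and the `index_key` helper are hypothetical, and the mapping and query bodies assume the typeless mapping syntax of recent Elasticsearch versions. Like the comment above, it has not been run against a cluster.

```python
from collections import defaultdict

# Hypothetical files gathered from several input bundles: (name, parsed content).
input_files = [
    ("sample.json", {"donor": "A", "organ": "liver"}),
    ("sample.json", {"donor": "B", "organ": "brain"}),
    ("assay.json", {"method": "smart-seq2"}),
]


def index_key(file_name):
    """Hypothetical helper: turn 'sample.json' into the key 'sample_json'."""
    return file_name.replace(".", "_")


# One object per name: the second sample.json silently overwrites the first.
flat_files = {}
for name, content in input_files:
    flat_files[index_key(name)] = content

# Array per name: every instance of a file name is kept as a list element.
array_files = defaultdict(list)
for name, content in input_files:
    array_files[index_key(name)].append(content)

index_document = {"files": dict(array_files)}

# Assumed mapping body: declare the array as "nested" so that each element
# is indexed as its own hidden document rather than flattened together.
nested_mapping = {
    "mappings": {
        "properties": {
            "files": {
                "properties": {
                    "sample_json": {
                        "type": "nested",
                        "properties": {
                            "donor": {"type": "keyword"},
                            "organ": {"type": "keyword"},
                        },
                    }
                }
            }
        }
    }
}

# Assumed query body: both conditions must hold within the same array element.
nested_query = {
    "query": {
        "nested": {
            "path": "files.sample_json",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"files.sample_json.donor": "A"}},
                        {"term": {"files.sample_json.organ": "brain"}},
                    ]
                }
            },
        }
    }
}
```

With a nested mapping, the query above should not match this document, because no single sample.json has both donor "A" and organ "brain". Without it, Elasticsearch flattens arrays of objects into per-field value lists, so the same two conditions could match across different elements.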

ttung commented 6 years ago

Why not do:

<bundle_uuid_0><bundle_version_0>/sample.json
<bundle_uuid_1><bundle_version_1>/sample.json
<bundle_uuid_2><bundle_version_2>/sample.json

?
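
A minimal Python sketch of the naming scheme suggested above, under stated assumptions: the UUIDs and versions are made up, `qualified_name` is a hypothetical helper, and the separator between UUID and version defaults to empty because the comment does not specify one.

```python
def qualified_name(bundle_uuid, bundle_version, file_name, sep=""):
    """Hypothetical helper: prefix a file name with its source bundle's identity."""
    return f"{bundle_uuid}{sep}{bundle_version}/{file_name}"


# Made-up (uuid, version) pairs standing in for three input bundles.
input_bundles = [
    ("11111111-aaaa-4bbb-8ccc-000000000000", "2018-01-01T000000Z"),
    ("22222222-aaaa-4bbb-8ccc-000000000000", "2018-01-02T000000Z"),
    ("33333333-aaaa-4bbb-8ccc-000000000000", "2018-01-03T000000Z"),
]

# Each input bundle contributes a sample.json; the prefix keeps the names distinct.
names = [
    qualified_name(uuid, version, "sample.json")
    for uuid, version in input_bundles
]

for name in names:
    print(name)
```

The file names stay unique inside the analysis bundle without changing the index document's shape. The tradeoff is that a query for "any sample.json" must then match on a name suffix or pattern instead of a fixed key.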