Open hannes-ucsc opened 3 years ago
TBD: The input subgraph snapshot, or short "input snapshot" could be referenced by UUID or name. See https://github.com/databiosphere/azul/issues/2625#issuecomment-744762785 for more.
@hannes-ucsc spike to outline alternatives (input and analysis bundle in same snapshot).
Here's what I added to the spec but later discarded:
Snapshot provenance
For MVP, all primary and analysis data resides in a single TDR snapshot. Post-MVP data releases will likely consist of multiple snapshots. There might be a snapshot per combination of project, access-control status (public, access controlled) and analysis level (primary vs secondary or tertiary analysis). This will facilitate controlling access to the data as well as accelerate the process of cutting and indexing snapshots. In anticipation of this, an MVP analysis subgraph must declare the snapshot from which its input files were taken from. Without this declaration, the subgraph stitching algorithm would have to look for entities in every subgraph.
Re spike: See my update to the description.
The schema specifies a restrictive pattern for the input_id
property, one that only matches UUIDs, so we can't just prepend the UUID with a snapshot reference. IOW, this requires a change to the schema. The alternatives are
"inputs": [
{
"input_type": "sequence_file",
"input_id": "b4d87b98-4a6e-4fb5-a9bb-2067c87023f9:f0b30770-a65a-429a-b7e1-858a7637f700"
}
]
"inputs": [
{
"input_type": "sequence_file",
"input_id": "hca_dev_20201203___20210219:f0b30770-a65a-429a-b7e1-858a7637f700"
}
]
Pro: No change to the JSON structure, the only schema change required is to extend the pattern that restricts the property value Con: Hacky, needs special parsing rule (not pure JSON) Con: Legacy consumers will fail in less obvious ways (fail to find entity rather than fail to parse)
"inputs": [
{
"input_type": "sequence_file",
"input_id": "f0b30770-a65a-429a-b7e1-858a7637f700",
"snapshot_id": "b4d87b98-4a6e-4fb5-a9bb-2067c87023f9"
}
]
"inputs": [
{
"input_type": "sequence_file",
"input_id": "f0b30770-a65a-429a-b7e1-858a7637f700",
"snapshot_name": "hca_dev_20201203___20210219"
}
]
Pro: Clean, JSONy Pro: Simple schema (adding optional property) Cons: Old readers will fail in less obvious ways (fail to find entity, rather than failing to parse)
"inputs": [
{
"input_type": "sequence_file",
"input_id": {
"entity": "f0b30770-a65a-429a-b7e1-858a7637f700",
"snapshot": "b4d87b98-4a6e-4fb5-a9bb-2067c87023f9"
}
}
]
"inputs": [
{
"input_type": "sequence_file",
"input_id": {
"entity": "f0b30770-a65a-429a-b7e1-858a7637f700",
"snapshot": "hca_dev_20201203___20210219"
}
}
]
Pro: Clean
Pro: Old clients fail in obvious ways
Con: Complex schema (needs to support non-composite variant using oneOf
)
Contrary to what I mentioned in stand-up, absence of this feature will not cause CGMs (neither hacky nor organic) to disappear if they coexist with access controlled (meta)data in the same project. We can put CGM subgraphs and access-controlled subgraphs in separate snapshots, even without this feature. This feature is only needed to index subgraphs with DCP/2-generated matrices, because only those involve stitching.
@theathorn
Discussed with Kathleen and supporting DCP/2-generated matrices in a mixed open/controlled-access project is not a requirement for the initial phase. If we have a project with DCP/2-generated matrices, we can start by making the whole project controlled access.
Just to confirm: Until we implement this feature, we are OK with preventing public access to DCP/2-generated matrices for access-controlled projects?
Correct. Unauthenticated users should be able to see the project description for an access-controlled project but would need to apply for (and be granted) access to see any other data in the project, including the DCP/2-generated matrices.
@hannes-ucsc "Recording an idea: have the importer or some other Monster component rewrite the entity references in links.json to include the snapshot name. That might be difficult because the table content would have to be rewritten before the snapshot is cut".
In
links.json
of analysis subgraphs, anyinput_id
referring tosequence_file
entities from input subgraphs must refer to the snapshot the input subgraph is part of. The pipeline adapter gets the snapshot reference from the hand-over manifest, an additional column implemented as part of #2625.Update: This is only needed if the analysis subgraph is in a different snapshot compared to its inputs. That will need to be the case if we want to make DCP/2-generated matrices for access-controlled projects publicly accessible.