DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0
7 stars 2 forks source link

Amend spec to allow fully qualified reference to process inputs #2626

Open hannes-ucsc opened 3 years ago

hannes-ucsc commented 3 years ago

In links.json of analysis subgraphs, any input_id referring to sequence_file entities from input subgraphs must refer to the snapshot the input subgraph is part of. The pipeline adapter gets the snapshot reference from the hand-over manifest, an additional column implemented as part of #2625.

Update: This is only needed if the analysis subgraph is in a different snapshot compared to its inputs. That will need to be the case if we want to make DCP/2-generated matrices for access-controlled projects publicly accessible.

hannes-ucsc commented 3 years ago

TBD: The input subgraph snapshot, or short "input snapshot" could be referenced by UUID or name. See https://github.com/databiosphere/azul/issues/2625#issuecomment-744762785 for more.

melainalegaspi commented 3 years ago

@hannes-ucsc spike to outline alternatives (input and analysis bundle in same snapshot).

hannes-ucsc commented 3 years ago

Here's what I added to the spec but later discarded:

Snapshot provenance

For MVP, all primary and analysis data resides in a single TDR snapshot. Post-MVP data releases will likely consist of multiple snapshots. There might be a snapshot per combination of project, access-control status (public, access controlled) and analysis level (primary vs secondary or tertiary analysis). This will facilitate controlling access to the data as well as accelerate the process of cutting and indexing snapshots. In anticipation of this, an MVP analysis subgraph must declare the snapshot from which its input files were taken from. Without this declaration, the subgraph stitching algorithm would have to look for entities in every subgraph.

melainalegaspi commented 3 years ago

See https://github.com/DataBiosphere/azul/issues/2627#issuecomment-807543807

hannes-ucsc commented 3 years ago

Re spike: See my update to the description.

The schema specifies a restrictive pattern for the input_id property, one that only matches UUIDs, so we can't just prepend the UUID with a snapshot reference. IOW, this requires a change to the schema. The alternatives are

String concatenation

    "inputs": [
        {
            "input_type": "sequence_file",
            "input_id": "b4d87b98-4a6e-4fb5-a9bb-2067c87023f9:f0b30770-a65a-429a-b7e1-858a7637f700"
        }
    ]
    "inputs": [
        {
            "input_type": "sequence_file",
            "input_id": "hca_dev_20201203___20210219:f0b30770-a65a-429a-b7e1-858a7637f700"
        }
    ]

Pro: No change to the JSON structure, the only schema change required is to extend the pattern that restricts the property value Con: Hacky, needs special parsing rule (not pure JSON) Con: Legacy consumers will fail in less obvious ways (fail to find entity rather than fail to parse)

Sibling property

    "inputs": [
        {
            "input_type": "sequence_file",
            "input_id": "f0b30770-a65a-429a-b7e1-858a7637f700",
            "snapshot_id": "b4d87b98-4a6e-4fb5-a9bb-2067c87023f9"
        }
    ]
    "inputs": [
        {
            "input_type": "sequence_file",
            "input_id": "f0b30770-a65a-429a-b7e1-858a7637f700",
            "snapshot_name": "hca_dev_20201203___20210219"
        }
    ]

Pro: Clean, JSONy Pro: Simple schema (adding optional property) Cons: Old readers will fail in less obvious ways (fail to find entity, rather than failing to parse)

Composite property

    "inputs": [
        {
            "input_type": "sequence_file",
            "input_id": {
                "entity": "f0b30770-a65a-429a-b7e1-858a7637f700",
                "snapshot": "b4d87b98-4a6e-4fb5-a9bb-2067c87023f9"
            }
        }
    ]
    "inputs": [
        {
            "input_type": "sequence_file",
            "input_id": {
                "entity": "f0b30770-a65a-429a-b7e1-858a7637f700",
                "snapshot": "hca_dev_20201203___20210219"
            }
        }
    ]

Pro: Clean Pro: Old clients fail in obvious ways Con: Complex schema (needs to support non-composite variant using oneOf)

hannes-ucsc commented 3 years ago

Contrary to what I mentioned in stand-up, absence of this feature will not cause CGMs (neither hacky nor organic) to disappear if they coexist with access controlled (meta)data in the same project. We can put CGM subgraphs and access-controlled subgraphs in separate snapshots, even without this feature. This feature is only needed to index subgraphs with DCP/2-generated matrices, because only those involve stitching.

@theathorn

theathorn commented 3 years ago

Discussed with Kathleen and supporting DCP/2-generated matrices in a mixed open/controlled-access project is not a requirement for the initial phase. If we have a project with DCP/2-generated matrices, we can start by making the whole project controlled access.

hannes-ucsc commented 3 years ago

Just to confirm: Until we implement this feature, we are OK with preventing public access to DCP/2-generated matrices for access-controlled projects?

theathorn commented 3 years ago

Correct. Unauthenticated users should be able to see the project description for an access-controlled project but would need to apply for (and be granted) access to see any other data in the project, including the DCP/2-generated matrices.

theathorn commented 2 years ago

@hannes-ucsc "Recording an idea: have the importer or some other Monster component rewrite the entity references in links.json to include the snapshot name. That might be difficult because the table content would have to be rewritten before the snapshot is cut".