HumanCellAtlas / dcp2

Shared artifacts concerning the Human Cell Atlas (HCA) Data Coordination Platform (DCP)

Data Browser shows meaningless strings in Analysis Protocol facet drop-down #51

Open hannes-ucsc opened 2 years ago

hannes-ucsc commented 2 years ago

image

Note the analysis_protocol_1 entry.

hannes-ucsc commented 2 years ago

DCP/2 analysis uses protocol_core.protocol_id to identify the workflow, as in optimus_v4.2.3. That's the only meaningful property they populate; the rest is boilerplate. Here's an example:

{
    "computational_method": "Optimus",
    "describedBy": "https://schema.humancellatlas.org/type/protocol/analysis/9.1.0/analysis_protocol",
    "protocol_core": {
        "protocol_id": "optimus_v4.2.3"
    },
    "provenance": {
        "document_id": "54e9804d-958d-584f-aa66-243bcedff6dd",
        "submission_date": "2021-04-17T08:07:00.000000Z",
        "update_date": "2021-04-17T08:07:00.000000Z"
    },
    "schema_type": "protocol",
    "type": {
        "text": "analysis_protocol"
    }
}

Azul indexes that property and the Data Browser displays it in various places. Using an ID for human-readable display is hacky, but it is what we agreed on during DCP/1.

When organically described CGMs came along, the wranglers used generic sequential IDs for the protocols that are attached to the appropriate process instance. The use of generic (meaningless) IDs has been a long-standing practice for many wrangler-allocated IDs, e.g. biomaterial_1, cell_suspension_3. Nothing wrong with that either, except when the lab already allocated IDs, in which case the wranglers should use the lab-allocated IDs instead of minting their own.

It's just that these two practices started to collide when organic CGMs were introduced.

Here's an example of an organic CGM analysis protocol:

{
    "describedBy": "https://schema.humancellatlas.org/type/protocol/analysis/9.2.0/analysis_protocol",
    "schema_type": "protocol",
    "protocol_core": {
        "protocol_id": "analysis_protocol_1",
        "protocol_name": "Cellranger",
        "protocol_description": "The 10X Genomics Cell Ranger pipeline (version 5.0.1) was used to perform sample demultiplexing, alignment to the hg38 human reference genome (refdata-gex-GRCh38-2020-A, 10x Genomics), barcode/UMI processing, and gene counting for each cell."
    },
    "type": {
        "text": "data transformation",
        "ontology": "OBI:0200000",
        "ontology_label": "data transformation"
    },
    "computational_method": "Cellranger mkfastq",
    "matrix": {
        "data_normalization_methods": [
            "other"
        ],
        "derivation_process": [
            "alignment"
        ]
    },
    "provenance": {
        "document_id": "ea6ce706-6c92-4b80-8522-55b86b676083",
        "submission_date": "2021-08-11T22:51:23.658Z",
        "update_date": "2021-08-11T22:51:29.099Z",
        "schema_major_version": 9,
        "schema_minor_version": 2
    }
}

hannes-ucsc commented 2 years ago

The solution is to reconcile the differences. The protocol_id that's used for organic CGMs is too generic to have any utility because it doesn't allow metadata consumers to correlate analysis protocols between projects. If two projects use the same protocol, the corresponding protocol_id values should be identical. As things are right now, one could be analysis_protocol_1 and the other could be analysis_protocol_2, hiding the identity. Worse, if two different protocols were used (CellRanger and FooBar) both could easily use analysis_protocol_1, falsely indicating identity.
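To make the two failure modes concrete, here is a minimal sketch that groups analysis protocols by protocol_id across projects. The project names and the records list are made up for illustration; only the ID and tool values come from the discussion above.

```python
from collections import defaultdict

# Hypothetical (project, protocol_id, actual tool) records, for illustration only.
records = [
    ("project_a", "analysis_protocol_1", "CellRanger"),
    ("project_b", "analysis_protocol_2", "CellRanger"),
    ("project_c", "analysis_protocol_1", "FooBar"),
]

def group(records, key_index, value_index):
    """Map each key to the set of distinct values seen with it."""
    groups = defaultdict(set)
    for record in records:
        groups[record[key_index]].add(record[value_index])
    return dict(groups)

# Same ID, different tools: the ID falsely indicates identity.
tools_by_id = group(records, 1, 2)
false_identity = {pid: t for pid, t in tools_by_id.items() if len(t) > 1}

# Same tool, different IDs: the ID hides identity.
ids_by_tool = group(records, 2, 1)
hidden_identity = {tool: i for tool, i in ids_by_tool.items() if len(i) > 1}
```

With these records, analysis_protocol_1 ends up covering both CellRanger and FooBar, while CellRanger is split across analysis_protocol_1 and analysis_protocol_2.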

Wranglers should use meaningful values for protocol_id and coordinate with each other to use those values consistently across projects.

DCP/2 analyses should populate protocol_name with a human-readable, unique name for the protocol.

TL;DR:

In the above organic CGM protocol example, protocol_id should be cellranger_5.0.1 and the protocol_name should be CellRanger 5.0.1. In the DCP/2 analysis protocol example, protocol_name should be Optimus 4.2.3. Azul and DB should switch to using protocol_name.
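As a rough sketch of that convention (not an agreed-upon implementation), both values could be derived from a tool name and a version. The function name and the lowercase-plus-underscore rule are assumptions inferred from the cellranger_5.0.1 example; the existing DCP/2 IDs use a v prefix (optimus_v4.2.3), which this sketch does not reproduce.

```python
from typing import Tuple

def protocol_identifiers(tool: str, version: str) -> Tuple[str, str]:
    """Return (protocol_id, protocol_name) for a tool and version.

    Assumed convention: the ID is the lowercased tool name with spaces
    replaced by underscores, followed by "_" and the version; the name is
    the tool name and the version separated by a space.
    """
    protocol_id = f"{tool.lower().replace(' ', '_')}_{version}"
    protocol_name = f"{tool} {version}"
    return protocol_id, protocol_name

cellranger_id, cellranger_name = protocol_identifiers("CellRanger", "5.0.1")
```

Because the ID is derived only from the tool and version, two projects that used the same protocol would end up with the same protocol_id without having to coordinate a counter.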

hannes-ucsc commented 2 years ago

@nikellepetrillo @ami-day

nikellepetrillo commented 2 years ago

@hannes-ucsc We are a little worried about the space in the protocol name "Optimus 4.2.3". Do you foresee any issues with that? If the space works for you, that is fine for us; I just wanted to call attention to it.

hannes-ucsc commented 2 years ago

We will treat protocol_core.protocol_name as the user-friendly, human-readable form of protocol_core.protocol_id, so a space should be used there to separate words. I do not foresee any issues with that.
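For illustration, a consumer could pick the display value roughly like this. This is a sketch, not Azul's or the Data Browser's actual code; the function name and the fallback to protocol_id for documents that lack protocol_name are assumptions.

```python
def display_label(protocol: dict) -> str:
    """Prefer the human-readable protocol_name, else fall back to protocol_id."""
    core = protocol.get("protocol_core", {})
    # Assumed fallback: older documents may only carry protocol_id.
    return core.get("protocol_name") or core.get("protocol_id", "")

# Trimmed-down versions of the two example documents from this issue.
dcp2_protocol = {"protocol_core": {"protocol_id": "optimus_v4.2.3"}}
organic_protocol = {
    "protocol_core": {
        "protocol_id": "analysis_protocol_1",
        "protocol_name": "Cellranger",
    }
}

labels = [display_label(dcp2_protocol), display_label(organic_protocol)]
```

Since the label is only displayed and never parsed, a space in protocol_name is harmless here.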