Cell suspension is required to get sequencing id for running pipelines, but is not a required metadata field

kbergin commented 5 years ago

See conversation starting here in slack: https://humancellatlas.slack.com/archives/C75G0R97E/p1534977101000100

Tony notes that there is no requirement for cell_suspension to be in every primary bundle. The pipelines rely on the metadata api for "sequencing_input" to run, which is obtained by the library using the not-required field cell_suspension. See code line here: https://github.com/HumanCellAtlas/metadata-api/blob/1b7192cecbef43b5befecc4153bf2e2f4db5bb16/src/humancellatlas/data/metadata/__init__.py#L579

Not safe to rely on this in the long term, it would seem.

hannes-ucsc commented 5 years ago

From the Slack convo: This is the bundle in question:

17ef531b-1bb7-425d-bbf7-32721242dde7.2018-08-17T203538.886280Z

hannes-ucsc commented 5 years ago

The big question is this: If there is no cell suspension, what is being sequenced? In the example bundle mentioned above, it is a specimen_from_organism that's being sequenced, which I find surprising.

malloryfreeberg commented 5 years ago

@hannes-ucsc for biomaterials that are already "fluids" - e.g. blood - there is a possibility that the entity path might be:

donor_organism -> [collection_protocol, enrichment_protocol] -> specimen_from_organism (peripheral blood mononuclear cells, or PBMCs) -> [library_preparation_protocol, sequencing_protocol] -> sequencing_file

In this case, the specimen_from_organism is already dissociated and data contributors might not supply metadata about the cell suspension.

There is a larger issue about making ultimate biomaterial assumptions in the context of imaging and future assays. There will most likely never be cell suspensions prior to an imaging assay; the metadata team is actually working on a new biomaterial type specific to the imaged specimen (name TBD). So, I don't think the DCP should make assumptions about which biomaterial type is the ultimate type prior to any sequencing or imaging or future assay.

kbergin commented 5 years ago

I would agree that making assumptions would not be a good idea, and what biomaterial it is doesn't currently matter for our pipelines anyways. The goal is to get the sequencing input in the best way (most reliable, fewest assumptions I guess?). If that's what you've outlined above, Mallory, than that seems like a good plan! @malloryfreeberg

tburdett commented 5 years ago

This logic is actually at the core of how the ingestion service generates bundles, and is covered nicely by this document. A primary assay bundle is one that constitutes a single "assay" - defined as any process which has a biomaterial as it's input and a file as it's output. This covers sequencing or imaging assays, and makes no assumptions about either the types of files that the assay generated or the type of inputs e.g. what was sequenced.

If this client followed a similar logic to get the ID of the "thing being sequenced", the only baked-in field lookup would be against the biomaterial_id from the input biomaterial. This field is in the biomaterial_core module, designed to be highly stable (and which hasn't changed since June).

tburdett commented 5 years ago

Actually, I should have gone further with my previous comment - whilst the biomaterial_core module was last modified in June, the biomaterial_core.biomaterial_id field has been stable since at least Feb 22

rexwangcc commented 5 years ago

@tburdett Thanks for the explanation here, I get most of them, except one which really confused me now, and maybe that is because I'm lacking in metadata domain knowledge here(correct me if I'm wrong):

the biomaterial_core.biomaterial_id field has been stable since at least Feb 22

If that's the case, I cannot understand why we had to make so many changes in the past few months:

Initially, from Dec last year to March, we got the required sample_id by : sample_id = assay_json['has_input']: https://github.com/HumanCellAtlas/pipeline-tools/blob/f4004dcbeea7cdc2cafa48816ca28b6a7b066914/adapter_pipelines/ss2_single_sample/adapter.wdl#L29
And then, in v4(4.6.1), we got the sample_id in another way fromassay.json: 'assay.json': https://github.com/HumanCellAtlas/pipeline-tools/blob/afca52edcfe2e9bb3ae285bbcb45079efbb058ed/pipeline_tools/input_utils.py#L72
In v5, we got the sample_id in another way fromlinks.json: 'links.json': https://github.com/HumanCellAtlas/pipeline-tools/blob/afca52edcfe2e9bb3ae285bbcb45079efbb058ed/pipeline_tools/input_utils.py#L67
In v6, we need to get the: source_id: https://github.com/HumanCellAtlas/pipeline-tools/blob/94ff74ee977a5d85f4231230d58a2d367d813198/pipeline_tools/input_utils.py#L45
And then we switched to get the sample_id by getting the cell_suspensionbiomaterial's document uuid, until now.

Some of the changes were for backward compatiblility, but in general, I'm confused if we are talking about the same sample_id or not? Looking at the document now, but feel free to correct me if I was wrong here!

tburdett commented 5 years ago

Hi Rex - no, you're right, we're indeed talking about the same thing. No biology domain knowledge required here! I'll leave aside the v4 changes; v4 to v5 was a very big breaking reorganisation.

The field itself biomaterial_core.biomaterial_id has then been stable since v5.0.0 (see https://schema.humancellatlas.org/core/biomaterial/5.0.0/biomaterial_core, that was released on 22nd Feb). I will concede though, the mechanism for actually getting to this field has changed given the links.json modifications that happened between v5 and v6; although IIRC those changes were an optimisation to help you grab the source_id more easily, and the v5 way should have continued to work (I think!).

I'm not sure of the rationale for the last change you describe to using cell_suspension. Can you explain why that was needed? The previous logic (using source_id from links.json) should work exactly as it did before... and actually I think uses that method (get uuid of input biomaterial) that I described above.

rexwangcc commented 5 years ago

Thanks! @tburdett I have a better understanding now, just wanted to make sure we are on the same page 😸

I didn't perform the last update in Green Box, so I cannot boldly assert why we didn't keep using source_id from links.json. However, I guess, one of the reasons is:

Each time we receive a notification(contains Bundle UUID and Bundle version) from DSS, based on the ElasticSearch query we had there, we will start a workflow and add some labels to the workflow, e.g. project_shortname, submitter_id, sample_id.

At this step, we want to minimize HTTP requests to DSS and keep Blue-> Green as single direction notifier-listener communication. So We are utilizing a feature DSS provided, which allows us to append additional queries ((like this: https://github.com/HumanCellAtlas/lira/blob/master/scripts/v6_queries/metadata_attachments.json)) along with the subscription query, so that DSS will include that additional metadata information into the notification, in addition to the bundle-uuid and bundle-version.

For some reason, the ES query is not as powerful as Python code, so I guess that's why we ended up with using the cell_suspension(which can be specified in ES query more easily than traversing links) in the query and in the python code consistently.

Above is just my inference, but I think @samanehsan can confirm?

Edited:

The information in this comment is actually misleading, added more in the latest comment now.

tburdett commented 5 years ago

OK, this is actually super handy info. I've not had a lot of exposure to the ES query API and this is a feature I wasn't aware of - it makes sense that you would exploit it in this way, although obviously it introduces the cell suspension assumption.

We really need a big hackathon session on this stuff I think. We created this issue this morning to try to navigate some of the confusion and it seems like a long session on this at the next face to face might be in order...

rexwangcc commented 5 years ago

Sorry, by revisiting our code base and thinking it more, I want to take part of my words back for clarification, indeed, we have been using the source_id+links approach since v5 schema for pipeline inputs.

I lied because I was also confused by all of the 3 places which were getting the sample_id:

Pipeline inputs preparation tools of Green Box (the actual inputs go into pipelines): Python code using source_id
Listener of Green Box(just to label the workflows, no effects on pipeline results): ES query using cell suspension assumption
This metadata-api library Python code using cell suspension assumption

Green Box has been using the first one to prepare inputs for SmartSeq2 pipeline, which I now think is the right way because it doesn't make any assumptions on biomaterials. The drawback is that for backward compatibility, as you can see from the link above, we have more than ~300 lines of code for just detecting the schema version and getting the right sample_id for each version.

Since we have this handy metadata-api recently, we were trying hard to use this library(and at the same time drop the support for v4) instead of keeping maintaining the ~300 lines of code. Besides this library, we also introduced the cell suspension assumption in the Listener's labeling ES query, which seems not like a good idea now.

~So, if using the source_id from links.json is the ultimate elixir here to get the sequencing input, I would re-think about the approach proposed in the pending PR #14 : https://github.com/HumanCellAtlas/metadata-api/pull/14/files#diff-fd2c027c58c7cefe444e40648aa02d20R581~ I compared the old approach which uses links.json and the proposed way, both returned the same info, but the proposed way in the PR is much more favorable.

What do you think? @tburdett @hannes-ucsc

hannes-ucsc commented 5 years ago

Thank you @malloryfreeberg and @tburdett for the insight. Based on Tony's comment above I created alternative PR #17 in addition to PR #14. I am not sure which alternative is preferred and tagged everyone for another round of reviews. Accept the one you think is preferable.

samanehsan commented 5 years ago

For some reason, the ES query is not as powerful as Python code, so I guess that's why we ended up with using the cell_suspension(which can be specified in ES query more easily than traversing links) in the query and in the python code consistently.

Yes, just confirming here that we were not able to get the sample_id from the links.json with one query, so we changed it to get the document id of the cell suspension biomaterial file. Since any biomaterial can be the sequencing input, ~I'm not sure what this query should look like now.~ the new query in the metadata attachments looks like this: https://github.com/HumanCellAtlas/lira/blob/master/scripts/vx_queries/metadata_attachments.json#L8

HumanCellAtlas / metadata-api

Cell suspension is required to get sequencing id for running pipelines, but is not a required metadata field #13