ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

Incorrect biomaterial mapping (cell suspension <-> file) #513

Open pnejad opened 3 years ago

pnejad commented 3 years ago

Description of the task:

@hannes-ucsc pointed out that these 4 datasets have at least 1 specimen which was the sequencing input.

https://data.humancellatlas.org/explore/projects/6c040a93-8cf8-4fd5-98de-2297eb07e9f6 (EBI)
https://data.humancellatlas.org/explore/projects/71eb5f6d-cee0-4297-b503-b1125909b8c7 (EBI)
https://data.humancellatlas.org/explore/projects/c4077b3c-5c98-4d26-a614-246d12c2e5d7 (EBI)
https://data.humancellatlas.org/explore/projects/7adede6a-0ab7-45e6-9b67-ffe7466bec1f (UCSC)

7adede6a-0ab7-45e6-9b67-ffe7466bec1f - The specimen ID was used accidentally instead of the cell suspension ID in the sequence file tab. @rachadele will update this and resubmit the dataset.
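A check along these lines could catch this kind of mix-up before submission. It is a minimal sketch, not the ingest validator: the row layout, the `input_id` and `file_name` keys, and the function name `find_misreferenced_inputs` are hypothetical stand-ins for the spreadsheet tabs.

```python
def find_misreferenced_inputs(sequence_file_rows, cell_suspension_ids, specimen_ids):
    """Return (file_name, input_id, reason) for each row whose input is not a known cell suspension."""
    problems = []
    for row in sequence_file_rows:
        input_id = row["input_id"]
        if input_id in cell_suspension_ids:
            continue  # row correctly points at a cell suspension
        if input_id in specimen_ids:
            reason = "specimen ID used as input"
        else:
            reason = "unknown ID"
        problems.append((row["file_name"], input_id, reason))
    return problems


# Hypothetical rows from a sequence file tab.
rows = [
    {"file_name": "sample1_R1.fastq.gz", "input_id": "cell_suspension_1"},
    {"file_name": "sample2_R1.fastq.gz", "input_id": "specimen_2"},
]
problems = find_misreferenced_inputs(
    rows,
    cell_suspension_ids={"cell_suspension_1", "cell_suspension_2"},
    specimen_ids={"specimen_1", "specimen_2"},
)
for file_name, input_id, reason in problems:
    print(f"{file_name}: {input_id} ({reason})")
```

Separating "specimen ID used as input" from "unknown ID" keeps the accidental swap described above distinct from a plain typo.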

For the other 3 datasets, I was not able to download the submitted spreadsheets from ingest (Error - Service Unavailable) to troubleshoot. I did find other things that need to be fixed based on the info shown on the data portal project pages. We can discuss this further during our next wrangler call.

rachadele commented 3 years ago

I'm not able to download the spreadsheet for 7adede6a-0ab7-45e6-9b67-ffe7466bec1f either.

Wkt8 commented 3 years ago

@ESapenaVentura can you confirm that the 3 EBI datasets were mixed bulk RNA-seq and hence were intentionally set to have specimens linking into sequencing input?

ESapenaVentura commented 3 years ago

@Wkt8 I can confirm the 3 EBI datasets were Bulk + single cell RNA seq!

hannes-ucsc commented 3 years ago

> intentionally set to have specimens linking into sequencing input

The specimens are the sequencing input.

pnejad commented 3 years ago

@hannes-ucsc cell suspensions are not limited to single cells. So even for bulk experiments, the specimen is processed into suspensions of cells before the bulk RNA-seq is carried out.

pnejad commented 3 years ago

[Image: experimental_setup_quake]

hannes-ucsc commented 3 years ago

What I meant was that in the current metadata graphs for these projects, the specimen_from_organism entities are the sequencing input, instead of being linked to the sequencing input. Whether the current metadata correctly describes the experiment in reality is another question, one that you all, the wranglers, need to agree on. We can't have these types of ~projects~ experiments modeled one way by one team, and another way by another team.
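To make the distinction concrete, here is a toy check over a simplified metadata graph. It is only a sketch under stated assumptions: each link is reduced to a hypothetical `input_type`/`output_type` pair, and the function name `non_suspension_sequencing_inputs` is made up for illustration; the real metadata graphs are richer than this.

```python
def non_suspension_sequencing_inputs(projects):
    """Map each project to the non-cell-suspension entity types that feed directly into sequence files."""
    flagged = {}
    for project_id, links in projects.items():
        input_types = {
            link["input_type"]
            for link in links
            if link["output_type"] == "sequence_file"
            and link["input_type"] != "cell_suspension"
        }
        if input_types:
            flagged[project_id] = sorted(input_types)
    return flagged


# Hypothetical, heavily simplified graphs for two projects.
projects = {
    "project_a": [
        {"input_type": "specimen_from_organism", "output_type": "cell_suspension"},
        {"input_type": "cell_suspension", "output_type": "sequence_file"},
    ],
    "project_b": [
        # The specimen itself is the sequencing input here.
        {"input_type": "specimen_from_organism", "output_type": "sequence_file"},
    ],
}
flagged = non_suspension_sequencing_inputs(projects)
print(flagged)
```

In `project_a` the specimen is linked to the sequencing input through a cell suspension; in `project_b` the specimen is the sequencing input, which is the pattern in question.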

hannes-ucsc commented 3 years ago

These are the only four projects that have a sequencing input that is not a cell suspension. If 1) consensus is that even in these projects a cell suspension was actually used as the sequencing input and 2) the metadata for these projects is updated to reflect that, and 3) consensus is that sequencing input has to be a cell suspension for all types of experiments, then we can remove the concept of sequencing input. I initially introduced it based on these statements by Mallory and Tony (@tburdett):

https://github.com/HumanCellAtlas/metadata-api/issues/13#issuecomment-415337446
https://github.com/HumanCellAtlas/metadata-api/issues/13#issuecomment-415564276

willrockout commented 3 years ago

@Wkt8 @ESapenaVentura Why would you set specimen as sequencing input for bulk data? A cell suspension is just a pool of cells that would still need to be created for library prep, even in bulk. The only difference is that they didn't separate them into single cells, but that's how we handle 10x data.

pnejad commented 3 years ago

I do not agree with Mallory's blood example. Blood that is collected from a donor still needs to be enriched for PBMCs and put into cell suspensions before the library prep protocol is carried out.

I do agree with her statement that "...there will most likely never be cell suspensions prior to an imaging assay". But then again, I don't think an imaged specimen would be the sequencing_input. Wranglers - I'm not an expert when it comes to imaging assays, so please correct me if I'm wrong here.

pnejad commented 3 years ago

The same data being modelled differently by different submitters (wranglers) has been on my mind lately. Right now we have submitters from the EBI, UCSC, and Lattice teams. I would not be surprised if there are more datasets in the DCP with inconsistent metadata. I think this will increase as the number of submitters to the DCP increases over time.

It would be really helpful if there was a way to flag these inconsistencies (maybe via ingest during submission? QA process/team?) and to revisit our wrangling guides frequently to make sure all teams are aligned. Thoughts or suggestions @gabsie @tburdett?

hannes-ucsc commented 3 years ago

Off topic, but one way to achieve consistency is to review submissions across teams, just like peer reviews of PRs on Github.

hannes-ucsc commented 3 years ago

@ami-day

ami-day commented 2 years ago

@ESapenaVentura is going to test this

ESapenaVentura commented 2 years ago

@ami-day what? we didn't discuss this on stand-up

ofanobilbao commented 1 year ago

@ESapenaVentura is this still required? Do you know?

ESapenaVentura commented 1 year ago

This is still needed