ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

EGAS00001004344 - LungCellAtlas2020 #889

Closed ofanobilbao closed 1 year ago

ofanobilbao commented 2 years ago

Project short name:

HumanLungCellAtlas2020

Primary Wrangler:

Ami --> Ida

Secondary Wrangler:

Arsenios

Associated files

Ami's note: Response from authors about bulk data: "The immune cells for bulk mRNA sequencing were from an entirely different source (the blood was purchased from https://allcells.com/).". I have decided to not include this bulk data, as it is not from the same sample types as the single cells.

I have also decided not to include the mouse data from Tabula Muris Senis (the authors also let us know that almost all the mouse data is from that dataset). I have created a separate project for Tabula Muris Senis in ingest, as the main publication is separate.

Link to Ingest:

Published study links

Key Events

ami-day commented 2 years ago

Request new cell type ontology "immune cell": https://github.com/HumanCellAtlas/ontology/issues/120

ami-day commented 2 years ago

Challenging dataset: there are far too many cell barcodes to link to either new cell suspensions (like an experiment accession) or an analysis file within a single spreadsheet cell in Excel or Google Sheets. We need to find a programmatic way to do the linking in ingest.

ami-day commented 2 years ago

Note about cell type ontologies: bvarner-ebi and addiehl suggest that, rather than 'immune cell', I could use 'leukocyte', and that 'immune cell' should be added as a related synonym to the ontology term.

ami-day commented 2 years ago

Found a workaround to link all cell suspensions

ami-day commented 2 years ago

Emailed the authors about mouse single cell seq data and bulk data.

ami-day commented 2 years ago

Info from Kyle: "Ah I think I understand where the confusion is coming from. The accession in the paper only includes FASTQs for 608 SmartSeq2 cells specific to the study. The remainder are from the Tabula Muris Senis are available on that data portal. Of the 608, there are 1 set of FASTQs for the “Tbx4-Cre > ZsGreen1” cells (tagged with KJT) and 2 (across 2 sequencing lanes) sets for the “Axin2-CreER > mTmG” cells (tagged with ANN). 224 KJT + 2 * (384 ANN) = 992 sets of FASTQs. The cell IDs have the form “P-16-CGCTCAGT-TTATGCGA_ANNS256” (after concatenating the lanes). This represents [The 384 well it was sorted into][The i5 and i7 indices][the sorter, Ahmad versus myself][the bcl2fastq index during de-multiplexing]."
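For reference, the cell ID layout and the FASTQ counts Kyle describes can be sketched in Python. This is a rough sketch only: the regular expression is inferred from the single example ID in the quote, and the field names (`well_row`, `sorter_tag`, `demux_index`, and so on) are hypothetical labels, not names used by the authors. It assumes the sorter tag is exactly `ANN` or `KJT` and that the bcl2fastq index has the form `S<number>`.

```python
import re
from typing import NamedTuple


class CellId(NamedTuple):
    well_row: str        # row letter of the 384-well plate (A-P)
    well_column: int     # column number of the 384-well plate
    index_pair: str      # the two index sequences, kept as written in the ID
    sorter_tag: str      # "ANN" or "KJT" (who sorted the cell)
    demux_index: str     # bcl2fastq sample index, e.g. "S256"


# Assumed layout, inferred from one example ID only.
CELL_ID_PATTERN = re.compile(
    r"^(?P<row>[A-P])-(?P<col>\d{1,2})-"
    r"(?P<indices>[ACGTN]+-[ACGTN]+)_"
    r"(?P<sorter>ANN|KJT)(?P<demux>S\d+)$"
)


def parse_cell_id(cell_id: str) -> CellId:
    """Split a cell ID into the four parts described in the quote."""
    match = CELL_ID_PATTERN.match(cell_id)
    if match is None:
        raise ValueError(f"Unexpected cell ID format: {cell_id!r}")
    return CellId(
        well_row=match["row"],
        well_column=int(match["col"]),
        index_pair=match["indices"],
        sorter_tag=match["sorter"],
        demux_index=match["demux"],
    )


# Counts taken from the quote: KJT cells have one FASTQ set each,
# ANN cells have two (one per sequencing lane).
KJT_CELLS = 224
ANN_CELLS = 384
ANN_LANES = 2

total_cells = KJT_CELLS + ANN_CELLS              # cells specific to the study
fastq_sets = KJT_CELLS + ANN_LANES * ANN_CELLS   # FASTQ sets before concatenating lanes

example = parse_cell_id("P-16-CGCTCAGT-TTATGCGA_ANNS256")
```

IDs that do not match the assumed layout raise a `ValueError`, which makes it easy to spot entries that need manual checking.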

ami-day commented 2 years ago

Linking data files to samples in spreadsheet: on-going

ami-day commented 2 years ago

Requested NCBI data delivery (fastq)

ami-day commented 2 years ago

Also waiting on test data: cell suspension as input to cell_suspension

ami-day commented 2 years ago

504bec8b-733e-41d1-a5c3-5b19289036cd

ami-day commented 2 years ago

Dataset exceeds maximum cell suspension count - error in ingest. Creating a ticket ebi-ait/dcp-ingest-central#869 for this.

ami-day commented 2 years ago

@ESapenaVentura will be working on the technical side of this with @amnonkhen

idazucchi commented 2 years ago
What I know so far of this dataset:

Organism Technique Specimens Cell suspensions Selection method Sequence file Analysis file
Human 10x 7 17 MACS 77
Human SS2 8 9409 MACS + FACS 1
Mouse 10x 16 63 1
Mouse SS2 2 608 - only 525 passed QC but fastqs are available for all cells
Human 9 9426
Mouse 18 671 at most

Mouse 10x data actually comes from Tabula Muris Senis

idazucchi commented 2 years ago

@idazucchi and @ESapenaVentura will discuss this today

idazucchi commented 1 year ago

I updated the metadata for human 10x and uploaded the analysis files to the hca-util area.

Working on human SS2 data --> too many cell suspensions to link to one analysis file, it hits the character limit for Excel --> workaround is pooling the cell suspensions based on the plate (so waiting on #927).

For the protocol I plan on applying the live cell selection, but I welcome other suggestions.
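A minimal sketch of the pooling workaround, in Python. Everything here is illustrative: the ID format, the `||` delimiter, the 384-well plate size and the helper names are assumptions made for the example, not the actual spreadsheet contents. The only fixed number is Excel's limit of 32,767 characters per cell.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

EXCEL_CELL_CHAR_LIMIT = 32_767  # maximum number of characters Excel stores in one cell
DELIMITER = "||"                # assumed separator between linked IDs


def fits_in_one_cell(ids: Iterable[str], delimiter: str = DELIMITER) -> bool:
    """Return True if the joined IDs stay within Excel's per-cell limit."""
    return len(delimiter.join(ids)) <= EXCEL_CELL_CHAR_LIMIT


def pool_by_plate(
    ids: Iterable[str], plate_of: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group cell suspension IDs by the plate that plate_of() reports for each."""
    pools: Dict[str, List[str]] = defaultdict(list)
    for cell_suspension_id in ids:
        pools[plate_of(cell_suspension_id)].append(cell_suspension_id)
    return dict(pools)


# Hypothetical IDs: 9409 cell suspensions spread over 384-well plates.
suspension_ids = [
    f"cell_suspension_plate{n // 384:02d}_well{n % 384:03d}" for n in range(9409)
]

# Linking every suspension individually overflows a single cell ...
too_long = not fits_in_one_cell(suspension_ids)

# ... while linking one pooled suspension per plate does not.
pools = pool_by_plate(suspension_ids, plate_of=lambda s: s.split("_well")[0])
pooled_ok = fits_in_one_cell(pools.keys())
```

With these made-up IDs the analysis file would link to one pooled entry per plate instead of thousands of individual cell suspensions.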

idazucchi commented 1 year ago

Issues

I am sending an email to sort out this information, hopefully I will get a reply soon

anu-shiva commented 1 year ago

@idazucchi Contacting the collaborators again

ofanobilbao commented 1 year ago

This is not on the high priority list, so while we wait it could be deprioritised.

ofanobilbao commented 1 year ago

@idazucchi to chase one last time and close if no response

idazucchi commented 1 year ago

I emailed the authors again; if I don't get a reply in one month, I'll close this ticket.

idazucchi commented 1 year ago

We got a reply! I'll work on this dataset from next week; I'm trying to close the ones I have open at the moment.

idazucchi commented 1 year ago

Open questions for review

idazucchi commented 1 year ago

Mouse SS2: the cell IDs from the metadata CSV have been truncated. To match them up to the sequence file, you need to discard the decimal digit for the plate well row.

arschat commented 1 year ago

Starting the secondary review.

arschat commented 1 year ago

Very well done Ida, on such an extensive dataset!

Donor

Cell suspension

Analysis files

All file and biomaterial mappings are verified. Awesome work!

idazucchi commented 1 year ago

exported and filled the import form!

idazucchi commented 1 year ago

verified in the browser