Mapping the developing human immune system across organs

ami-day commented 2 years ago

project short name

DevelopingImmuneSystem

Primary wrangler

Ami

Secondary wrangler

Ida

Ingest

https://contribute.data.humancellatlas.org/projects/detail?uuid=fcaa53cd-ba57-4bfe-af9c-eaa958f95c1a&tab=project

submission https://contribute.data.humancellatlas.org/submissions/detail?uuid=8c987aea-34b5-4f17-ba4d-cc6726730638&project=fcaa53cd-ba57-4bfe-af9c-eaa958f95c1a

Publication:

preprint Mapping the developing human immune system across organs

Data

E-MTAB-11341 10X Genomics Visium
E-MTAB-11343 scRNA-seq
E-MTAB-11388 scVDJ-seq

Google Sheet:

Latest (De-duped samples): https://docs.google.com/spreadsheets/d/1lSEiH_ZS-H8xtTOxJ9NTq8WhXIiqI8pKvFUcZhakq84/edit

1st version https://docs.google.com/spreadsheets/d/1GJbl-UOWNXvkcV7Qsx0em7x-DBaqR8pxPJ9p6v0oOi4/edit#gid=1259194338

ami-day commented 2 years ago

Emma has sent the ArrayExpress login details for the scRNA-seq and scVDJ-seq datasets. The 10X visium log-in will be sent soon.

ami-day commented 2 years ago

Moving this to stalled and out of release 14 because we need additional info from the authors, and they have not made their data public yet (I have a private ArrayExpress log-in).

ami-day commented 2 years ago

Emma emailed to say they are updating the data and adding new metadata. I said we can aim to get it submitted for release 17 (May release).

ofanobilbao commented 2 years ago

@ami-day why is this in stalled? It does not have any labels to explain why. If it is stalled, are we still aiming for Release 17?

ami-day commented 2 years ago

They aren't ready to submit it yet. They are generating more data.

gabsie commented 2 years ago

Hey @ami-day , @ofanobilbao - can we check whether these people are ready now? We just saw this advertised in the HCA opening slides. Then we can maybe prioritise for this release.

ofanobilbao commented 2 years ago

@gabsie I've moved it to Wrangling for @ami-day to prioritise

ami-day commented 2 years ago

I had an email exchange with Emma (author). We decided it would be best to wait until they have the ArrayExpress accessions and those datasets published, so that I can use that metadata as a template. When we spoke they had some data ready but not the full project. Once they get back to me to confirm, I will move forward with this ticket.

gabsie commented 2 years ago

Thanks, @ami-day - it might be nice to remind them. :) Say we have seen the slide at the HCA meeting, in case they have forgotten about us.

ami-day commented 2 years ago

The new data is now ready to curate (Emma emailed about it).

MightyAx commented 2 years ago

@idazucchi to secondary review

ESapenaVentura commented 2 years ago

@idazucchi is almost ready with the secondary review

idazucchi commented 2 years ago

Hi Ami, I'm done with the secondary review! Let me know if you want to discuss anything

Project

the project cell count reflects the cell count of the h5ad file, which is fine. However, the h5ad includes 3 donors that are not included in ArrayExpress, and the cell count is lower than what is reported in the paper (908,178). I think it would be worth checking with the authors why the donors are not included in ArrayExpress. If they are rightfully part of the study, we can add the donors and link their cell suspensions directly to the h5ad file
grant missing
- Wellcome 108413/A/15/D
- ERC Consolidator Grant ThDEFINE (646794)
- NIHR Research Professorship (RP-2017-08-ST2-002)

Donor

would it be possible to obtain the HDBR accessions for the donors?
Donors F34 F64 F72 F73 F78 have sex and age info available in the supplementary table S7
I know that the authors annotated all donors as late embryonic stage, but I think that only donors aged 7-8W can be classified as late embryonic stage. From week 9 onwards all donors are in the fetal stage, according to http://www.dxline.info/diseases/fetal-development:

The end of the eighth week marks the end of the "embryonic period" and the beginning of the "fetal period."
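
The week-8 cut-off quoted above can be sketched as a small helper. This is only an illustration of the rule of thumb: the function name and the plain "embryonic" / "fetal" labels are hypothetical and are not the ontology terms used in the spreadsheet, and it assumes the age is given in whole weeks.

```python
def broad_development_stage(age_weeks: int) -> str:
    """Classify a donor by age in whole weeks.

    Assumption for illustration: weeks up to and including 8 count as
    embryonic, week 9 onwards counts as fetal.
    """
    if age_weeks < 0:
        raise ValueError("age_weeks must not be negative")
    return "embryonic" if age_weeks <= 8 else "fetal"


# Ages here are example inputs, not values taken from the donor tab.
stages = {weeks: broad_development_stage(weeks) for weeks in (7, 8, 9, 12)}
```

With this rule, a 7W or 8W donor stays in the embryonic group and anything from 9W onwards moves to fetal.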

Specimen

scRNA seq
- I disagree with how the specimens are modelled. I would represent each organ from a given donor with a single specimen because there is no distinguishable feature between the different specimens.
- I would move the sample accessions to the cell suspensions because the information they capture is more similar to our cell suspension than our specimens
VDJ:
- the tissue for specimens FCA_gut8090115, Human_colon_16S8157851 and Human_colon_16S8157867 should be mesenteric lymph node, not colon
- the tissue for specimens FCA_gut8090111, FCA_gut8090112, FCA_gut8090113, FCA_gut8090114, FCA_gut8090115, FCA_gut8090116, FCA_gut8090117, FCA_gut8090118 and FCA_gut809011` should be gut
- some specimens look like they are shared between VDJ and scRNA seq, for example F41 kidney or F45 spleen. I would model them with one specimen/ cell suspension and then apply the different library prep
visium:
- I disagree with how the specimens are modelled. I would represent each organ from a given donor with a single specimen because there is no distinguishable feature between the different specimens. So 10 specimens and then 15 imaged specimens
- I would remove the sample accessions and move them to the imaged specimen tab

Collection protocol

I would remove “All tissues were processed into single cell suspensions immediately upon receipt.” from collection_protocol_fetal_tissue since it’s used for visium specimens as well, which are not dissociated

Cell line

the mouse cell line has no input specimen and there is no mouse specimen nor donor
please add the induction protocol for the iPSC
for the mouse cell line you could add a dissociation protocol
you could add the publication info for each cell line
you could fill out the cell type / tissue type

Organoids

from what I understood of the organoid protocol each human cell line is mixed with the mouse cell line. I would model this with 2 organoids, one for each cell line, 2 cell suspensions and pool them at the library preparation step.
the ncbi_taxon_id should be human + mouse for all organoids
the enrichment protocols enrichment_protocol_facs_ATOs_weekX can be removed from the organoid tab and applied at the cell suspension level

Enrichment protocol

the DAPI- enrichment step can be recorded with the cell viability method in the cell suspension tab, so you could potentially get rid of the DAPI- markers / enrichment protocols

Cell suspension

organoids:
- the ncbi_taxon_id should be human + mouse
- all organoids should have the enrichment_protocol_cell_size_ATOs enrichment protocol
- 6180STDY9448808_cells should have enrichment protocol enrichment_protocol_facs_ATOs_week3 and a CD45+ enrichment protocol
- you could add the sample accessions
scRNAseq
- FCAImmP7851896_cells should have an enrichment protocol CD137+ not CD45+
- I would model this with 55 cell suspensions, some of the ones that exist now have no distinguishing feature and are likely sequencing replicates. For example FCAImmP7803020 and FCAImmP7803021 have the same enrichment and the same donor + tissue.
vdj
- some cell suspensions have no distinguishing feature and are probably sequencing replicates. Like FCA_gut8090111 FCA_gut8090112 and FCA_gut8090113
- some cell suspensions should be shared with scRNAseq like FCAImmP7528290 from scRNA and FCAImmP7607593 from vdj, they come from the same donor, sample and have the same enrichment step
- FCAImmP7851889_cells should have enrichment_protocol_MAIT enrichment protocol

Imaging preparation protocol

I would split the imaging protocol in two, based on the different permeabilisation times, so that in the future it will be easier to add the field for permeabilisation time which we are adding to the schema

Imaged specimen

you could add the sample accessions
the imaged specimens are triplicated

Imaging protocol

you could fill in the microscope setup description

Image file

the image files are triplicated
I would remove the experiment accessions because they only point to the sequence files

Library preparation

vdj: the umi offset should be 16 for both library_preparation_TCR and library_preparation_Ig
visium: the umi barcode length should be 12

Sequencing protocol

sequencing_protocol_visium should use tag based single cell RNA sequencing

Sequence file

the content ontology term and label are empty
all the vdj files are triplicated in the spreadsheet - very easy to fix but how did it happen?

Analysis file

is there a particular reason you didn’t include the h5ad files available here?
PAN.A01.v01.raw_count.20210429.PFI.embedding.h5ad and PAN.A01.v01.entire_data_normalised_log.20210429.full_obs.annotated.clean.csv are linked to cell suspensions for organoids, Visium and VDJ, but the files are only for scRNAseq
you could add the cell count

ami-day commented 2 years ago

Hi Ami, I'm done with the secondary review! Let me know if you want to discuss anything

Thank you @idazucchi, I have made most of your suggested updates. However, some things I decided to keep the same, so I am adding my comments on those below.

General

You mentioned a few times disagreeing with the modelling of the specimens and cell suspensions. I believe in this particular study, there are unique samples derived from the same organ type and donor. If they are processed in the same way, they are still unique samples. This especially makes sense in light of 10X Visium samples taken at different spatial locations but can apply to scRNA-Seq too. I would prefer to keep all the sample IDs linked to biosample accessions as they were initially curated and linked in the SCEA MAGE-TAB files.

Donor

would it be possible to obtain the HDBR accessions for the donors?

I had a look and I don't see these in the donor supplementary material, or the HDBR website. Have you been able to find the HDBR accessions for a dataset in the past?

Specimen

the tissue for specimens FCA_gut8090111, FCA_gut8090112, FCA_gut8090113, FCA_gut8090114, FCA_gut8090115, FCA_gut8090116, FCA_gut8090117, FCA_gut8090118 and FCA_gut809011` should be gut

Selecting the appropriate organ ontology is difficult here. As I understand it, "gut" is not an organ but a system of organs. However, the metadata has been annotated at the level of the gut, so in this case I think you're right to select gut. I updated it to "gastrointestinal system".

Cell line

please add the induction protocol for the iPSC

In this case, I think it is not necessary (I remember having a conversation about this with someone previously). The hIPSC project cell lines are well known and well established, the associated publication is referenced, and the authors obtained the cell lines in the iPSC state (as opposed to running the iPSC induction protocol themselves).

you could fill out the cell type / tissue type

I think it doesn't make sense to do this for iPSC cell lines. They do not reflect a differentiated tissue or cell type other than iPSC.

Sequencing protocol

sequencing_protocol_visium should use tag based single cell RNA sequencing

I thought 10X visium is typically applied at the bulk level to a small set of cells in a specific spatial location. Which bit of information did you find that suggested the 10X visium was at the level of single cell?

Analysis file

is there a particular reason you didn’t include the h5ad files available here?

I'm not sure what you mean here. The PAN.A01.v01.raw_count.20210429.PFI.embedding.h5ad and 'PAN.A01.v01.raw_count.20210429.PFI.embedding.h5ad' files were downloaded from https://developmental.cellatlas.io/fetal-immune. Do you mean, why didn't I download the spatial and VDJ h5ad files? I'm not sure why I didn't do this. I will add them now.

ami-day commented 2 years ago

ecf1dc81-0ff3-4f81-b927-99decb910c5a

ami-day commented 2 years ago

Graph validating.

ami-day commented 2 years ago

Syncing 2 missing files then will re-graph validate

ami-day commented 2 years ago

Submitted.

MightyAx commented 2 years ago

This export is failing; investigation notes follow:

190 of the 217 export jobs are in the error queue.
The exports are spread out over the last week
Looking at the logs for the last 7 days there are 190 timeout exceptions waiting for the GCS data transfer

MightyAx commented 2 years ago

Cause Identified:

The data file sync was completed (subject to manual verification), but the exporter job that was waiting for the completion timed out and never recorded in ingest that the data file process was finished.

All other pods have been waiting for ingest to be updated with the state of the data file synchronisation before starting any metadata synchronisation.

If the pods were started now, they would still wait, because ingest will never be updated with the success of the data file synchronisation job.
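
The failure mode described above can be reproduced in a toy simulation. This is not the exporter's code: every name here (`wait_then_record`, `transfer_finishing_after`, the `ingest_state` dict, the poll counts) is hypothetical, and polling is counted in attempts rather than wall-clock time so the sketch is deterministic. It only shows how a transfer can succeed while the completion flag is never written.

```python
class TransferTimeout(Exception):
    """Raised when the waiter gives up before the transfer finishes."""


def wait_then_record(is_transfer_done, state, max_polls):
    """Poll the transfer; record completion only if it finishes in time."""
    for _ in range(max_polls):
        if is_transfer_done():
            state["isDataTransferComplete"] = True
            return
    # The waiter is the only place that sets the flag, so giving up
    # here means the flag stays False forever.
    raise TransferTimeout("gave up waiting for the data transfer")


def transfer_finishing_after(polls_needed):
    """Return a fake status check that turns True after polls_needed calls."""
    calls = {"n": 0}

    def is_done():
        calls["n"] += 1
        return calls["n"] > polls_needed

    return is_done


ingest_state = {"isDataTransferComplete": False}
is_done = transfer_finishing_after(polls_needed=5)

try:
    wait_then_record(is_done, ingest_state, max_polls=3)
    timed_out = False
except TransferTimeout:
    timed_out = True

# The transfer itself still completes shortly afterwards...
transfer_finished_eventually = any(is_done() for _ in range(3))
# ...but nothing is left to write that fact back to the shared state.
```

After this runs, the transfer has finished but `ingest_state` still says it has not, which is the state every other waiting job keeps reading.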

Operational Fix:

[x] Manually verify the data file sync is complete
[x] Update ingest that the data file sync is complete
[x] Requeue the errored jobs

Optional Software fixes:

[ ] Investigate why log messages from terra_exporter.py are not written to the console

Preventing the issue in future

[ ] Why does metadata synchronisation wait for the eventually consistent data file sync to finish?

MightyAx commented 2 years ago

Manually verifying the data file sync is complete

Expected files: 603

➜ gsutil ls -r "gs://broad-dsp-monster-hca-prod-ebi-storage/prod/fcaa53cd-ba57-4bfe-af9c-eaa958f95c1a/data/" | sed -e 's/.*\.//' | sort | uniq -c                                    
  1 csv
579 gz
  9 h5ad
  3 sh
 15 tiff
  1 txt

csv + gz + h5ad + tiff = 604

There's an extra h5ad, some sh files and a txt file in the payload of the data file sync that aren't in the project. I think either these need to be added to the project, or they are extra to the project and we can continue with exporting. Files that are in the data export but not in the project (which is probably fine):

# Probably a duplicate of Visium10X_data_LI.h5ad
gs://broad-dsp-monster-hca-prod-ebi-storage/prod/fcaa53cd-ba57-4bfe-af9c-eaa958f95c1a/data/Visium10X_data_LI (1).h5ad
gs://broad-dsp-monster-hca-prod-ebi-storage/prod/fcaa53cd-ba57-4bfe-af9c-eaa958f95c1a/data/download.sh
gs://broad-dsp-monster-hca-prod-ebi-storage/prod/fcaa53cd-ba57-4bfe-af9c-eaa958f95c1a/data/download2.sh
gs://broad-dsp-monster-hca-prod-ebi-storage/prod/fcaa53cd-ba57-4bfe-af9c-eaa958f95c1a/data/download3.sh
gs://broad-dsp-monster-hca-prod-ebi-storage/prod/fcaa53cd-ba57-4bfe-af9c-eaa958f95c1a/data/tmp.txt
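
The tally above can be double-checked with a few lines of Python. This is a sketch over the counts copied from the `gsutil` output, not a re-run of the listing; the variable names and the choice of which extensions count as project files are assumptions made for the check.

```python
from collections import Counter

# Per-extension counts copied from the gsutil listing in this comment.
listed = Counter({"csv": 1, "gz": 579, "h5ad": 9, "sh": 3, "tiff": 15, "txt": 1})

EXPECTED_FILES = 603  # number of files expected for the project
PROJECT_EXTENSIONS = {"csv", "gz", "h5ad", "tiff"}  # assumed project file types

# Files whose extension matches something the project contains.
project_like = sum(n for ext, n in listed.items() if ext in PROJECT_EXTENSIONS)

# How many project-like files exceed the expected count.
surplus_project_like = project_like - EXPECTED_FILES

# Files with extensions the project does not contain at all.
stray = {ext: n for ext, n in listed.items() if ext not in PROJECT_EXTENSIONS}

# Everything in the bucket that the project does not account for.
unaccounted = sum(listed.values()) - EXPECTED_FILES

print(project_like, surplus_project_like, stray, unaccounted)
```

The one surplus project-like file plus the four stray files matches the five paths listed above.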

MightyAx commented 2 years ago

Update ingest that the data file sync is complete

PATCH /exportJobs/62e268915cc3b03957ee94f3 HTTP/1.1
Authorization: Bearer <snipped_for_security>
Content-Type: application/json
User-Agent: PostmanRuntime/7.29.2
Accept: */*
Cache-Control: no-cache
Postman-Token: bc4f6dcf-e408-497d-8430-1e0abf16221a
Host: api.ingest.archive.data.humancellatlas.org
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Content-Length: 97

{
"context": {
"totalAssayCount": 217,
"isDataTransferComplete": true
}
}
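
For anyone who prefers a script to Postman, the same request can be sketched with the Python standard library. Assumptions: the API is reached over `https` (the capture above only shows the `Host` header), the helper name `build_data_transfer_patch` is made up for this example, and the token is a placeholder. The request is only built here, not sent.

```python
import json
import urllib.request

# Host taken from the request above; the https scheme is an assumption.
INGEST_API = "https://api.ingest.archive.data.humancellatlas.org"


def build_data_transfer_patch(export_job_id, total_assay_count, token):
    """Build (but do not send) the PATCH that marks the data transfer complete."""
    body = json.dumps(
        {
            "context": {
                "totalAssayCount": total_assay_count,
                "isDataTransferComplete": True,
            }
        }
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{INGEST_API}/exportJobs/{export_job_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_data_transfer_patch("62e268915cc3b03957ee94f3", 217, "<token>")
# To actually send it with a real token: urllib.request.urlopen(request)
```

Keeping the build step separate from the send step makes it easy to inspect the URL and body before touching production.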

MightyAx commented 2 years ago

Requeue the errored jobs

I've put the 190 export jobs back on the ingest.terra.experiments.new queue, which is now busy with other exports (the queue is 450 experiments long).

Theoretically no manual intervention will be required: when all the messages are processed, the submission should be updated with the export process as you would expect.

MightyAx commented 2 years ago

This one was just stuck behind a project that is "actually" stuck. Freeing up the export queue and moving just this project's exports across has made this export succeed.

ami-day commented 2 years ago

Submitted import form.

ebi-ait / hca-ebi-wrangler-central