ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

Mapping the developing human immune system across organs #634

Closed ami-day closed 2 years ago

ami-day commented 2 years ago

project short name

DevelopingImmuneSystem

Primary wrangler

Ami

Secondary wrangler

Ida

Ingest

https://contribute.data.humancellatlas.org/projects/detail?uuid=fcaa53cd-ba57-4bfe-af9c-eaa958f95c1a&tab=project

submission https://contribute.data.humancellatlas.org/submissions/detail?uuid=8c987aea-34b5-4f17-ba4d-cc6726730638&project=fcaa53cd-ba57-4bfe-af9c-eaa958f95c1a

Publication:

preprint Mapping the developing human immune system across organs

Data

Google Sheet:

Latest (De-duped samples): https://docs.google.com/spreadsheets/d/1lSEiH_ZS-H8xtTOxJ9NTq8WhXIiqI8pKvFUcZhakq84/edit

1st version https://docs.google.com/spreadsheets/d/1GJbl-UOWNXvkcV7Qsx0em7x-DBaqR8pxPJ9p6v0oOi4/edit#gid=1259194338

ami-day commented 2 years ago

Emma has sent the ArrayExpress login details for the scRNA-seq and scVDJ-seq datasets. The 10X visium log-in will be sent soon.

ami-day commented 2 years ago

Moving this to stalled and out of release 14 because we need additional info from the authors, and they have not made their data public yet (I have a private ArrayExpress log-in).

ami-day commented 2 years ago

Emma emailed to say they are updating the data and adding new metadata. I said we can aim to get it submitted for release 17 (May release).

ofanobilbao commented 2 years ago

@ami-day why is this in stalled? It does not have any labels to explain why. If it is stalled, are we still aiming for Release 17?

ami-day commented 2 years ago

They aren't ready to submit it yet. They are generating more data.

gabsie commented 2 years ago

Hey @ami-day , @ofanobilbao - can we check whether these people are ready now? We just saw this advertised in the HCA opening slides. Then we can maybe prioritise for this release.

ofanobilbao commented 2 years ago

@gabsie I've moved it to Wrangling for @ami-day to prioritise

ami-day commented 2 years ago

I had an email exchange with Emma (author), and we decided it would be best to wait until they have the ArrayExpress accessions and those datasets published, so that I can use that metadata as a template. When we spoke they had some data ready but not the full project. When they get back to me to confirm, I will move forward with this ticket.

gabsie commented 2 years ago

Thanks, @ami-day - it might be nice to remind them. :) Say we have seen the slide at the HCA meeting, in case they have forgotten about us.

ami-day commented 2 years ago

The new data is now ready to curate (Emma emailed about it).

MightyAx commented 2 years ago

@idazucchi to secondary review

ESapenaVentura commented 2 years ago

@idazucchi is almost ready with the secondary review

idazucchi commented 2 years ago

Hi Ami, I'm done with the secondary review! Let me know if you want to discuss anything

Project

Donor

Specimen

Collection protocol

Cell line

Organoids

Enrichment protocol

Cell suspension

Imaging preparation protocol

Imaged specimen

Imaging protocol

Image file

Library preparation

Sequencing protocol

Sequence file

Analysis file

ami-day commented 2 years ago

Hi Ami, I'm done with the secondary review! Let me know if you want to discuss anything

Thank you @idazucchi, I have made most of your suggested updates. However, some things I decided to keep the same, so I am adding my comments on those below.

General

You mentioned a few times that you disagree with the modelling of the specimens and cell suspensions. I believe that in this particular study there are unique samples derived from the same organ type and donor. Even if they are processed in the same way, they are still unique samples. This especially makes sense in light of the 10X Visium samples taken at different spatial locations, but it can apply to scRNA-Seq too. I would prefer to keep all the sample IDs linked to biosample accessions as they were initially curated and linked in the SCEA MAGE-TAB files.

Donor

would it be possible to obtain the HDBR accessions for the donors?

I had a look and I don't see these in the donor supplementary material, or the HDBR website. Have you been able to find the HDBR accessions for a dataset in the past?

Specimen

the tissue for specimens FCA_gut8090111, FCA_gut8090112, FCA_gut8090113, FCA_gut8090114, FCA_gut8090115, FCA_gut8090116, FCA_gut8090117, FCA_gut8090118 and FCA_gut809011 should be gut

Selecting the appropriate organ ontology term is difficult here. As I understand it, "gut" is not an organ but a system of organs. However, the metadata has been annotated at the level of the gut, so in this case I think you're right to select gut; I updated it to "gastrointestinal system".

Cell line

please add the induction protocol for the iPSC

In this case, I think it is not necessary (I remember having a conversation about this with someone previously). The hIPSC project cell lines are well known and well established, the associated publication is referenced, and the authors obtained the cell lines in the iPSC state (as opposed to running the iPSC induction protocol themselves).

you could fill out the cell type / tissue type

I think it doesn't make sense to do this for iPSC cell lines. They do not reflect a differentiated tissue or cell type other than iPSC.

Sequencing protocol

sequencing_protocol_visium should use tag based single cell RNA sequencing

I thought 10X visium is typically applied at the bulk level to a small set of cells in a specific spatial location. Which bit of information did you find that suggested the 10X visium was at the level of single cell?

Analysis file

is there a particular reason you didn’t include the h5ad files available here?

I'm not sure what you mean here. The PAN.A01.v01.raw_count.20210429.PFI.embedding.h5ad and 'PAN.A01.v01.raw_count.20210429.PFI.embedding.h5ad' files were downloaded from https://developmental.cellatlas.io/fetal-immune. Do you mean, why didn't I download the spatial and VDJ h5ad files? I'm not sure why I didn't do this; I will add them now.

ami-day commented 2 years ago

ecf1dc81-0ff3-4f81-b927-99decb910c5a

ami-day commented 2 years ago

Graph validating.

ami-day commented 2 years ago

Syncing 2 missing files then will re-graph validate

ami-day commented 2 years ago

Submitted.

MightyAx commented 2 years ago

This export is failing; investigation notes follow:

MightyAx commented 2 years ago

Cause Identified:

The data file sync was completed (subject to manual verification), but the exporter job that was waiting for the completion timed out and never recorded in ingest that the data file process had finished.

All other pods have been waiting for ingest to be updated with the state of the data file synchronisation before starting any metadata synchronisation.

If the pods were started now, they would still wait, because ingest will never be updated with the success of the data file synchronisation job.
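
To illustrate the failure mode described above, here is a minimal sketch of a wait loop that records an outcome even when it times out, so that downstream workers are never left waiting on a state that is never written. This is not the ingest exporter's actual code: the function name, the `record_state` callback and the state strings `COMPLETE` and `TIMED_OUT` are all hypothetical.

```python
import time


def wait_for_transfer(is_done, record_state, timeout_s, poll_s=0.01):
    """Poll is_done() until it returns True or timeout_s elapses.

    The outcome is recorded in both cases (hypothetical state names),
    so nothing downstream waits forever on a missing update.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_done():
            record_state("COMPLETE")
            return True
        time.sleep(poll_s)
    # One last check, in case the transfer finished during the final sleep.
    if is_done():
        record_state("COMPLETE")
        return True
    record_state("TIMED_OUT")
    return False


# Example: a transfer that never finishes still leaves a recorded state.
states = []
wait_for_transfer(lambda: False, states.append, timeout_s=0.05)
print(states)
```

With a recorded timeout state, an operator (or a retry job) could tell a timed-out wait apart from one that is still in progress.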

Operational Fix:

Optional Software fixes:

Preventing the issue in future

MightyAx commented 2 years ago

Manually verifying the data file sync is complete

Expected files: 603

➜ gsutil ls -r "gs://broad-dsp-monster-hca-prod-ebi-storage/prod/fcaa53cd-ba57-4bfe-af9c-eaa958f95c1a/data/" | sed -e 's/.*\.//' | sort | uniq -c
  1 csv
579 gz
  9 h5ad
  3 sh
 15 tiff
  1 txt

csv + gz + h5ad + tiff = 604
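
The tally above can also be checked in a few lines of Python. This is only a sketch: `count_by_extension` is a hypothetical helper that mimics the `sed | sort | uniq -c` pipeline on a list of object paths, and the set of extensions treated as belonging to the project is taken from the sum above. The counts are the ones reported in the gsutil output.

```python
from collections import Counter
from pathlib import PurePosixPath


def count_by_extension(object_paths):
    """Count object paths by their final extension (hypothetical helper)."""
    return Counter(PurePosixPath(p).suffix.lstrip(".") for p in object_paths)


# Counts as reported in the gsutil output above.
reported = Counter({"csv": 1, "gz": 579, "h5ad": 9, "sh": 3, "tiff": 15, "txt": 1})

# Extensions summed in the comment above (csv + gz + h5ad + tiff).
project_extensions = {"csv", "gz", "h5ad", "tiff"}
expected_files = 603

in_project_types = sum(n for ext, n in reported.items() if ext in project_extensions)
print(in_project_types)                   # 604
print(in_project_types - expected_files)  # 1
```

The difference of 1 means one file with a project extension is surplus to the 603 expected files.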

There's an extra h5ad, three sh files and a txt file in the payload of the data file sync that aren't in the project. I think either these need to be added to the project, or they are extra to the project and we can continue with exporting. Files that are in the data export but not in the project (which is probably fine):

# Probably a duplicate of Visium10X_data_LI.h5ad
gs://broad-dsp-monster-hca-prod-ebi-storage/prod/fcaa53cd-ba57-4bfe-af9c-eaa958f95c1a/data/Visium10X_data_LI (1).h5ad
gs://broad-dsp-monster-hca-prod-ebi-storage/prod/fcaa53cd-ba57-4bfe-af9c-eaa958f95c1a/data/download.sh
gs://broad-dsp-monster-hca-prod-ebi-storage/prod/fcaa53cd-ba57-4bfe-af9c-eaa958f95c1a/data/download2.sh
gs://broad-dsp-monster-hca-prod-ebi-storage/prod/fcaa53cd-ba57-4bfe-af9c-eaa958f95c1a/data/download3.sh
gs://broad-dsp-monster-hca-prod-ebi-storage/prod/fcaa53cd-ba57-4bfe-af9c-eaa958f95c1a/data/tmp.txt

MightyAx commented 2 years ago

Update ingest that the data file sync is complete

PATCH /exportJobs/62e268915cc3b03957ee94f3 HTTP/1.1
Authorization: Bearer <snipped_for_security>
Content-Type: application/json
User-Agent: PostmanRuntime/7.29.2
Accept: */*
Cache-Control: no-cache
Postman-Token: bc4f6dcf-e408-497d-8430-1e0abf16221a
Host: api.ingest.archive.data.humancellatlas.org
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Content-Length: 97

{
    "context": {
        "totalAssayCount": 217,
        "isDataTransferComplete": true
    }
}
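
For anyone who needs to repeat this without Postman, the same request can be sketched with the Python standard library. This is an illustration under assumptions, not a supported tool: `build_export_job_patch` is a hypothetical helper, the `https` scheme is assumed (the host and path come from the request above), and the token is a placeholder. The request is only built here, not sent.

```python
import json
import urllib.request

# Host and path are taken from the request above; the https scheme is an assumption.
INGEST_API = "https://api.ingest.archive.data.humancellatlas.org"


def build_export_job_patch(export_job_id, total_assay_count, token):
    """Build (but do not send) the PATCH that marks the data transfer complete."""
    body = {
        "context": {
            "totalAssayCount": total_assay_count,
            "isDataTransferComplete": True,
        }
    }
    return urllib.request.Request(
        url=f"{INGEST_API}/exportJobs/{export_job_id}",
        data=json.dumps(body).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",  # placeholder token
            "Content-Type": "application/json",
        },
    )


request = build_export_job_patch("62e268915cc3b03957ee94f3", 217, "<token>")
print(request.get_method(), request.full_url)
# Sending is left out on purpose; urllib.request.urlopen(request) would perform the call.
```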

MightyAx commented 2 years ago

Requeue the errored jobs

I've put the 190 export jobs back on the ingest.terra.experiments.new queue, which is now busy with other exports (the queue is 450 experiments long).

Theoretically no manual intervention will be required: when all the messages are processed, the submission should be updated with the export process, as you would expect.

MightyAx commented 2 years ago

This one was just stuck behind a project that is "actually" stuck. Freeing up the export queue and moving just this project's exports across has made this export complete successfully.

ami-day commented 2 years ago

Submitted import form.