GSE158702 - Spatiotemporal Analysis of Human Intestinal Development at Single Cell Resolution

ESapenaVentura commented 3 years ago

Primary Wrangler: Irene

Secondary Wrangler: TBD

Associated files:

Google Drive: https://drive.google.com/drive/folders/1bO1XfeFXYQJ7Hk0p364-O9DDsqS-upaB

Published study links

Paper: https://www.sciencedirect.com/science/article/pii/S009286742031686X#sec3

Accessioned data: GSE158702

Ingest: https://contribute.data.humancellatlas.org/projects/detail?uuid=fa3f460f-4fb9-4ced-b548-8ba6a8ecae3f

Key Events

[x] convert published metadata to HCA spreadsheet
[x] manually curate dataset to meet HCA metadata standard
[x] Upload sheet to validate metadata
[x] Check linking using ingest graph validator
[x] Transfer files to ingest to validate data files
[x] Ask the Secondary Wrangler for an end-to-end review of the project. Ask the Expertise Wrangler to review specific tabs if needed
[x] Submit dataset to Production

ESapenaVentura commented 3 years ago

Impossible to demux without submitter input - Runs are missing index files

Wkt8 commented 3 years ago

Meeting with @Wkt8 @ESapenaVentura and @ami-day to talk about demultiplexing.

ami-day commented 2 years ago

@ESapenaVentura @gabsie shall we close this ticket and move it to "ineligible" status in ingest, because we can't accept multiplexed data without the demultiplexing info?

Wkt8 commented 2 years ago

Happy for this to happen - or to move it to the "stalled" status. Either way we can't move forwards on this.

ipediez commented 2 years ago

Following the information provided by 10X Genomics, having no index sequenced can happen 1) by accident or 2) when someone only loads a single sample in the lane and reasonably figures they don't need to sequence the sample index. Nevertheless, it's always a good idea to sequence the indexes for QC reasons.

Although this looks like an HCA paper, many samples seem to have been sequenced in lane 3, judging by the FASTQ file naming convention.
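For reference, the lane can be read straight off the file name when the files follow the usual Illumina/bcl2fastq pattern (`<sample>_S<n>_L<lane>_<read>_001.fastq.gz`). This is only a minimal sketch: the example file name is made up, and it assumes the GEO/SRA files kept the original naming.

```python
import re

# Illumina bcl2fastq-style name: <sample>_S<n>_L<lane>_<R1|R2|I1|I2>_001.fastq(.gz)
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>[RI][12])_001\.fastq(\.gz)?$"
)


def parse_fastq_name(filename):
    """Return sample, lane and read type, or None if the name does not match."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        return None
    return {
        "sample": match["sample"],
        "lane": int(match["lane"]),
        "read": match["read"],
    }


def has_index_read(filenames):
    """True if any file in the run is an index read (I1 or I2)."""
    parsed = (parse_fastq_name(name) for name in filenames)
    return any(p is not None and p["read"].startswith("I") for p in parsed)


# Hypothetical file names, for illustration only.
run_files = ["AAQ_ti_S1_L003_R1_001.fastq.gz", "AAQ_ti_S1_L003_R2_001.fastq.gz"]
print(parse_fastq_name(run_files[0]))
print(has_index_read(run_files))
```

A run where `has_index_read` is false is the situation described in this ticket: R1/R2 are present but no I1/I2 file.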

I'll email the corresponding authors asking for the index files and close this ticket if they don't get back in a week. We could also wrangle only the analysis files, now that we accept them.

ipediez commented 2 years ago

First email response:

If it's hashing antibody sequences to demultiplex donors per reaction, then both the tag sequences and how they pair up with individual samples and reactions are in the supplementary data in Mendeley data linked to the publication here: DOI: 10.17632/gncg57p5x9.2. Sample overview (sheet 1) has the tag ID and 10x reaction paired with sample information, while sheet 22 (CITE-Seq Antibody Oligo Seq) has the sequences.

De-hashed, clustered and labelled data is also available as Seurat objects via the below link, in case that helps: https://www.dropbox.com/sh/qmhdbqh56cp0t3n/AADR2cj522sh36R0luzxJJTOa?dl=0

Second email response:

The fastq files uploaded are already demultiplexed, i.e. each entry in the GEO series is a separate library with identical i5/i7 Illumina indexes, so there's no need for a separate index read fastq file. The index sequences are available in the read header in the files we've uploaded if you absolutely need them, though I'm not 100% sure how/if SRA preserves that information. There are ~7-9 individual samples within each library that are hashed using antibodies, though, which you can de-hash using the paired antibody libraries (on GEO, indicated as "Hashing Antibody Library" in the name or "Second Tag Hashing Antibody Library") and the meta data on Mendeley data. There are two separate antibody libraries used (as opposed to the standard hashing procedure originally described here: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1603-1) because the commercial TotalSeq does not work so well for early fetal samples. Please see this section of our paper for an overview of the demultiplexing strategy we used, if you haven't seen it already: https://www.sciencedirect.com/science/article/pii/S009286742031686X#sec3.5.13.

Hope that helps to clarify things - would be happy to go over the data in more detail over a zoom call some time if you'd like, as I realise it's not a straight-forward "one sample, one library" experimental set up and it can be difficult to convey clearly how exactly the data fits together given the limitations of SRA/GEO.

It seems that the uploaded files are in fact already demultiplexed at the library level, so we could include them.
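If we do need the index sequences, they can be pulled from the read header as the authors suggest, provided the header is still in the original Illumina (Casava 1.8+) form `@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> <read>:<filtered>:<control>:<index>`. A minimal sketch follows; the example headers are invented, and, as the authors note, SRA may have rewritten the headers, in which case this simply returns `None`.

```python
def index_from_header(header_line):
    """Return the index field of a Casava 1.8+ FASTQ header, or None if absent.

    The last colon-separated field of the comment part is either the index
    sequence (e.g. "ACGTACGT+TGCATGCA") or, in some runs, a sample number.
    """
    if not header_line.startswith("@"):
        return None
    parts = header_line.strip().split(" ", 1)
    if len(parts) < 2:
        return None  # no comment part, so no index information
    fields = parts[1].split(":")
    if len(fields) < 4 or not fields[3]:
        return None  # not the <read>:<filtered>:<control>:<index> layout
    return fields[3]


# Invented headers, for illustration only.
original = "@A00123:45:HXXXXXXXX:3:1101:1000:2000 1:N:0:ACGTACGT+TGCATGCA"
rewritten = "@SRR0000000.1 1 length=28"
print(index_from_header(original))
print(index_from_header(rewritten))
```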

idazucchi commented 2 years ago

Should we upload the raw data? @ESapenaVentura to take a look

Wkt8 commented 2 years ago

Let's upload the raw data.

ipediez commented 2 years ago

Waiting on schema changes. Needed consulting/secondary review especially of the following elements:

Imaging preparation protocol: numerical aperture appears as a required field. Is it required? How can we know it if it's not stated in the paper? Shall I email the contributors?
Imaging protocol Channel and Imaging protocol Probe: Just to check, are these tabs for fluorescence microscopy? In this case, it's bright-field microscopy, and they do not state in the paper or images using a specific channel or probe. They do use H&E (hematoxylin and eosin) staining.
Image files, JSON files: Do we need to include them? What ontology term could we use for content description?
Visium modeling: is it OK?

ESapenaVentura commented 2 years ago

@Wkt8 or @ami-day to review the questions above

ami-day commented 2 years ago

Waiting on schema changes. Needed consulting/secondary review especially of the following elements:

Imaging preparation protocol: numerical aperture appears as a required field. Is it required? How can we know it if it's not stated in the paper? Shall I email the contributors?

We have requested a metadata schema update to change this field to "not required". For now it might be worth asking the authors, or moving the project to stalled until the metadata schema update is finalised.

Imaging protocol Channel and Imaging protocol Probe: Just to check, are these tabs for fluorescence microscopy? In this case, it's bright-field microscopy, and they do not state in the paper or images using a specific channel or probe. They do use H&E (hematoxylin and eosin) staining.

Yes, the channel is referring to fluorescence microscopy.

The probe can be for example a fluorescence-labelled antibody or an RNA probe which hybridises to RNA in-situ and is detected with a fluorescence labelled probe. E.g. https://acdbio.com/science/applications/research-areas/covid-19-coronavirus?_bk=&_bt=544177046605&_bm=&_bn=g&gclid=CjwKCAiAyPyQBhB6EiwAFUuakt0CyJ2n1lTAZX4d17lgvRsKfW8-kQUZ9sy1tXCDhiTmqRe1KkQyHBoCaOsQAvD_BwE

Some experiments consist of a "probe panel" with 100-1000s of probes. E.g. https://nanostring.com/products/geomx-digital-spatial-profiler/geomx-rna-assays/geomx-cancer-transcriptome-atlas/

In the case of 10X Visium or NanoString Digital Spatial Profiling, the probes are used to identify regions of interest. RNAScope probes can be used to do this, despite it not being a sequencing technology itself.

Image files, JSON files: Do we need to include them? What ontology term could we use for content description?

I believe we should definitely include the image files linked to each imaged specimen. Specifically, the image file with the 10X barcoded spots and coordinates (in the case of 10X). This would enable a user to map spatial regions of interest to the spatial barcodes according to their own annotations in the image. The contributor's annotations should also be provided, if they have them. E.g. certain tissue structures or morphology.
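To illustrate why the spot-coordinate file matters: with it, a user can go from a region they annotated in the full-resolution image to the spatial barcodes inside it. A rough sketch, assuming the older headerless Space Ranger `tissue_positions_list.csv` layout (barcode, in_tissue, array_row, array_col, pxl_row_in_fullres, pxl_col_in_fullres); newer outputs use a file with a header row, and the barcodes and coordinates below are invented.

```python
import csv
import io

# Column order of the headerless tissue_positions_list.csv (assumed layout).
COLUMNS = [
    "barcode", "in_tissue", "array_row", "array_col",
    "pxl_row_in_fullres", "pxl_col_in_fullres",
]


def barcodes_in_region(positions_csv, row_range, col_range):
    """Return barcodes of in-tissue spots whose centre lies in a pixel rectangle."""
    reader = csv.DictReader(positions_csv, fieldnames=COLUMNS)
    selected = []
    for record in reader:
        if record["in_tissue"] != "1":
            continue  # spot is not under tissue
        row = int(record["pxl_row_in_fullres"])
        col = int(record["pxl_col_in_fullres"])
        if row_range[0] <= row <= row_range[1] and col_range[0] <= col <= col_range[1]:
            selected.append(record["barcode"])
    return selected


# Invented example content, for illustration only.
example = io.StringIO(
    "AAACAAGTATCTCCCA-1,1,50,102,1000,2000\n"
    "AAACACCAATAACTGC-1,1,59,19,5000,6000\n"
    "AAACAGAGCGACTCCT-1,0,14,94,1100,2100\n"
)
print(barcodes_in_region(example, row_range=(900, 1200), col_range=(1900, 2200)))
```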

Visium modeling: is it OK?

ami-day commented 2 years ago

Secondary review:

This looks really good to me, only a couple of comments:

Project Contributors

Author names are in the incorrect format. They should be "first name,middle name initial,surname" OR, when there is no middle name, "first name,,surname".
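As a quick illustration of that format (the helper and the names are hypothetical, not part of any ingest tooling):

```python
def format_contributor_name(first, surname, middle=""):
    """Build 'first,middle initial,surname'; the middle slot stays empty if unknown."""
    middle_initial = middle.strip()[:1].upper()
    return f"{first.strip()},{middle_initial},{surname.strip()}"


# Made-up names, for illustration only.
print(format_contributor_name("Jane", "Doe"))
print(format_contributor_name("Jane", "Doe", middle="quinn"))
```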

Donor organism

Development stage could be more specific. E.g. HsapDv:0000047: 10th week post-fertilization human stage

Collection protocol

I think "dissection" is more part of the dissociation process. I think "biopsy" might be more accurate http://purl.obolibrary.org/obo/ERO_0001334

Enrichment protocol

It looks like there should be an antibody staining protocol.

library preparation protocol

If an HTO antibody staining protocol was applied to cell suspensions, there should be an HTO library per sample/pooled sample.

Image file / Imaged specimen

I'm wondering if we should ideally link either the image files to biosamples accessions in the image file tab or imaged specimens to biosamples accessions in the imaged specimen tab?

Analysis file

Are/were you able to get an annotations file from the authors? Specifically I mean annotations within the image, for example, structural or morphological features/landmarks.

E.g. from the publication, they mention identifying anatomical landmarks: "All fetal intestinal tissues were examined to identify anatomical landmarks (stomach, Meckel’s diverticulum, and/or appendix) and if present tissues from Terminal Ileum (TI), proximal colon and distal colon were separated for processing. In low gestation samples (≤12pcw) where only a small amount of colonic tissue remained, the entire tissue would be processed as “hindgut” without a proximal/distal division. TI was sampled by taking < 2cm upstream of appendix; similarly, in early gestation sampling was performed from the region upstream of the appendix or hindgut, due to small size of samples at these time points this tissue was also termed distal SI as it may extend past the TI."

General

I don't see ontologies for a lot of tabs/columns. I guess you planned on adding them later?

ipediez commented 2 years ago

Project Contributors: corrected
Donor organism: Corrected
Collection protocol: Biopsy is suitable for the living donor organism, but not for the fetal samples as they are not living at the moment of collection.
Enrichment protocol: corrected
Library preparation protocol: corrected
Image file / Imaged specimen: corrected
Analysis file: TI, hindgut and the different parts of the colon are each separate samples. For example, the sample AAQ_ti has only terminal ileum cells. The same applies to Visium, where the sample D_ti1 only includes tissue from the terminal ileum. This is described in the organ and organ part metadata.
General: The ontologies will be added once the schema changes that are blocking the dataset are applied, as a new ontology branch is missing

ipediez commented 2 years ago

Asked for an ontology term for the JSON files here

ipediez commented 2 years ago

We should use the following terms for the JSON files, or maybe 'image metadata' for both:

spot diameter: data:3108 Experimental measurement

scale factors: data:3546 'Image metadata'

ipediez commented 2 years ago

All the files with content description "molecular property identifier" need to be changed to "cell barcode"
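A minimal sketch of that bulk change, assuming the file rows have been read into dictionaries; the column name `content_description` and the file names are placeholders, not the exact spreadsheet headers.

```python
def relabel_content_description(
    rows, column, old="molecular property identifier", new="cell barcode"
):
    """Replace `old` with `new` in the given column, in place; return the count."""
    changed = 0
    for row in rows:
        if row.get(column, "").strip().lower() == old:
            row[column] = new
            changed += 1
    return changed


# Placeholder rows and column name, for illustration only.
file_rows = [
    {"file_name": "barcodes.tsv.gz", "content_description": "molecular property identifier"},
    {"file_name": "matrix.mtx.gz", "content_description": "count matrix"},
]
print(relabel_content_description(file_rows, "content_description"))
print(file_rows[0]["content_description"])
```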

ebi-ait / hca-ebi-wrangler-central

GSE158702 - Spatiotemporal Analysis of Human Intestinal Development at Single Cell Resolution #239