ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

GSE158702 - Spatiotemporal Analysis of Human Intestinal Development at Single Cell Resolution #239

Closed ESapenaVentura closed 2 years ago

ESapenaVentura commented 3 years ago

Primary Wrangler: Irene

Secondary Wrangler: TBD

Associated files:

Google Drive: https://drive.google.com/drive/folders/1bO1XfeFXYQJ7Hk0p364-O9DDsqS-upaB

Published study links

Paper: https://www.sciencedirect.com/science/article/pii/S009286742031686X#sec3

Accessioned data: GSE158702

Ingest: https://contribute.data.humancellatlas.org/projects/detail?uuid=fa3f460f-4fb9-4ced-b548-8ba6a8ecae3f

Key Events

ESapenaVentura commented 3 years ago

Impossible to demux without submitter input - Runs are missing index files

Wkt8 commented 3 years ago

Meeting with @Wkt8 @ESapenaVentura and @ami-day to talk about demultiplexing.

ami-day commented 2 years ago

@ESapenaVentura @gabsie shall we close this ticket and move it to "ineligible" status in ingest, because we can't accept multiplexed data without the demultiplexing info?

Wkt8 commented 2 years ago

Happy for this to happen - or to move it to the "stalled" status. Either way we can't move forwards on this.

ipediez commented 2 years ago

Following the information provided by 10X Genomics, having no index sequenced can happen 1) by accident or 2) when someone only loads a single sample in the lane and reasonably figures they don't need to sequence the sample index. Nevertheless, it's always a good idea to sequence the indexes for QC reasons.

Although this looks like an HCA paper, a lot of samples seem to have been sequenced in lane 3, judging by the FASTQ file naming convention.

I'll email the corresponding authors asking for the index files and close this ticket if they don't get back in a week. We could also wrangle only the analysis files, now that we accept them.
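
The lane observation above depends on the standard Illumina/bcl2fastq file naming pattern, `<sample>_S<n>_L<lane>_<read>_001.fastq.gz`. A minimal Python sketch of that check, assuming the files follow this pattern; the file names in the example are hypothetical and not taken from the dataset:

```python
import re
from collections import Counter

# Illumina/bcl2fastq-style name: <sample>_S<n>_L<lane>_<R1|R2|I1|I2>_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>[RI][12])_001\.fastq(\.gz)?$"
)


def parse_fastq_name(name):
    """Return the components of the file name as a dict, or None if it does not match."""
    match = FASTQ_NAME.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["lane"] = int(parts["lane"])
    return parts


def summarise(names):
    """Count files per lane and report whether any index reads (I1/I2) are present."""
    parsed = [p for p in map(parse_fastq_name, names) if p is not None]
    lanes = Counter(p["lane"] for p in parsed)
    has_index = any(p["read"].startswith("I") for p in parsed)
    return lanes, has_index


# Hypothetical file names, for illustration only.
example = [
    "sampleA_S1_L003_R1_001.fastq.gz",
    "sampleA_S1_L003_R2_001.fastq.gz",
    "sampleB_S2_L001_R1_001.fastq.gz",
]
lanes, has_index = summarise(example)
```

Here `has_index` is false because none of the example names is an `I1`/`I2` file, which is the situation described in this ticket.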

ipediez commented 2 years ago

First email response:

If it's hashing antibody sequences to demultiplex donors per reaction, then both the tag sequences and how they pair up with individual samples and reactions are in the supplementary data in Mendeley data linked to the publication here: DOI: 10.17632/gncg57p5x9.2. Sample overview (sheet 1) has the tag ID and 10x reaction paired with sample information, while sheet 22 (CITE-Seq Antibody Oligo Seq) has the sequences.

De-hashed, clustered and labelled data is also available as Seurat objects via the below link, in case that helps: https://www.dropbox.com/sh/qmhdbqh56cp0t3n/AADR2cj522sh36R0luzxJJTOa?dl=0
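
To make the pairing described above concrete, here is a minimal Python sketch of how a (10x reaction, tag ID) pair could be resolved to a sample once per-cell tag counts exist. It is not the authors' de-hashing strategy: the rows, sample names, field names and thresholds are all hypothetical, and the naive "top tag" rule only illustrates the lookup.

```python
# Hypothetical rows mimicking the "Sample overview" sheet: one row per sample,
# giving the 10x reaction it was pooled into and its hashing tag ID.
sample_overview = [
    {"reaction": "rxn1", "tag_id": "tagA", "sample": "donor1_ileum"},
    {"reaction": "rxn1", "tag_id": "tagB", "sample": "donor2_colon"},
    {"reaction": "rxn2", "tag_id": "tagA", "sample": "donor3_ileum"},
]


def build_lookup(rows):
    """Map (reaction, tag_id) to sample; the same tag can be reused across reactions."""
    return {(row["reaction"], row["tag_id"]): row["sample"] for row in rows}


def call_sample(reaction, tag_counts, lookup, min_count=10, min_ratio=2.0):
    """Naive call: accept the top tag only if it is abundant and clearly dominant.

    Returns None for ambiguous cells (possible doublets) or cells with too few counts.
    """
    ranked = sorted(tag_counts.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    top_tag, top_count = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    if top_count < min_count or top_count < min_ratio * runner_up:
        return None
    return lookup.get((reaction, top_tag))


lookup = build_lookup(sample_overview)
```

Keying the lookup on both reaction and tag matters because the same tag ID identifies different samples in different reactions.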

Second email response:

The fastq files uploaded are already demultiplexed, i.e. each entry in the GEO series is a separate library with identical i5/i7 Illumina indexes, so there's no need for a separate index read fastq file. The index sequences are available in the read header in the files we've uploaded if you absolutely need them, though I'm not 100% sure how/if SRA preserves that information. There are ~7-9 individual samples within each library that are hashed using antibodies, though, which you can de-hash using the paired antibody libraries (on GEO, indicated as "Hashing Antibody Library" in the name or "Second Tag Hashing Antibody Library") and the meta data on Mendeley data. There are two separate antibody libraries used (as opposed to the standard hashing procedure originally described here: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1603-1) because the commercial TotalSeq does not work so well for early fetal samples. Please see this section of our paper for an overview of the demultiplexing strategy we used, if you haven't seen it already: https://www.sciencedirect.com/science/article/pii/S009286742031686X#sec3.5.13.

Hope that helps to clarify things - would be happy to go over the data in more detail over a zoom call some time if you'd like, as I realise it's not a straight-forward "one sample, one library" experimental set up and it can be difficult to convey clearly how exactly the data fits together given the limitations of SRA/GEO.

It seems that the uploaded files are in fact already demultiplexed, so we could include them.
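
The second response mentions that the index sequences sit in the read headers. In the Illumina (Casava 1.8+) header layout the index is the last colon-separated field after the space, e.g. `@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`. A minimal Python sketch for checking whether a given FASTQ file still carries it, assuming that header layout; as the authors note, files downloaded from SRA may have rewritten headers, in which case the function returns `None`:

```python
import gzip
from collections import Counter
from itertools import islice


def index_from_header(header):
    """Return the index field of a Casava 1.8+ style header, or None if absent."""
    parts = header.strip().split(" ", 1)
    if len(parts) < 2:
        return None
    fields = parts[1].split(":")
    if len(fields) < 4 or not fields[3]:
        return None
    return fields[3]


def count_indexes(path, max_reads=10000):
    """Tally the index field over the first max_reads records of a FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        # Header lines are every 4th line of a FASTQ file, starting at the first.
        for header in islice(handle, 0, 4 * max_reads, 4):
            counts[index_from_header(header)] += 1
    return counts
```

A library that was demultiplexed as described would be expected to show essentially one index value per file; a tally dominated by `None` suggests the headers were not preserved.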

idazucchi commented 2 years ago

Should we upload the raw data? @ESapenaVentura to take a look

Wkt8 commented 2 years ago

Let's upload the raw data.

ipediez commented 2 years ago

Waiting on schema changes. Needed consulting/secondary review especially of the following elements:

ESapenaVentura commented 2 years ago

@Wkt8 or @ami-day to review the questions above

ami-day commented 2 years ago

Waiting on schema changes. Needed consulting/secondary review especially of the following elements:

  • Imaging preparation protocol: numerical aperture appears as a required field. Is it required? How can we know it if it's not stated on the paper? Shall I email the contributors?

We have requested a metadata schema update to modify this field to "not required". For now it might be worth asking the authors, or move it to stalled until the metadata schema update is finalised.

  • Imaging protocol Channel and Imaging protocol Probe: Just to check, are these tabs for fluorescence microscopy? In this case, it's bright-field microscopy, and they do not state in the paper or images using a specific channel or probe. They do use H&E (hematoxylin and eosin) staining.

Yes, the channel is referring to fluorescence microscopy.

The probe can be for example a fluorescence-labelled antibody or an RNA probe which hybridises to RNA in-situ and is detected with a fluorescence labelled probe. E.g. https://acdbio.com/science/applications/research-areas/covid-19-coronavirus?_bk=&_bt=544177046605&_bm=&_bn=g&gclid=CjwKCAiAyPyQBhB6EiwAFUuakt0CyJ2n1lTAZX4d17lgvRsKfW8-kQUZ9sy1tXCDhiTmqRe1KkQyHBoCaOsQAvD_BwE

Some experiments consist of a "probe panel" with 100-1000s of probes. E.g. https://nanostring.com/products/geomx-digital-spatial-profiler/geomx-rna-assays/geomx-cancer-transcriptome-atlas/

In the case of 10X Visium or NanoString Digital Spatial Profiling, the probes are used to identify regions of interest. RNAscope probes can be used to do this, despite RNAscope not being a sequencing technology itself.

  • Image files, JSON files: Do we need to include them? What ontology term could we use for content description?

I believe we should definitely include the image files linked to each imaged specimen. Specifically, the image file with the 10X barcoded spots and coordinates (in the case of 10X). This would enable a user to map spatial regions of interest to the spatial barcodes according to their own annotations in the image. The contributor's annotations should also be provided, if they have them. E.g. certain tissue structures or morphology.
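
As an illustration of why the spot coordinate file is useful, the mapping from an image region to spatial barcodes can be sketched as follows. This is a minimal sketch assuming the header-less Space Ranger `tissue_positions_list.csv` layout (barcode, in_tissue, array_row, array_col, pxl_row_in_fullres, pxl_col_in_fullres); newer Space Ranger versions write a `tissue_positions.csv` with a header row, which this sketch does not handle. The barcodes and coordinates in the example are made up.

```python
import csv
import io

COLUMNS = [
    "barcode", "in_tissue", "array_row", "array_col",
    "pxl_row_in_fullres", "pxl_col_in_fullres",
]


def read_positions(handle):
    """Parse header-less spot position rows into a list of dicts."""
    spots = []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        spot = dict(zip(COLUMNS, row))
        for key in COLUMNS[1:]:
            # Tolerate values written as floats (e.g. "7474.0").
            spot[key] = int(float(spot[key]))
        spots.append(spot)
    return spots


def barcodes_in_region(spots, row_min, row_max, col_min, col_max):
    """Barcodes of in-tissue spots whose full-resolution pixel centre lies in the box."""
    return [
        spot["barcode"]
        for spot in spots
        if spot["in_tissue"] == 1
        and row_min <= spot["pxl_row_in_fullres"] <= row_max
        and col_min <= spot["pxl_col_in_fullres"] <= col_max
    ]


# Made-up rows, for illustration only.
example_csv = io.StringIO(
    "AAACAAGTATCTCCCA-1,1,50,102,7474,8500\n"
    "AAACACCAATAACTGC-1,1,59,19,8552,2788\n"
    "AAACAGAGCGACTCCT-1,0,14,94,3163,7950\n"
)
spots = read_positions(example_csv)
selected = barcodes_in_region(spots, 7000, 9000, 8000, 9000)
```

A user with their own annotation of a tissue structure would replace the rectangular box with whatever region test suits the annotation; the selected barcodes then link back to the expression matrix.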

  • Visium modeling: is it OK?

ami-day commented 2 years ago

Secondary review:

This looks really good to me, only a couple of comments:

Project Contributors

Donor organism

Collection protocol

Enrichment protocol

Library preparation protocol

Image file / Imaged specimen

Analysis file

E.g. from the publication, they mention identifying anatomical landmarks: "All fetal intestinal tissues were examined to identify anatomical landmarks (stomach, Meckel’s diverticulum, and/or appendix) and if present tissues from Terminal Ileum (TI), proximal colon and distal colon were separated for processing. In low gestation samples (≤12pcw) where only a small amount of colonic tissue remained, the entire tissue would be processed as “hindgut” without a proximal/distal division. TI was sampled by taking < 2cm upstream of appendix; similarly, in early gestation sampling was performed from the region upstream of the appendix or hindgut, due to small size of samples at these time points this tissue was also termed distal SI as it may extend past the TI."

General

ipediez commented 2 years ago

Asked for an ontology term for the JSON files here

ipediez commented 2 years ago

We should use the following terms for the JSON files, or maybe 'Image metadata' for both:

spot diameter: data:3108 'Experimental measurement'

scale factors: data:3546 'Image metadata'
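
For reference, the proposal above can be written down as a small lookup against the keys of a Space Ranger `scalefactors_json.json` file. This is only a sketch of the proposed mapping, not an agreed convention: the key names are the ones Space Ranger writes, the assignment of keys to terms follows the suggestion in this comment, and the numeric values are made up. Keys the proposal does not mention (such as the fiducial diameter) are deliberately left unassigned.

```python
import json

# Proposed content-description terms, keyed by scalefactors JSON field.
PROPOSED_TERMS = {
    "spot_diameter_fullres": ("data:3108", "Experimental measurement"),
    "tissue_hires_scalef": ("data:3546", "Image metadata"),
    "tissue_lowres_scalef": ("data:3546", "Image metadata"),
}


def annotate(scalefactors):
    """Pair each JSON key with its proposed term, or None if no term was proposed."""
    return {key: PROPOSED_TERMS.get(key) for key in scalefactors}


# Made-up values, for illustration only.
example = json.loads(
    '{"spot_diameter_fullres": 89.4, "tissue_hires_scalef": 0.17,'
    ' "fiducial_diameter_fullres": 144.4, "tissue_lowres_scalef": 0.05}'
)
annotated = annotate(example)
```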

ipediez commented 2 years ago

For all files whose content description is currently "molecular property identifier", the content description needs to be changed to "cell barcode".