ESapenaVentura closed this 2 years ago
Impossible to demux without submitter input - Runs are missing index files
Meeting with @Wkt8 @ESapenaVentura and @ami-day to talk about demultiplexing.
@ESapenaVentura @gabsie shall we close this ticket and move it to "ineligible" status in ingest, because we can't accept multiplexed data without the demultiplexing info?
Happy for this to happen - or to move it to the "stalled" status. Either way we can't move forwards on this.
Following the information provided by 10X Genomics, having no index sequenced can happen 1) by accident or 2) when someone only loads a single sample in the lane and reasonably figures they don't need to sequence the sample index. Nevertheless, it's always a good idea to sequence the indexes for QC reasons.
Although this looks like an HCA paper, a lot of samples seem to be sequenced in lane 3, following the FASTQ file naming convention.
I'll email the corresponding authors asking for the index files and close this ticket if they don't get back in a week. We could also wrangle only the analysis files, now that we accept them.
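To make the "missing index files" check repeatable, here is a minimal sketch that groups FASTQ file names by sample and lane and reports the groups with no `I1` file. It assumes the files follow the usual bcl2fastq-style pattern `<sample>_S<n>_L<lane>_<R1|R2|I1|I2>_001.fastq.gz`; the helper name and the example file names are hypothetical, not taken from this dataset.

```python
import re
from collections import defaultdict

# Assumed bcl2fastq-style naming: <sample>_S<n>_L<lane>_<read>_001.fastq[.gz]
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>[RI][12])_001\.fastq(?:\.gz)?$"
)


def runs_missing_index(filenames):
    """Return sorted (sample, lane) pairs that have no I1 (index read) file."""
    reads_seen = defaultdict(set)
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match is None:
            continue  # ignore files that do not follow the assumed pattern
        key = (match["sample"], int(match["lane"]))
        reads_seen[key].add(match["read"])
    return sorted(key for key, reads in reads_seen.items() if "I1" not in reads)


# Hypothetical file names, for illustration only.
files = [
    "sampleA_S1_L003_R1_001.fastq.gz",
    "sampleA_S1_L003_R2_001.fastq.gz",
    "sampleB_S2_L001_R1_001.fastq.gz",
    "sampleB_S2_L001_R2_001.fastq.gz",
    "sampleB_S2_L001_I1_001.fastq.gz",
]
missing = runs_missing_index(files)
```

A group showing up in `missing` only means no index file was uploaded under that naming pattern; it does not by itself tell us whether the library was multiplexed.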
First email response:
If it's hashing antibody sequences to demultiplex donors per reaction, then both the tag sequences and how they pair up with individual samples and reactions are in the supplementary data in Mendeley data linked to the publication here: DOI: 10.17632/gncg57p5x9.2. Sample overview (sheet 1) has the tag ID and 10x reaction paired with sample information, while sheet 22 (CITE-Seq Antibody Oligo Seq) has the sequences.
De-hashed, clustered and labelled data is also available as Seurat objects via the below link, in case that helps: https://www.dropbox.com/sh/qmhdbqh56cp0t3n/AADR2cj522sh36R0luzxJJTOa?dl=0
Second email response:
The fastq files uploaded are already demultiplexed, i.e. each entry in the GEO series is a separate library with identical i5/i7 Illumina indexes, so there's no need for a separate index read fastq file. The index sequences are available in the read header in the files we've uploaded if you absolutely need them, though I'm not 100% sure how/if SRA preserves that information. There are ~7-9 individual samples within each library that are hashed using antibodies, though, which you can de-hash using the paired antibody libraries (on GEO, indicated as "Hashing Antibody Library" in the name or "Second Tag Hashing Antibody Library") and the meta data on Mendeley data. There are two separate antibody libraries used (as opposed to the standard hashing procedure originally described here: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1603-1) because the commercial TotalSeq does not work so well for early fetal samples. Please see this section of our paper for an overview of the demultiplexing strategy we used, if you haven't seen it already: https://www.sciencedirect.com/science/article/pii/S009286742031686X#sec3.5.13.
Hope that helps to clarify things - would be happy to go over the data in more detail over a zoom call some time if you'd like, as I realise it's not a straight-forward "one sample, one library" experimental set up and it can be difficult to convey clearly how exactly the data fits together given the limitations of SRA/GEO.
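Since the second response says the index sequences are in the read headers, a quick way to check whether they survived the upload is to tally the index field across the first reads of a file. This is only a sketch: it assumes Illumina (Casava 1.8+) style headers, where the part after the space is `<read>:<is_filtered>:<control_number>:<index>`. The function names and the example header are hypothetical. Headers rewritten by SRA may not carry this field at all, in which case the function returns `None`.

```python
from collections import Counter
from typing import Iterable, Optional


def index_from_header(header: str) -> Optional[str]:
    """Return the index field of a Casava 1.8+ style FASTQ header, or None."""
    if not header.startswith("@"):
        raise ValueError("not a FASTQ header line")
    parts = header.strip().split(" ", 1)
    if len(parts) < 2:
        return None  # no comment field after the read name
    fields = parts[1].split(":")
    # Assumed comment layout: <read>:<is_filtered>:<control_number>:<index>
    if len(fields) < 4 or not fields[3]:
        return None
    return fields[3]


def count_indexes(fastq_lines: Iterable[str], max_reads: int = 10000) -> Counter:
    """Tally index values over the first max_reads records (4 lines each)."""
    counts = Counter()
    for line_number, line in enumerate(fastq_lines):
        if line_number >= max_reads * 4:
            break
        if line_number % 4 == 0:  # header line of each record
            counts[index_from_header(line)] += 1
    return counts


# Hypothetical record, for illustration only.
record = [
    "@INSTR:45:FLOWCELL:3:1101:1000:2000 1:N:0:ACGTACGT+TGCATGCA\n",
    "ACGT\n",
    "+\n",
    "FFFF\n",
]
counts = count_indexes(record * 2)
```

If a library really is already demultiplexed, as the authors describe, one would expect a single dominant index value (allowing for a few sequencing errors) rather than several.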
It seems that the uploaded files are in fact demultiplexed, so we could include them.
Should we upload the raw data? @ESapenaVentura to take a look
Let's upload the raw data.
Waiting on schema changes. Needed consulting/secondary review especially of the following elements:
@Wkt8 or @ami-day to review the questions above
Waiting on schema changes. Needed consulting/secondary review especially of the following elements:
- Imaging preparation protocol: numerical aperture appears as a required field. Is it required? How can we know it if it's not stated on the paper? Shall I email the contributors?
We have requested a metadata schema update to modify this field to "not required". For now it might be worth asking the authors, or moving it to stalled until the metadata schema update is finalised.
- Imaging protocol Channel and Imaging protocol Probe: Just to check, are these tabs for fluorescence microscopy? In this case, it's bright-field microscopy, and they do not state in the paper or images using a specific channel or probe. They do use H&E (hematoxylin and eosin) staining.
Yes, the channel is referring to fluorescence microscopy.
The probe can be for example a fluorescence-labelled antibody or an RNA probe which hybridises to RNA in-situ and is detected with a fluorescence labelled probe. E.g. https://acdbio.com/science/applications/research-areas/covid-19-coronavirus?_bk=&_bt=544177046605&_bm=&_bn=g&gclid=CjwKCAiAyPyQBhB6EiwAFUuakt0CyJ2n1lTAZX4d17lgvRsKfW8-kQUZ9sy1tXCDhiTmqRe1KkQyHBoCaOsQAvD_BwE
Some experiments consist of a "probe panel" with 100-1000s of probes. E.g. https://nanostring.com/products/geomx-digital-spatial-profiler/geomx-rna-assays/geomx-cancer-transcriptome-atlas/
In the case of 10X Visium or NanoString Digital Spatial Profiling, the probes are used to identify regions of interest. RNAScope probes can be used to do this, despite it not being a sequencing technology itself.
- Image files, JSON files: Do we need to include them? What ontology term could we use for content description?
I believe we should definitely include the image files linked to each imaged specimen. Specifically, the image file with the 10X barcoded spots and coordinates (in the case of 10X). This would enable a user to map spatial regions of interest to the spatial barcodes according to their own annotations in the image. The contributor's annotations should also be provided, if they have them. E.g. certain tissue structures or morphology.
- Visium modeling: is it OK?
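On the image-file question above, this is roughly what a user could do once the spot coordinates are provided alongside the image: pick a rectangular region of interest in full-resolution pixel space and pull out the barcodes of the spots inside it. It is a sketch under assumptions: the input is taken to follow the Space Ranger tissue positions layout (`barcode, in_tissue, array_row, array_col, pxl_row_in_fullres, pxl_col_in_fullres`), the function name is hypothetical, and the barcodes and coordinates in the example are made up.

```python
import csv
import io

# Assumed column order of a Space Ranger tissue positions file.
COLUMNS = [
    "barcode",
    "in_tissue",
    "array_row",
    "array_col",
    "pxl_row_in_fullres",
    "pxl_col_in_fullres",
]


def barcodes_in_box(positions_csv, row_min, row_max, col_min, col_max):
    """Return barcodes of in-tissue spots whose centre lies inside the box."""
    selected = []
    for record in csv.reader(positions_csv):
        if not record or record[0] == "barcode":
            continue  # skip blank lines and an optional header row
        spot = dict(zip(COLUMNS, record))
        if spot["in_tissue"] != "1":
            continue
        row = float(spot["pxl_row_in_fullres"])
        col = float(spot["pxl_col_in_fullres"])
        if row_min <= row <= row_max and col_min <= col <= col_max:
            selected.append(spot["barcode"])
    return selected


# Made-up spots, for illustration only.
example = io.StringIO(
    "AAACAAGTATCTCCCA-1,1,50,102,7000,8000\n"
    "AAACACCAATAACTGC-1,1,59,19,9000,2000\n"
    "AAACAGAGCGACTCCT-1,0,14,94,7100,8100\n"
)
roi_barcodes = barcodes_in_box(example, 6500, 7500, 7500, 8500)
```

This only works if the coordinates file and the full-resolution image are both available, which is the argument for linking them to each imaged specimen.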
Secondary review:
This looks really good to me, only a couple of comments:
Project Contributors
Donor organism
Collection protocol
Enrichment protocol
library preparation protocol
Image file / Imaged specimen
Analysis file
E.g. from the publication, they mention identifying anatomical landmarks: "All fetal intestinal tissues were examined to identify anatomical landmarks (stomach, Meckel’s diverticulum, and/or appendix) and if present tissues from Terminal Ileum (TI), proximal colon and distal colon were separated for processing. In low gestation samples (≤12pcw) where only a small amount of colonic tissue remained, the entire tissue would be processed as “hindgut” without a proximal/distal division. TI was sampled by taking < 2cm upstream of appendix; similarly, in early gestation sampling was performed from the region upstream of the appendix or hindgut, due to small size of samples at these time points this tissue was also termed distal SI as it may extend past the TI."
General
Project Contributors: corrected
Donor organism: Corrected
Collection protocol: Biopsy is suitable for the living donor organism, but not for the fetal samples as they are not living at the moment of collection.
Enrichment protocol: corrected
Library preparation protocol: corrected
Image file / Imaged specimen: corrected
Analysis file: TI, hindgut and the different parts of the colon each form individual samples. For example, the sample AAQ_ti has only terminal ileum cells. The same happens for Visium, where the sample D_ti1 only includes tissue from the terminal ileum. This is described in the metadata organ and organ part.
General: The ontologies will be added once the schema changes that are blocking the dataset are applied, as a new ontology branch is missing.
We should use the following terms for the JSON files, or maybe 'image metadata' for both:
spot diameter: data:3108 Experimental measurement
scale factors: data:3546 'Image metadata'
All the files with content description "molecular property identifier" need to be changed to "cell barcode"
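The relabelling above could be scripted rather than done by hand. The sketch below assumes the file rows have been exported as a list of dictionaries with a `content_description` key; that key and the function name are hypothetical and would need to be matched to the actual spreadsheet column.

```python
def relabel_content_description(
    rows,
    old="molecular property identifier",
    new="cell barcode",
    key="content_description",  # hypothetical column name
):
    """Return copies of rows with `old` replaced by `new`, plus a change count."""
    updated = []
    changed = 0
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        value = (row.get(key) or "").strip().lower()
        if value == old:
            row[key] = new
            changed += 1
        updated.append(row)
    return updated, changed


# Made-up rows, for illustration only.
rows = [
    {"file_name": "a_barcodes.tsv.gz", "content_description": "molecular property identifier"},
    {"file_name": "a_scalefactors.json", "content_description": "Image metadata"},
]
updated_rows, n_changed = relabel_content_description(rows)
```

Returning the change count makes it easy to confirm that the number of relabelled files matches what was expected before re-uploading the spreadsheet.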
Primary Wrangler: Irene
Secondary Wrangler: TBD
Associated files:
Google Drive: https://drive.google.com/drive/folders/1bO1XfeFXYQJ7Hk0p364-O9DDsqS-upaB
Published study links
Paper: https://www.sciencedirect.com/science/article/pii/S009286742031686X#sec3
Accessioned data: GSE158702
Ingest: https://contribute.data.humancellatlas.org/projects/detail?uuid=fa3f460f-4fb9-4ced-b548-8ba6a8ecae3f
Key Events