chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License

Add modality #686

Open BAevermann opened 9 months ago

BAevermann commented 9 months ago

Design (@brianraymor)

obs

...

modality

Key modality
Annotator Curator MUST annotate.
Value categorical with str categories. This MUST be "epigenomics" or "transcriptomics".

This MUST be the correct type for the corresponding assay:

| For Assay | MUST Use |
|---|---|
| 10x multiome [EFO:0030059] | "epigenomics" or "transcriptomics" |
| 10x scATAC-seq [EFO:0030007] | "epigenomics" |
| 10x transcription profiling [EFO:0030080] and its descendants | "transcriptomics" |
| BD Rhapsody Targeted mRNA [EFO:0700004] | "transcriptomics" |
| BD Rhapsody Whole Transcriptome Analysis [EFO:0700003] | "transcriptomics" |
| CEL-seq2 [EFO:0010010] and its descendants | "transcriptomics" |
| DroNc-seq [EFO:0008720] | "transcriptomics" |
| Drop-seq [EFO:0008722] | "transcriptomics" |
| GEXSCOPE technology [EFO:0700011] | "transcriptomics" |
| inDrop [EFO:0008780] | "transcriptomics" |
| MARS-seq [EFO:0008796] | "transcriptomics" |
| mCT-seq [EFO:0030060] | ~"epigenomics" or~ "transcriptomics" |
| MERFISH [EFO:0008992] | "transcriptomics" |
| methylation profiling by high throughput sequencing [EFO:0002761] and its descendants | "epigenomics" |
| microwell-seq [EFO:0030002] | "transcriptomics" |
| Patch-seq [EFO:0008853] | "transcriptomics" |
| ScaleBio single cell RNA sequencing [EFO:0022490] | "transcriptomics" |
| scATAC-seq [EFO:0010891] | "epigenomics" |
| sci-Plex [EFO:0030026] | "transcriptomics" |
| sci-RNA-seq [EFO:0010550] and its descendants | "transcriptomics" |
| Seq-Well [EFO:0008919] and its descendants | "transcriptomics" |
| Smart-like [EFO:0010184] and its descendants | "transcriptomics" |
| spatial transcriptomics [EFO:0008994] and its descendants | "transcriptomics" |
| SPLiT-seq [EFO:0009919] | "transcriptomics" |
| STRT-seq [EFO:0008953] | "transcriptomics" |
| TruDrop [EFO:0700010] | "transcriptomics" |

If the assay does not appear in this table, the most appropriate value MUST be selected and the curation team informed during submission so that the assay can be added to the table.
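As a rough illustration of how the rule above might be checked, here is a minimal sketch in Python. It covers only a handful of exact-match rows from the table; rows marked "and its descendants" would need an ontology lookup that is not shown. The function name `check_modality` and the dictionary layout are hypothetical and are not part of any existing validator.

```python
# Illustrative subset of the table above (exact-match rows only).
# Rows that say "and its descendants" need an EFO lookup, which is omitted here.
ALLOWED_MODALITY = {
    "EFO:0030059": {"epigenomics", "transcriptomics"},  # 10x multiome
    "EFO:0030007": {"epigenomics"},                     # 10x scATAC-seq
    "EFO:0008722": {"transcriptomics"},                 # Drop-seq
    "EFO:0008992": {"transcriptomics"},                 # MERFISH
}

# The two values the draft currently permits for obs['modality'].
VALID_VALUES = {"epigenomics", "transcriptomics"}


def check_modality(assay_term_id: str, modality: str) -> list[str]:
    """Return a list of issues; an empty list means the pair looks acceptable."""
    issues = []
    if modality not in VALID_VALUES:
        issues.append(f"modality {modality!r} is not one of {sorted(VALID_VALUES)}")
    allowed = ALLOWED_MODALITY.get(assay_term_id)
    if allowed is None:
        # The draft says to pick the most appropriate value and tell the curation team.
        issues.append(f"assay {assay_term_id} is not in the table; inform the curation team")
    elif modality in VALID_VALUES and modality not in allowed:
        issues.append(f"assay {assay_term_id} requires one of {sorted(allowed)}, got {modality!r}")
    return issues


if __name__ == "__main__":
    print(check_modality("EFO:0030059", "epigenomics"))  # multiome accepts either value
    print(check_modality("EFO:0008722", "epigenomics"))  # Drop-seq must be transcriptomics
```

A dictionary of sets keeps the multi-valued 10x multiome row and the single-valued rows in the same shape, so one membership test handles both.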


Context

Currently, the CELLxGENE schema does not capture the concept of a detected analyte.

This concept is usually implied in the assay name, for example the mRNA analyte as detected by 10x 3' transcriptional profiling. However, for assays such as "10x multiome" the analyte detected is ambiguous as it measures both mRNA and open chromatin.

This distinction is required by downstream tools such as Census or Expression to filter supported vs unsupported data.
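To make the filtering use case concrete, here is a small sketch of how a downstream consumer might separate rows by modality. The rows are made-up stand-ins for `obs` records, and `select_modality` is a hypothetical helper, not a function from Census or any other tool.

```python
# Made-up rows standing in for obs records; real data would live in AnnData.obs.
obs_rows = [
    {"cell": "c1", "assay": "10x multiome", "modality": "transcriptomics"},
    {"cell": "c2", "assay": "10x multiome", "modality": "epigenomics"},
    {"cell": "c3", "assay": "Drop-seq", "modality": "transcriptomics"},
]


def select_modality(rows, modality):
    """Keep only the rows whose 'modality' value matches the requested one."""
    return [row for row in rows if row["modality"] == modality]


if __name__ == "__main__":
    # A tool that only supports expression data would keep c1 and c3.
    for row in select_modality(obs_rows, "transcriptomics"):
        print(row["cell"], row["assay"])
```

Without the `modality` value, the two 10x multiome rows would be indistinguishable by assay name alone.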

brianraymor commented 9 months ago

~Would this only be required when there was complete support for 10X multiome? Or is this being proposed to allow census to consume this assay now and be "future proofed" when/if ATAC is supported?~

~Reviewing the draft for the assay tier proposal:~

~1. 10X multiome (RNA) is experimental. And there are 60 datasets?~
~2. 10X multiome (ATAC) is unsupported.~

~How is 10X multiome currently modeled?~

jahilton commented 9 months ago

~Will also flag mCT-seq which measures both RNA & methylation (we have 1 Collection, which holds the expression data)~

~In Lattice, we use biological_macromolecule with an enum of RNA, DNA, or protein~

~As an alternative, I could also imagine a field that lists the type of measurement being represented (expression, accessibility, etc.)~

BAevermann commented 9 months ago

~Assortment of proposals were discussed on 11/30~

~Current proposal~
~Field name: Modality~
~Values: (Controlled vocab)~
~Transcriptomics~
~Epigenomics~
~Proteomics~
~Spatial Transcriptomics~
~Spatial Proteomics~
~in-situ hybridization assay~

~Will meet in early Q1 to discuss further.~

jahilton commented 4 months ago

Notes on the "Modality" proposal:

Based on the proposal, I took a stab at mapping each current assay in the corpus to the Modality values (this sheet) for others to review.

The biggest question is that I'm not sure how to characterize the morphology & electrophysiology measurements that are a part of Patch-seq (in addition to the transcriptomics).

pablo-gar commented 4 months ago

Agreed with Jason, we would be overloading this field with the addition of "spatial". This axis of variation is likely to be already captured by assay. The main goal as I read it is to distinguish between molecules for downstream applications, and with the upcoming support for Spatial, I don't see a need to overlap this variable.

I'd prefer to stick to the name of the molecule or the omics term.

BAevermann commented 4 months ago

Thanks for the mapping @jahilton!

jahilton commented 4 months ago
brianraymor commented 4 months ago

April 15 2024 (@BAevermann, @brianraymor, @jahilton, @jychien, @pablo-gar)

obs['modality']

  • transcriptomics
  • epigenomics
  • proteomics

brianraymor commented 4 months ago

@BAevermann, @jahilton, @jychien, @pablo-gar

Would you please review the draft in the top-level summary comment under Design. (I cannot submit a PR for this field because its schema version is unknown at this time.)

Comments, LGTM, or emojis all accepted. Also feel free to edit in place.

pablo-gar commented 4 months ago

@brianraymor I think it should be "proteomics" for spatial proteomics [EFO:0700000] and its descendants. @jahilton can confirm

otherwise LGTM

brianraymor commented 4 months ago

I think it should be "proteomics

Doh. Cut-n-paste error. Corrected.

jahilton commented 4 months ago

One too many 0 - mCT-seq [EFO:~0~0030060]

Need to add "...and its descendants" to sci-RNA-seq [EFO:0010550]. Otherwise, sci-RNA-seq3 is not covered.

Could add "...and its descendants" to scATAC-seq [EFO:0010891] and ditch the "10x scATAC-seq [EFO:0030007]" row

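Potential risks with relying on descendants for this one. Some hypotheticals:

  • If EFO does the appropriate thing and moves 10x multiome under both 10x transcription profiling? Fairly confident that won't happen unless we ask for it.
  • If a term is created that is a descendant of both spatial proteomics and spatial transcriptomics

I think we're focused on transcriptomics enough to be resistant to those rare occurrences, but just wanted to raise them.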

brianraymor commented 4 months ago

One too many 0 - mCT-seq [EFO:~0~0030060]

Good catch. I owe you a $1. Corrected.

Need to add "...and its descendants" to sci-RNA-seq [EFO:0010550]. Otherwise, sci-RNA-seq3 is not covered.

Added.

Could add "...and its descendants" to scATAC-seq [EFO:0010891] and ditch the "10x scATAC-seq [EFO:0030007]" row

The problem is that 10x multiome is a descendant of scATAC-seq.

Potential risks with relying on descendants for this one. Some hypotheticals:

  • If EFO does the appropriate thing and moves 10x multiome under both 10x transcription profiling? Fairly confident that won't happen unless we ask for it.
  • If a term is created that is a descendant of both spatial proteomics and spatial transcriptomics

I think we're focused on transcriptomics enough to be resistant to those rare occurrences, but just wanted to raise them.

This could be part of the review when the schema updates EFO in a version?
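One way such a review might be automated is sketched below against a toy parent map. Only the "10x multiome is a descendant of scATAC-seq" edge is stated in this thread; the other edges, the dual spatial term, and the names `PARENTS`, `DESCENDANT_RULES`, `EXPLICIT_RULES`, and `needs_review` are assumptions made purely for illustration. The scATAC-seq descendant rule is the hypothetical one suggested above, not the current table.

```python
# Toy child -> parents map; a real review would read the hierarchy from EFO.
PARENTS = {
    "10x multiome": ["scATAC-seq"],       # stated in this thread
    "10x scATAC-seq": ["scATAC-seq"],     # assumed for illustration
    "hypothetical dual spatial assay": [  # invented term, mirrors the second risk
        "spatial transcriptomics",
        "spatial proteomics",
    ],
}

# Rules of the form "<term> and its descendants" -> modality.
DESCENDANT_RULES = {
    "scATAC-seq": "epigenomics",  # the suggested (not adopted) descendant rule
    "spatial transcriptomics": "transcriptomics",
    "spatial proteomics": "proteomics",
}

# Rows that name a term explicitly and override anything inherited.
EXPLICIT_RULES = {
    "10x multiome": {"epigenomics", "transcriptomics"},
}


def ancestors(term):
    """Collect every ancestor of a term by walking the parent map."""
    seen = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def modalities_from_ancestors(term):
    """Modalities a term would pick up from descendant rules on itself or its ancestors."""
    candidates = ancestors(term) | {term}
    return {DESCENDANT_RULES[t] for t in candidates if t in DESCENDANT_RULES}


def needs_review(term):
    """Flag terms whose inherited modality is ambiguous or disagrees with an explicit row."""
    inherited = modalities_from_ancestors(term)
    explicit = EXPLICIT_RULES.get(term)
    if explicit is not None:
        return explicit != inherited
    return len(inherited) != 1


if __name__ == "__main__":
    for term in PARENTS:
        print(term, sorted(modalities_from_ancestors(term)), needs_review(term))
```

With these toy edges, 10x multiome is flagged because the descendant rule alone would give it only "epigenomics", which is the conflict described above.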

jychien commented 4 months ago

All the spatial assays are represented as descendants of either spatial transcriptomics or spatial proteomics. Is finer granularity required?

Not that we're accepting this assay, but FYI, NanoString digital spatial profiling is a child of spatial transcriptomics and it has in its definition 'spatial analysis of RNA and protein'. Other than that, the descendants of spatial transcriptomics or spatial proteomics look good to me.

brianraymor commented 4 months ago

I can decompose the spatial cases into individual supported assays if that's preferable.

jahilton commented 4 months ago

The problem is that 10x multiome is a descendant of scATAC-seq.

👍

I don't think further decomposing is needed. I think adding a review step whenever EFO is updated should be sufficient. And establishing a 'supported assay' list will also help as it will narrow the scope of that review

jahilton commented 4 months ago

With the updated submission policy, our only smFISH Datasets (which are private) will be removed, and no more will be accepted. So the "smFISH and its descendants" row can be simplified to MERFISH [EFO:0008992].

brianraymor commented 4 months ago

Updated. Added a note to Update requirements for suspension_type.

brianraymor commented 4 months ago

Per conversation with @jahilton and @BAevermann - it does not currently make sense to allow "epigenomics" as a value for mCT-seq. It is unsupported by the updated submission policy. There are no published datasets with this assay+modality combination. It's ~struck~ above.

brianraymor commented 3 weeks ago

Based on the renewed discovery for 10X multiome, I'm reverting this issue from schema 5.2.0 and re-opening.