chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License

Should the schema include a counts field #9

Closed brianraymor closed 2 years ago

brianraymor commented 3 years ago

@ambrosejcarr commented on Thu Aug 13 2020

Appetite: ?

This question is limited to 10x scRNA/snRNA and Smart-Seq2-like assays.

Should the schema include a counts field? If so, how is it modeled per framework/assay? UMI counts from 10x for example.


@ambrosejcarr commented on Fri Sep 11 2020

Addressing @mckinsel questions:

  1. What's the technical cost of leaving this optional? The requirement could be "observation IDs for processed cells must be contained in the set of IDs of raw cells". I do not think we should require unfiltered barcodes to be present, so if the cost of being unopinionated is high, I would suggest that raw and processed observation sets should match.
  2. Do you mean the supplementary table that 10x generates? (example link) Do archives capture these data? If we confirm, I'd support an optional "links to more data" section, and decline to hold these data.
  3. I think we need to treat transcripts like a separate data modality that we currently do not support. If we are getting data from users who want to retain transcript information, we should tell them they can choose to collapse their data by gene, but we recognize that decision may compromise their experiment. We should not enforce any Science Program submission requirements for those data at this time, assuming this doesn't become a recognized loophole around data submission. Some thoughts on this below which could seed that epic.

"Detected molecules of RNA per gene" (typical 10x 3' processing), "detected molecules of RNA per transcript" (transcript-aware RNA-seq processing, more commonly associated with SS2), "detected molecules of protein" (CITE-seq, CyTOF, MIBI), and "sequencing reads from promoter regions adjacent to genes" (sc-ATAC-seq) are separate data modalities and we should be aware of that in some way.

They can all be reduced to "observations of gene", and we may want to enable that conversion, but we should be careful, deliberate, and have a separate set of rules for each modality. When we get to CITE-seq data, those naturally correspond better to transcript-level data. The PTPRC gene is a good example of where we'll get tripped up, and in the future I expect we'll start to see phospho (active) and non-phospho (inactive) forms of proteins detected with CITE-seq, introducing additional complexity beyond what's captured at the transcript level.
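The containment requirement proposed in item 1 above ("observation IDs for processed cells must be contained in the set of IDs of raw cells") can be sketched in plain Python. This is only an illustration: the function name and the barcode strings are hypothetical, and nothing here is part of the schema or of any validator.

```python
def processed_ids_missing_from_raw(raw_ids, processed_ids):
    """Return processed observation IDs that do not appear among the raw IDs.

    An empty result means the processed set is contained in the raw set.
    """
    raw = set(raw_ids)
    return sorted(set(processed_ids) - raw)


# Hypothetical barcodes, for illustration only.
raw_barcodes = ["AAACCTGA-1", "AAACGGGT-1", "AAAGATGC-1", "AACTCAGG-1"]
processed_barcodes = ["AAACCTGA-1", "AAAGATGC-1"]

# Looser rule: processed IDs must be a subset of raw IDs.
missing = processed_ids_missing_from_raw(raw_barcodes, processed_barcodes)

# Stricter alternative from item 1: raw and processed observation sets match.
sets_match = set(raw_barcodes) == set(processed_barcodes)
```

With these inputs the subset rule passes while the stricter "sets must match" rule does not, which is the trade-off item 1 describes.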


@ambrosejcarr commented on Fri Sep 11 2020

Created chanzuckerberg/single-cell#56 to track support for other data modalities.


@brianraymor commented on Tue Oct 20 2020

@ambrosejcarr to follow up on "Do we want unfiltered barcodes from 10x? We actually got some feedback from one person when shopping around the schema that the answer is yes, though it was a nice-to-have. The problem is this would not be a proper layer in any format as its dimensions are different." and open a new issue as needed. The current position is that the answer is "no".
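To illustrate the dimension problem mentioned above: a layer is expected to have the same (observations x genes) shape as the main matrix, so a matrix over all unfiltered barcodes cannot sit alongside the filtered one. The sketch below is a plain-Python stand-in with hypothetical helper names (`shape`, `add_layer`) and made-up counts. It is not AnnData's or any other format's API.

```python
def shape(matrix):
    """Return (n_obs, n_vars) for a matrix stored as a list of rows."""
    return (len(matrix), len(matrix[0]) if matrix else 0)


def add_layer(layers, name, matrix, expected_shape):
    """Store a layer only if its shape matches the main matrix."""
    if shape(matrix) != expected_shape:
        raise ValueError(
            f"layer {name!r} has shape {shape(matrix)}, expected {expected_shape}"
        )
    layers[name] = matrix


# Made-up counts: 2 filtered cells x 2 genes.
filtered_counts = [[0, 3], [1, 0]]
# Made-up counts: 4 unfiltered barcodes x the same 2 genes.
unfiltered_counts = [[0, 3], [1, 0], [0, 0], [0, 1]]

layers = {}
add_layer(layers, "counts", filtered_counts, shape(filtered_counts))

try:
    add_layer(layers, "unfiltered", unfiltered_counts, shape(filtered_counts))
    rejected = False
except ValueError:
    # The unfiltered matrix has more rows (barcodes) than the filtered one.
    rejected = True
```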

brianraymor commented 3 years ago

@ambrosejcarr wrote:

I followed up with Malte Luecken on the question of "what is raw data". Malte agreed that we should encourage scientists to filter non-cell barcodes, given our use cases. He suggested that further QC should not be done on the raw matrices (e.g. the selection of highly variable genes), which I think is aligned with our recommendation and makes sense. He made a point that I thought was good about capturing spliced/unspliced counts when they're available:

Malte Luecken [12:31 PM] rather than separate spliced/unspliced... i would suggest just adding this info to the normal counts you collect ... as adata.layers['spliced'] and adata.layers['unspliced'] and adata.X or adata.layers['counts'] with the count data then I guess ...