How to encode datasets where raw and normalized count matrices have different shapes.

ambrosejcarr commented 3 years ago

@MaximilianLombardo and @pablo-gar identify that we often receive datasets where the "normalized" count data contain a subset of the gene features of the "raw" count data. This occurs because toolchains tend to filter genes to generate better clusters and low dimensional embeddings. This filtering typically removes low variance genes, and genes that are thought to explain more technical variation (ribosomal) or confounding stress information (mitochondrial) than interesting biological information.

While filtering is helpful for downstream steps in the initial analysis, it produces a data reuse problem when other scientists want to explore the expression of a specific gene or set of genes. Many times their genes of interest are not present in the filtered, normalized matrix. It also causes a visualization problem for cellxgene. Since normalized counts are the data that we visualize, and users may only visualize features that are present in the matrix.

I would prefer that data submitters use distance metrics that are aware of feature variance, so they don't need to filter variable genes. We'd also prefer that they regress out confounding signatures instead of filtering "troublesome genes" like mitochondrial and ribosomal genes. However, this is not common practice and the schema's primary goal is to best represent scientific data.

Proposed solution: @MaximilianLombardo suggests the following solution:

Our schema should require that raw and normalized data have the same set of features.
To meet this requirement, when we receive submissions, any feature present in raw but not in normalized should be added to normalized and its values filled with np.nan
An amendment is written to the schema explaining why including all features in the normalized dataset increases the breadth of users who will be able to reuse the data, and maximizes the value of submitted data, and explaining how step (2) above is applied to data that do not meet this requirement.

Corollary: This decision will affect the format of our downloaded files, which can't support matrices of different shapes.

Unresolved question: How do we visualize these np.nan columns in the explorer? cc @signechambers1 @colinmegill

cc @jahilton we're considering this approach instead of the one I suggested to you. The difference here is instead of subsetting to the features of the normalized matrix, we expand to the features of the raw matrix.

pablo-gar commented 3 years ago

I disagree with the proposed solution because filling missing genes with NaN has the following disadvantages (but there can be more we haven't thought about):

Potential miss interpretation of analyses including differential gene expression, where's my gene, gene clustering, etc. A user can make assumptions not limited to these ones: a genes was not covered by reads, a gene was lowly expressed, a gene was filtered out.
It could break cellxgene-independent tools.

For me an ideal solution is one that:

Requires to always have the full raw data.
Allows for storing a "processed" normalized (with lesser genes) expression matrix, i.e. extra layer or extra h5ad.
Requires normalized data with the same dimensions as raw. This doesn't have to be done by the authors as we can apply a standard normalization pipeline on the raw data.

I think we should think about alternative solutions before choosing the proposed one, or at least make sure that this solution is an appropriate one by contacting experts on the field.

ambrosejcarr commented 3 years ago

Thanks Pablo, I agree with you that the above proposal isn't ideal. I have some questions about your ideal solution.

Allows for storing a "processed" normalized (with lesser genes) expression matrix, i.e. extra layer or extra h5ad.

What reuse value do you see in capturing this data? Is it to capture the filtering output of the authors? If yes, is there anything else we'd need to capture to ensure we're passing along the value to other users of the data? For example, rationales for filtering?

Requires normalized data with the same dimensions as raw. This doesn't have to be done by the authors as we can apply a standard normalization pipeline on the raw data.

Is this the layer you'd use to visualize data in cellxgene in cases where the author's normalized matrix is missing genes? If yes, there are two challenges we've seen in the past with this approach:

While there are a number of "good" normalization strategies, it is difficult to pick one for each data, and some strategies require custom covariates to be used to remove technical variation. It's my experience that authors typically do a better job than third-party curators like us because they know their data better.
Assuming there are cases where we make different choices than the authors, it often means that the data no longer match the author provided embedding. Or the clusters. Or the cell type labels. When aggregating across studies, this is less problematic because we can average over large pools data, however for individual experiments it sows confusion equal to the divergence between normalization algorithms. It leads to feature creep, where we provide a "standard tertiary analysis, up to and including clustering". The DCP is showing that to be a fraught process, particularly because there's no good way (yet) for us to type the resulting cells.

I'm wondering if you see ways to mitigate these issues.

pablo-gar commented 3 years ago

What reuse value do you see in capturing this data? Is it to capture the filtering output of the authors? If yes, is there anything else we'd need to capture to ensure we're passing along the value to other users of the data? For example, rationales for filtering?

I was thinking about having an "intact" copy of the data as processed by the authors. I don't think we need to capture any more information if this is stored as a layer of the anndata. In schema v1.1.0 we require to have a high-level description of what transformations were applied in each layer.

Is this the layer you'd use to visualize data in cellxgene in cases where the author's normalized matrix is missing genes?

Yes

While there are a number of "good" normalization strategies, it is difficult to pick one for each data, and some strategies require custom covariates to be used to remove technical variation. It's my experience that authors typically do a better job than third-party curators like us because they know their data better.

I agree, this would be tradeoff. If we are performing internal curation we can try different iterations of normalizations to find the most adequate one (we would need to define what this means with the authors)

Assuming there are cases where we make different choices than the authors, it often means that the data no longer match the author provided embedding. Or the clusters. Or the cell type labels. When aggregating across studies, this is less problematic because we can average over large pools data, however for individual experiments it sows confusion equal to the divergence between normalization algorithms. It leads to feature creep, where we provide a "standard tertiary analysis, up to and including clustering". The DCP is showing that to be a fraught process, particularly because there's no good way (yet) for us to type the resulting cells.

I'm not sure I understand what you mean by not matching with embeddings or clusters or cell types. Those are cell-level metadata and can preserved as long as all cells were used for their creation. I do understand that the information that was used to create those embeddings would not be longer present, but it would not change the embeddings themselves -- I however can see how this may lead to its own bag of miss-interpretations.

ambrosejcarr commented 3 years ago

I do understand that the information that was used to create those embeddings would not be longer present, but it would not change the embeddings themselves -- I however can see how this may lead to its own bag of miss-interpretations.

This is what I was trying to highlight. :+1:

The embedding is supposed to be a low dimensional representation of the data. But, if we renormalized using a different algorithm than the authors without generating new embeddings, this would no longer be the case. Because of this, genes painted on the dataset may not distribute in a way that is consistent with the embedding. And If we updated the embeddings to fix this, then the clusters and cell type labels may no longer match the grouping observed in the embedding. There is a domino effect. And having to generate new cell type labels is high effort.

ambrosejcarr commented 3 years ago

We discussed this issue today. We discovered that AnnData files support different shapes in AnnData.raw and AnnData.X, which means there is no schema compatibility issue, and no padding needs to be done of AnnData.X to enable us to ingest those datasets.

To optimize data reuse, we agreed that we will ask authors not to remove genes from AnnData.X unless the rationale is to remove specific confounding. This will enable us to: (a) visualize genes in cellxgene that may not have been relevant for the original use case, but are of interest to a data consumer. (b) retain flexibility to display author-normalized data in cross-dataset use cases, if we decide we want to go that route.

We will update schema.md to reflect this request.

There is still a UX issue in the explorer, where some datasets may be missing genes, and there is no explanation to the user for the lack of consistency there. cc @signechambers1

jahilton commented 3 years ago

We have a matrix from a contributor in the R object format. It contains 2 layers - scTranform & LogNormalized. Each layer has a different number of genes. We'll add in a raw layer to this without filtering any genes, resulting in 3 layers, each with a different shape. Can you confirm that AnnData supports this case?

ambrosejcarr commented 3 years ago

@jahilton thank you for bringing this up. Here's where we landed:

Any non-raw layer must have the same shape, and have the same variable names.
We suggest padding layers that have subsets of columns with np.nan values (if dense matrices) or 0 (if sparse) so they can be submitted.

In your case, I believe scTransform has a subset of the columns in LogNormalized, so you would pad scTransform to make the shape match LogNormalized. The raw layer may differ in shape from the other two.

signechambers1 commented 3 years ago

@ambrosejcarr @pablo-gar @colinmegill and I discussed, we don't think there should be any issues with np.nan values on the front end. If a user loads a gene that only has these values, they will see a large bar on the left hand side of the histogram and coloring by these genes will turn the main image grey.

If you can point us to a dataset with the np.nan values we can load it up and confirm.

brianraymor commented 3 years ago

@ambrosejcarr - I believe that your final model were addressed in schema 2.0.0. Please close this issue if you're in agreement.

brianraymor commented 2 years ago

Closing due to no response.

chanzuckerberg / single-cell-curation

How to encode datasets where raw and normalized count matrices have different shapes. #23