chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License

Misformatted datasets/explorer instances in the MSK #693

Closed MaximilianLombardo closed 10 months ago

MaximilianLombardo commented 11 months ago

Bug report

Hi all - a user identified that plotting gene expression is not functional in all of the explorer instances for this collection: MSK SPECTRUM – Ovarian cancer mutational processes drive site-specific immune evasion.

When searching for a gene in the quick gene search, the gene appears and can be added to the UI. However, the expected histogram of gene expression values (distribution of expression values across all cells in the dataset) does not appear. This behavior is only observed for some genes (see screenshot below). Since other collections don't seem to be affected, it may be a curation/dataset formatting issue.

image

cc: @dominikglodzikhms

jahilton commented 11 months ago

Correct CELLxGENE URL: https://cellxgene.cziscience.com/collections/4796c91c-9d8f-4692-be43-347b1727f9d8

A quick look at the Dendritic Cells Dataset shows what does seem to be a general disagreement between the raw count layer & normalized layer for a handful of genes. So we'll keep this issue open to investigate and possibly revise.

However, MALAT1 & ATM currently have all 0 values for that Dataset, which can explain what you're seeing in Explorer. So if there is a separate issue that you expect a different Explorer experience, then that should be a separate ticket.

MaximilianLombardo commented 11 months ago

Thanks @jahilton

> However, MALAT1 & ATM currently have all 0 values for that Dataset, which can explain what you're seeing in Explorer. So if there is a separate issue that you expect a different Explorer experience, then that should be a separate ticket

Interesting, was that for both the raw and normalized layers? I've tried to replicate the error in other datasets/collections to no avail. This is the only dataset where I've been able to plot genes with 0 expression.

MaximilianLombardo commented 11 months ago

P.S. Just sharing some additional context on the expression values I got for MALAT1 in the dendritic cells object: I am seeing non-zero values in the normalized layer.

image

jahilton commented 11 months ago

@jychien & I have tracked down the curation side of the issue. The X layer in these Datasets has had significant gene filtering relative to the raw layer, and it looks like not all filtered genes were marked as feature_is_filtered=True. That is why some genes have raw data but all 0s in X.
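For anyone wanting to check a Dataset for this pattern, here is a minimal sketch of the comparison described above: find genes that have counts in the raw layer, are all 0s in X, and are not flagged as filtered. The function names and the toy matrices are hypothetical, and the sketch assumes X and the raw matrix hold the same genes in the same column order, which may not be true of every Dataset.

```python
import numpy as np


def nonzero_counts(matrix):
    """Number of cells with a non-zero value for each gene (column).

    Works for dense NumPy arrays; np.asarray(...).ravel() also flattens the
    matrix-like result a SciPy sparse matrix would return from sum().
    """
    return np.asarray((matrix != 0).sum(axis=0)).ravel()


def unmarked_filtered_genes(x, raw_x, feature_is_filtered, gene_names):
    """Genes with raw counts, all zeros in X, and feature_is_filtered=False.

    Assumes x and raw_x share the same genes in the same column order.
    """
    all_zero_in_x = nonzero_counts(x) == 0
    present_in_raw = nonzero_counts(raw_x) > 0
    flagged = np.asarray(feature_is_filtered, dtype=bool)
    suspect = all_zero_in_x & present_in_raw & ~flagged
    return [name for name, bad in zip(gene_names, suspect) if bad]


# Toy data: 3 cells x 4 genes. Values are made up for illustration only.
genes = ["MALAT1", "ATM", "GENE_C", "GENE_D"]
raw_x = np.array([[5, 0, 2, 0],
                  [3, 1, 0, 0],
                  [8, 0, 1, 0]])
x = np.array([[0.0, 0.0, 1.1, 0.0],
              [0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.7, 0.0]])
flags = [False, True, False, False]  # stand-in for var["feature_is_filtered"]

result = unmarked_filtered_genes(x, raw_x, flags, genes)
print(result)
```

With an AnnData object, the inputs would presumably be adata.X, adata.raw.X, adata.var["feature_is_filtered"] and adata.var_names, after confirming that the raw and final matrices list genes in the same order.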

(@MaximilianLombardo not sure about the array values you're seeing, but the UMAP plot is showing the expression from the raw layer - the plot may default to use_raw=True)

jahilton commented 10 months ago

✅ feature_is_filtered has been corrected & the revision published