AnnData to Seurat: fix dimensions of raw layer when needed

pablo-gar commented 3 years ago

This is to address @nh3 https://github.com/cellgeni/sceasy/pull/25#issuecomment-783803808

When raw.X and X have different numbers of genes and main_layer=['data'|'scaled_data'], anndata2seurat now resizes them so that both have the same number of genes.

nh3 commented 3 years ago

Thank you for the PR again!

Sorry for being a bit lengthy here to explain:

For each assay, a Seurat object requires counts and data to have the same dimensions, but allows fewer genes in scale.data, as the latter is meant to hold only highly variable genes, a subset of those in the former.
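To make the constraint concrete, here is a minimal sketch in plain Python that checks gene lists against the rules described above. The function name `check_assay_genes` and its arguments are hypothetical and are not part of Seurat or sceasy; it only illustrates the rule that counts and data must match while scale.data may be a subset.

```python
def check_assay_genes(counts_genes, data_genes, scale_data_genes):
    """Return a list of problems; an empty list means the layout is consistent.

    Hypothetical helper for illustration only, not a Seurat or sceasy API.
    """
    problems = []
    # counts and data must cover exactly the same genes, in the same order
    if list(counts_genes) != list(data_genes):
        problems.append("counts and data must have the same genes")
    # scale.data may hold fewer genes, but only genes that are present in data
    missing = sorted(set(scale_data_genes) - set(data_genes))
    if missing:
        problems.append("scale.data has genes absent from data: " + ", ".join(missing))
    return problems


# scale.data as a subset of data is fine
ok = check_assay_genes(["a", "b", "c"], ["a", "b", "c"], ["a", "c"])
# counts and data with different genes is reported as a problem
bad = check_assay_genes(["a", "b", "c"], ["a", "b"], ["a"])
```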

Scanpy, on the other hand, allows different numbers of genes in raw.X and X without enforcing which type of data is stored in each slot. The common scenario (referred to as scenario A), if one follows the Scanpy tutorial, is to store normalised and log-transformed data for all pass-QC genes in raw.X, and scaled data for highly variable genes (or all pass-QC genes if subset=False was passed to highly_variable_genes()) in X. So they can go directly into Seurat data and scale.data without any issue.

I guess your PR is for Scanpy objects submitted to CZI, where X stores normalised and log-transformed data for cellxgene and raw.X stores counts for raw data distribution (scenario B). In this case, there is no guarantee that the two have the same dimensions (normally they would differ, as X would contain pass-QC genes and raw.X was meant to contain all genes, though there's always variation in the data received), so something has to be done to get both into Seurat. I think trimming raw.X is a reasonable and perhaps the only viable solution. But would you mind making it specific to scenario B, please? Otherwise it would undesirably trim raw.X in scenario A too.
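As a rough illustration of the trimming idea, the sketch below subsets a cells-by-genes raw matrix to the genes present in X, and only does so when the caller says the object follows scenario B. It uses plain Python lists rather than AnnData, and the names `trim_raw_to_x` and `is_scenario_b` are hypothetical; this is not the code in the PR.

```python
def trim_raw_to_x(raw_rows, raw_genes, x_genes, is_scenario_b):
    """Subset raw (cells x genes) to the genes in X, in X's gene order.

    Hypothetical sketch: `is_scenario_b` stands in for however the caller
    decides that X holds normalised data and raw.X holds counts.
    """
    if not is_scenario_b:
        # Scenario A: raw.X goes to data and X to scale.data, so leave raw alone
        return raw_rows, list(raw_genes)
    column = {gene: i for i, gene in enumerate(raw_genes)}
    missing = [gene for gene in x_genes if gene not in column]
    if missing:
        raise ValueError("genes in X but not in raw.X: " + ", ".join(missing))
    keep = [column[gene] for gene in x_genes]
    trimmed = [[row[i] for i in keep] for row in raw_rows]
    return trimmed, list(x_genes)


raw_genes = ["g1", "g2", "g3", "g4"]
raw_rows = [[1, 0, 5, 2], [0, 3, 0, 7]]  # two cells, four genes
x_genes = ["g3", "g1"]                   # X kept only two genes

trimmed, trimmed_genes = trim_raw_to_x(raw_rows, raw_genes, x_genes, is_scenario_b=True)
untouched, untouched_genes = trim_raw_to_x(raw_rows, raw_genes, x_genes, is_scenario_b=False)
```

Reordering by X's gene order, rather than just filtering, keeps the columns of the two matrices aligned, which matters if they end up in the same assay.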

Again, many thanks for your PR!

nh3 commented 3 years ago

You could now merge this @pablo-gar?

pablo-gar commented 3 years ago

Thanks for the detailed description, all of it makes sense and your changes are on point!

Yes, we do see AnnData objects where X and raw.X have different numbers of genes, so resizing seems appropriate. We thought that one solution could be to store X in scale.data and raw.X in counts; however, the `scale.data` slot seems to be reserved for a specific transformation by Seurat, and it would be misleading to store anything that doesn't correspond to that.

Your commits have been reflected in my branch but I can't merge to master. Unless you see any other issues, would you mind merging it?

cellgeni / sceasy

AnnData to Seurat: fix dimensions of raw layer when needed #28