Closed by jashapiro 3 months ago
> the reduced dimension matrices are currently stored in `adata.obsm` as `X_PCA` and `X_UMAP`, but when created using `scanpy`, these are given the names `X_pca` and `X_umap`, respectively

Looking at the requirements for CELLxGENE, they just state that the `X_` prefix must be present, so I think we can change the names.
https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md#obsm-embeddings
> the matrices are stored after conversion as pandas DataFrames, but should be `numpy.ndarray` according to the AnnData spec. This may be an error in `zellkonverter` conversion (I have not checked what they look like with the latest conversion) or perhaps we need to convert to matrices before sending them to `zellkonverter`.

I'm assuming you are talking about the PCA/UMAP matrices? If so, then yes, those should be `numpy.ndarray`, which is also mentioned in the CZI schema I linked above. We should definitely address that.

These changes are now in `main`.
Is your feature request related to a problem? Please describe.
When working with `scanpy` and related tools (SEACells), I encountered a few issues that we might want to address when processing data. I am not sure whether these changes are required for full compatibility with CELLxGENE, but we probably want to look into it.

In particular:

- The reduced dimension matrices are currently stored in `adata.obsm` as `X_PCA` and `X_UMAP`, but when created using `scanpy`, these are given the names `X_pca` and `X_umap`, respectively.
- The matrices are stored after conversion as pandas DataFrames, but should be `numpy.ndarray` according to the AnnData spec. This may be an error in `zellkonverter` conversion (I have not checked what they look like with the latest conversion) or perhaps we need to convert to matrices before sending them to `zellkonverter`.

Some other things we might consider doing during/after conversion:

- Adding a `highly_variable` column with boolean values to the `var` table. This can be done with:
- Adding `adata.uns["pca"]` with the parameters and variance weights from the PCA. This would require some effort, as the structure of the object isn't the easiest to deal with and I don't think `zellkonverter` would handle it in its current form. For reference, the structure of that object as created by `scanpy`:

There is also a similar object for UMAP, `adata.uns["umap"]`, which contains only `{'params': {'a': 0.5830300205483709, 'b': 1.334166992455648}}`, but this seems less useful to keep. The `adata.uns["neighbors"]` object is also created as a step in UMAP creation, so we might want to consider whether that should be included as well.

Describe the solution you'd like
The first question is whether we need to rename the arrays and change their types for compatibility with CELLxGENE. If we do, or if the `scanpy` format is compatible, then we should replace our current DataFrames with matrices; otherwise we should keep both.

Then we need to determine if the types are changing at conversion or during input.
We also need to decide if we are going to be able to convert the variance arrays, and if not, how best to output them to allow conversion.
I expect there will be changes in both `sce_to_anndata.R` and `move_counts_anndata.py` (which might be renamed to be more general) to complete this task.
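If the fix ends up on the Python side, the rename and type change could be sketched roughly as below. This is only an illustration, not the actual pipeline code: the helper name `normalize_obsm` is hypothetical, and lowercasing everything after the `X_` prefix is an assumption that happens to match the `scanpy` names `X_pca` and `X_umap`. It works on any dict-like mapping, so it should also accept `adata.obsm` directly.

```python
import pandas as pd


def normalize_obsm(obsm):
    """Rename embeddings to scanpy-style keys and convert DataFrames to arrays.

    `obsm` is any dict-like mapping (for example `adata.obsm`) and is
    modified in place. Hypothetical helper, for illustration only.
    """
    # Copy the keys first, because the mapping is changed inside the loop.
    for key in list(obsm.keys()):
        value = obsm[key]

        # pandas DataFrames become plain numpy.ndarray; other types are kept.
        if isinstance(value, pd.DataFrame):
            value = value.to_numpy()

        # Assumption: lowercase the part after "X_" (X_PCA -> X_pca).
        new_key = "X_" + key[2:].lower() if key.startswith("X_") else key

        if new_key != key:
            del obsm[key]
        obsm[new_key] = value

    return obsm
```

Calling `normalize_obsm(adata.obsm)` after reading the converted file would then leave `X_pca` and `X_umap` as arrays, assuming the DataFrames contain only numeric columns.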
Describe alternatives you've considered
We could recalculate the PCA and UMAP at this stage as well, which would guarantee we fill in all values expected. For UMAP, this seems fine to me, as I don't really care that the UMAP for SCE and AnnData are identical, but it doesn't seem like the preferred solution for PCA.
Additional context
Some of the conversions/recalculations I was doing in prep for SEACells are here, but they do not really show the structure of the modified objects after running pca, neighbor, and umap functions: https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/792d88662bc88ba4d865cfb6d1b73f2387bca9b8/analyses/metacells/scripts/run-seacells.py#L25-L39
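For the `highly_variable` column mentioned earlier, one possible approach is sketched below. It assumes the list of highly variable gene IDs is already available from the SCE object in some form; the helper name `add_highly_variable` and the `hvg_ids` argument are hypothetical and not part of the current scripts.

```python
import pandas as pd


def add_highly_variable(var, hvg_ids):
    """Return a copy of a `var` table with a boolean `highly_variable` column.

    `var` is a DataFrame indexed by gene ID (like `adata.var`);
    `hvg_ids` is an iterable of gene IDs considered highly variable.
    """
    var = var.copy()
    # True where the gene ID appears in the highly variable gene list.
    var["highly_variable"] = var.index.isin(list(hvg_ids))
    return var
```

With an AnnData object this would be used as `adata.var = add_highly_variable(adata.var, hvg_ids)`.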