Update AnnData output for improved scanpy compatibility

jashapiro commented 4 months ago

Is your feature request related to a problem? Please describe.

When working with scanpy and related tools (SEACells), I encountered a few issues that we might want to address when processing data. I am not sure whether these changes are required for full compatibility with CellXGene, but we probably want to look into it.

In particular:

- The reduced dimension matrices are currently stored in adata.obsm as X_PCA and X_UMAP, but when created using scanpy, these are given the names X_pca and X_umap, respectively.
- The matrices are stored after conversion as pandas DataFrames, but should be numpy.ndarray according to the AnnData spec. This may be an error in the zellkonverter conversion (I have not checked what they look like with the latest conversion), or perhaps we need to convert to matrices before sending them to zellkonverter.
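
A minimal sketch of what the Python-side fix for both points could look like, assuming adata.obsm behaves like a mutable mapping. The function name `normalize_obsm` and the toy data are hypothetical, and a plain dict stands in for adata.obsm so the example runs without anndata installed:

```python
import numpy as np
import pandas as pd


def normalize_obsm(obsm, rename=None):
    """Rename embeddings to scanpy-style keys and coerce values to numpy arrays.

    `obsm` is any mutable mapping; adata.obsm is assumed to behave like one.
    """
    # Default mapping from the current names to the names scanpy uses.
    rename = rename or {"X_PCA": "X_pca", "X_UMAP": "X_umap"}
    for old_key in list(obsm.keys()):
        value = obsm[old_key]
        new_key = rename.get(old_key, old_key)
        # DataFrames lose their row/column labels here; only the values are kept.
        if isinstance(value, pd.DataFrame):
            array = value.to_numpy()
        else:
            array = np.asarray(value)
        if new_key != old_key:
            del obsm[old_key]
        obsm[new_key] = array
    return obsm


# Toy stand-in for adata.obsm with three cells (hypothetical data).
cells = ["cell1", "cell2", "cell3"]
obsm = {
    "X_PCA": pd.DataFrame(np.arange(6.0).reshape(3, 2), index=cells, columns=["PC1", "PC2"]),
    "X_UMAP": pd.DataFrame(np.zeros((3, 2)), index=cells, columns=["UMAP1", "UMAP2"]),
}
normalize_obsm(obsm)
```

Whether this belongs in the Python script or should instead be handled before the objects reach zellkonverter is still the open question.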

Some other things we might consider doing during/after conversion:

- Adding a highly_variable column with boolean values to the var table. This can be done with:

adata.var["highly_variable"] = adata.var.gene_ids.isin(adata.uns["highly_variable_genes"])

- Adding adata.uns["pca"] with the parameters and variance weights from the PCA. This would require some effort, as the structure of the object isn't the easiest to deal with and I don't think zellkonverter would handle it in its current form. For reference, this is the structure of that object as created by scanpy:
```
{
    'params': {
        'zero_center': True,
        'use_highly_variable': True,
        'mask_var': 'highly_variable'
    },
    'variance': array([529.79105617, 304.15370252, ...]),
    'variance_ratio': array([0.20832395, 0.11959904, ...])
}
```
There is also a similar object for UMAP, adata.uns["umap"], which contains only {'params': {'a': 0.5830300205483709, 'b': 1.334166992455648}}, but this seems less useful to keep. The adata.uns["neighbors"] object is also created as a step in UMAP creation, so we might want to consider whether that should be included as well.
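
If the variance arrays can be exported separately, assembling the scanpy-style object on the Python side might look roughly like the sketch below. The helper name `build_pca_uns` and its arguments are hypothetical, and how the two arrays would actually get out of the SCE object is not addressed here. The example values are the first two entries shown above:

```python
import numpy as np


def build_pca_uns(variance, variance_ratio, zero_center=True,
                  use_highly_variable=True, mask_var="highly_variable"):
    """Assemble a dict mirroring the structure scanpy stores in adata.uns["pca"]."""
    return {
        "params": {
            "zero_center": bool(zero_center),
            "use_highly_variable": bool(use_highly_variable),
            "mask_var": mask_var,
        },
        # One entry per principal component, stored as float arrays.
        "variance": np.asarray(variance, dtype=float),
        "variance_ratio": np.asarray(variance_ratio, dtype=float),
    }


pca_uns = build_pca_uns(
    variance=[529.79105617, 304.15370252],
    variance_ratio=[0.20832395, 0.11959904],
)
# With a real object, this would then be assigned as adata.uns["pca"] = pca_uns
```

The parameter values would need to reflect how the PCA was actually computed upstream rather than the scanpy defaults used here.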

Describe the solution you'd like

The first question is whether we need to rename the arrays and change their types for compatibility with CellXGene. If we do, or if the scanpy format is compatible, then we should replace our current DataFrames with matrices; otherwise, we should keep both.

Then we need to determine if the types are changing at conversion or during input.

We also need to decide if we are going to be able to convert the variance arrays, and if not, how best to output them to allow conversion.

I expect there will be changes in both sce_to_anndata.R and move_counts_anndata.py (which might be renamed to be more general) to complete this task.

Describe alternatives you've considered

We could recalculate the PCA and UMAP at this stage as well, which would guarantee we fill in all values expected. For UMAP, this seems fine to me, as I don't really care that the UMAP for SCE and AnnData are identical, but it doesn't seem like the preferred solution for PCA.

Additional context

Some of the conversions/recalculations I was doing in prep for SEACells are here, but they do not really show the structure of the modified objects after running pca, neighbor, and umap functions: https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/792d88662bc88ba4d865cfb6d1b73f2387bca9b8/analyses/metacells/scripts/run-seacells.py#L25-L39

allyhawkins commented 4 months ago

> the reduced dimension matrices are currently stored in adata.obsm as X_PCA and X_UMAP, but when created using scanpy, these are given the names X_pca and X_umap, respectively

Looking at the requirements for CELLXGENE, they just state that the X_ prefix must be present, so I think we can change the names. https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md#obsm-embeddings

> the matrices are stored after conversion as pandas DataFrames, but should be numpy.ndarray according to the AnnData spec. This may be an error in zellkonverter conversion (I have not checked what they look like with the latest conversion) or perhaps we need to convert to matrices before sending them to zellkonverter.

I'm assuming you are talking about the PCA/UMAP matrices? If so, then yes, those should be numpy.ndarray, which is also mentioned in the CZI schema I linked above. We should definitely address that.

jaclyn-taroni commented 3 months ago

These changes are now in main.

AlexsLemonade / scpca-nf

Update AnnData output for improved scanpy compatibility #773