AlexsLemonade / scpca-nf

scpca-nf is the Nextflow workflow for processing Single-cell Pediatric Cancer Atlas Portal data
BSD 3-Clause "New" or "Revised" License

Update AnnData output for improved scanpy compatibility #773

Closed jashapiro closed 1 month ago

jashapiro commented 1 month ago

Is your feature request related to a problem? Please describe.

When working with scanpy and related tools (SEACells), I encountered a few issues that we might want to address when processing data. I am not sure whether these changes are required for full compatibility with CellXGene, but we probably want to look into it.

In particular:

Some other things we might consider doing during/after conversion:

Describe the solution you'd like

The first question is whether we need to rename the arrays and change their types for compatibility with CELLXGENE. If we do, or if the scanpy format is compatible, then we should replace our current DataFrames with matrices; otherwise we should keep both.

Then we need to determine if the types are changing at conversion or during input.

We also need to decide if we are going to be able to convert the variance arrays, and if not, how best to output them to allow conversion.

I expect there will be changes in both sce_to_anndata.R and move_counts_anndata.py (which might be renamed to be more general) to complete this task.

Describe alternatives you've considered

We could recalculate the PCA and UMAP at this stage as well, which would guarantee we fill in all values expected. For UMAP, this seems fine to me, as I don't really care that the UMAP for SCE and AnnData are identical, but it doesn't seem like the preferred solution for PCA.

Additional context

Some of the conversions/recalculations I was doing in prep for SEACells are here, but they do not really show the structure of the modified objects after running pca, neighbor, and umap functions: https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/792d88662bc88ba4d865cfb6d1b73f2387bca9b8/analyses/metacells/scripts/run-seacells.py#L25-L39

allyhawkins commented 1 month ago

the reduced dimension matrices are currently stored in adata.obsm as X_PCA and X_UMAP, but when created using scanpy, these are given the names X_pca and X_umap, respectively

Looking at the requirements for CELLXGENE, they just state that the X_ prefix must be present, so I think we can change the names. https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md#obsm-embeddings
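
For illustration only, the rename could look something like the sketch below. It is not the code that ended up in the workflow. The helper name `rename_obsm_keys` is hypothetical, and a plain dict stands in for `adata.obsm`, on the assumption that `adata.obsm` behaves like a mutable mapping.

```python
def rename_obsm_keys(obsm, renames=None):
    """Rename embedding keys in place to the names scanpy uses.

    `obsm` is any mutable mapping; a plain dict stands in for adata.obsm here.
    """
    if renames is None:
        # Current names -> names scanpy assigns (both keep the X_ prefix)
        renames = {"X_PCA": "X_pca", "X_UMAP": "X_umap"}
    for old, new in renames.items():
        # Only rename when the old key exists and the new key is not already taken
        if old in obsm and new not in obsm:
            obsm[new] = obsm.pop(old)
    return obsm


# Toy stand-in for adata.obsm (hypothetical values)
obsm = {"X_PCA": [[0.1, 0.2]], "X_UMAP": [[1.0, 2.0]]}
rename_obsm_keys(obsm)
print(sorted(obsm))  # ['X_pca', 'X_umap']
```

Calling the helper a second time leaves the mapping unchanged, so it should be safe to run on objects that already use the scanpy names.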

the matrices are stored after conversion as pandas DataFrames, but should be numpy.ndarray according to the AnnData spec. This may be an error in zellkonverter conversion (I have not checked what they look like with the latest conversion) or perhaps we need to convert to matrices before sending them to zellkonverter.

I'm assuming you are talking about the PCA/UMAP matrices? If so, then yes, those should be numpy.ndarray, which is also mentioned in the CZI schema I linked above. We should definitely address that.
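
As a rough sketch of the type fix on the Python side, any pandas DataFrame stored in the embeddings could be replaced with its underlying array. The helper name `obsm_to_ndarray` is hypothetical and a plain dict again stands in for `adata.obsm`; whether the conversion belongs here or before the zellkonverter step is still the open question above.

```python
import numpy as np
import pandas as pd


def obsm_to_ndarray(obsm):
    """Replace any DataFrame values in a mapping with numpy arrays, in place."""
    for key in list(obsm.keys()):
        value = obsm[key]
        if isinstance(value, pd.DataFrame):
            # to_numpy() drops the row and column labels and keeps only the values
            obsm[key] = value.to_numpy()
    return obsm


# Toy stand-in for adata.obsm with a DataFrame embedding (hypothetical values)
obsm = {
    "X_PCA": pd.DataFrame(
        {"PC1": [0.1, 0.3], "PC2": [0.2, 0.4]},
        index=["cell1", "cell2"],
    )
}
obsm_to_ndarray(obsm)
print(type(obsm["X_PCA"]).__name__, obsm["X_PCA"].shape)  # ndarray (2, 2)
```

Note that the cell barcodes in the DataFrame index are discarded, so this assumes the row order already matches `adata.obs`.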

jaclyn-taroni commented 1 month ago

These changes are now in main.