chanzuckerberg / single-cell-data-portal

The data portal supporting the submission, exploration, and management of projects and datasets to cellxgene.
MIT License

Enforce canonical for seurat conversion. #7309

Closed Bento007 closed 3 weeks ago

Bento007 commented 2 months ago

Motivation

To convert an anndata dataset to Seurat, we need to enforce the canonical form of the sparse matrices. Because of how the anndata library reads and writes the sparse format, the whole dataset must be read into memory. This is an expensive operation that is slow and memory-intensive. Since this enforcement is only required for Seurat conversion, the step of enforcing canonical form should be moved to the Seurat conversion container. This will allow us to speed up validation.
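For context, here is a minimal sketch of what "canonical form" means for a scipy CSR matrix: column indices sorted within each row and no duplicate entries. The helper name `enforce_canonical` and the toy matrix are assumptions for illustration, not the portal's actual code.

```python
import numpy as np
from scipy import sparse


def enforce_canonical(matrix):
    """Hypothetical helper: put a CSR/CSC matrix into canonical form in place."""
    if not matrix.has_canonical_format:
        # sum_duplicates() sorts the indices and merges duplicate entries.
        matrix.sum_duplicates()
    return matrix


# Toy CSR matrix: row 0 stores column 1 twice, row 1 has unsorted columns.
data = np.array([1.0, 2.0, 3.0, 4.0])
indices = np.array([1, 1, 2, 0])
indptr = np.array([0, 2, 4])
X = sparse.csr_matrix((data, indices, indptr), shape=(2, 3))

dense_before = X.toarray()
was_canonical = X.has_canonical_format
enforce_canonical(X)
```

After the call the matrix represents the same values, but the duplicate pair in row 0 has been merged into one stored entry. Note that this operates on a matrix that is already fully in memory, which is the cost described above.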

Definition of Done

Tasks

nayib-jose-gloria commented 2 months ago

Estimate (incl testing): 2-4 days

@Bento007 for comment if you agree

nayib-jose-gloria commented 2 months ago

From @ivirshup:

Enforcing canonical format is the biggest consumer of memory because it requires reading in the whole X and raw.X matrix.

Btw, this should be possible without reading the whole matrix, since canonicalization should only remove entries. That part shouldn't matter if the canonicalization isn't happening in place though. I can advise on this if needed.

Recommend contacting him for more details, to see whether we can implement canonicalization w/o reading in the whole matrix, even in the seurat conversion.
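A rough sketch of the idea in the quote above, assuming the matrix can be processed one block of rows at a time. The function `canonicalize_in_row_chunks` and the `chunk_size` parameter are hypothetical. The example slices an in-memory scipy matrix only to stay self-contained; a real implementation would need to read each row block from the file on disk instead, which is not shown here.

```python
import numpy as np
from scipy import sparse


def canonicalize_in_row_chunks(matrix, chunk_size=1000):
    """Hypothetical helper: yield canonical CSR blocks, one row chunk at a time."""
    for start in range(0, matrix.shape[0], chunk_size):
        # Copy the block so the in-place canonicalization never touches the source.
        block = matrix[start:start + chunk_size].copy()
        # Sort indices and merge duplicates within this block only.
        block.sum_duplicates()
        yield block


# Toy CSR matrix: row 0 has a duplicate column, row 1 has unsorted columns.
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
indices = np.array([1, 1, 2, 0, 0])
indptr = np.array([0, 2, 4, 5])
raw = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))

blocks = list(canonicalize_in_row_chunks(raw, chunk_size=2))
result = sparse.vstack(blocks, format="csr")
```

Because canonical form is a per-row property in CSR, each block can be fixed independently, so peak memory is bounded by the chunk size rather than the full X or raw.X. The example stacks the blocks back together only to check the result; a streaming version would write each block out as it is produced.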