Teichlab / celltypist

A tool for semi-automatic cell type classification
https://www.celltypist.org/
MIT License

AnnData Backed Mode Support #131

Open kennypavan opened 2 months ago

kennypavan commented 2 months ago

Hello,

I'm attempting to train a large model from an AnnData object; however, memory issues persist when opening the file on our HPC with 512 GB of RAM. Naturally, I've attempted to open the file as a stream using the AnnData "backed" parameter and received the error:

> train.py:Line 341 
> flag = indata.sum(axis = 0) == 0
> AttributeError: 'Dataset' object has no attribute 'sum'

This error seems reasonable, since many of the aggregating functions wouldn't have access to the entire AnnData object in backed mode. Increasing memory beyond 512 GB for this task is not possible for us; it is a hard resource limit. Before I attempt to work around this by extending the train function to support backed mode, I'm wondering whether there is an existing solution for processing large-scale, atlas-level datasets with >4 million cells?

Thank you,
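For context on the error above: the failing line computes a per-gene sum to flag genes that are zero in every cell. The following is only a rough sketch of how that same check could be done block by block, not CellTypist's implementation. The helper name `chunked_column_sums` and the `chunk_size` parameter are hypothetical, and the sketch assumes the matrix exposes `.shape` and returns an in-memory block when sliced by rows.

```python
import numpy as np


def chunked_column_sums(matrix, chunk_size=100_000):
    """Sum each column of a 2-D, row-sliceable matrix one block of rows at a time.

    Hypothetical helper: assumes `matrix` has a `.shape` attribute and that
    `matrix[start:stop]` returns a dense or sparse block held in memory.
    """
    n_rows, n_cols = matrix.shape
    totals = np.zeros(n_cols, dtype=np.float64)
    for start in range(0, n_rows, chunk_size):
        block = matrix[start:start + chunk_size]  # only this block is loaded
        # np.asarray(...).ravel() also flattens the np.matrix that sparse sums return
        totals += np.asarray(block.sum(axis=0)).ravel()
    return totals


# Small in-memory stand-in for a cells-by-genes matrix (3 cells, 3 genes).
demo = np.array([[1, 0, 2],
                 [0, 0, 3],
                 [4, 0, 0]])

# Same test as `indata.sum(axis = 0) == 0`, but accumulated over row blocks.
all_zero_genes = chunked_column_sums(demo, chunk_size=2) == 0
print(all_zero_genes)  # [False  True False]
```

Whether a backed `X` can be sliced this way depends on the AnnData version and storage format, so treat that part as an assumption to verify.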

ChuanXu1 commented 2 months ago

@kennypavan, CellTypist does not support backed mode for the time being. As a workaround, you could load your raw count data, normalize and log1p-transform it, subset it to HVGs (highly variable genes), write the result out as a new AnnData file, and load that file for training. Note that you need to set check_expression = False and feature_selection = False when training on this data. In addition, you can also subset the cells.
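The workaround above might look roughly like the sketch below, which uses Scanpy for the preprocessing. The function name `prepare_and_train`, the file paths, the `cell_type` label column, and the choice of 3000 HVGs are all hypothetical placeholders, and the target of 10,000 counts per cell is an assumption based on the input format CellTypist documents. It has not been tested on atlas-scale data.

```python
def prepare_and_train(raw_path, reduced_path, label_key="cell_type", n_hvgs=3000):
    """Reduce a raw-count AnnData file to HVGs, save it, and train on the result.

    Hypothetical helper: every argument name and default is an illustrative
    assumption, not something prescribed by CellTypist.
    """
    # Imported inside the function because both are heavy third-party packages.
    import scanpy as sc
    import celltypist

    adata = sc.read_h5ad(raw_path)                # raw counts
    sc.pp.normalize_total(adata, target_sum=1e4)  # normalize each cell to 10,000 counts
    sc.pp.log1p(adata)                            # log1p transform
    # Keep only the highly variable genes; subset=True drops all other genes.
    sc.pp.highly_variable_genes(adata, n_top_genes=n_hvgs, subset=True)
    adata.write_h5ad(reduced_path)                # write out the smaller object

    # Reload the reduced file and train with the two checks switched off,
    # as suggested in the comment above.
    adata = sc.read_h5ad(reduced_path)
    return celltypist.train(
        adata,
        labels=label_key,
        check_expression=False,
        feature_selection=False,
    )
```

A call such as `prepare_and_train("atlas_raw.h5ad", "atlas_hvg.h5ad")` (placeholder file names) would return the trained model. To reduce memory further, the cells could also be subsampled before the write step, for example with `sc.pp.subsample`.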

kennypavan commented 2 months ago

@ChuanXu1 Thank you for the suggestions. I'll explore whether preprocessing and removing non-HVGs will work for our use case. Much appreciated!