Teichlab / celltypist

A tool for semi-automatic cell type classification
https://www.celltypist.org/
MIT License

RAM usage due to conversion from sparse to dense #29

Closed · chbeltz closed this issue 2 years ago

chbeltz commented 2 years ago

Is there a reason why the input data is converted to an np.array rather than accepting sparse matrices when running .train? Skimming the rest of the code, I cannot find anything that would not also work with sparse matrices. I am asking because this conversion to a dense array seems to be the reason I run out of RAM quite frequently when working with larger datasets.

Thanks

prete commented 2 years ago

Hi @chbeltz, thank you for using CellTypist!

I think the issue you're seeing has to do with the scaling technique that's used. Have a look at classifier.py > celltype, particularly at this:

        logger.info(f"⚖️ Scaling input data")
        means_ = self.model.scaler.mean_[lr_idx]
        sds_ = self.model.scaler.scale_[lr_idx]
        self.indata = (self.indata[:, k_x_idx] - means_) / sds_
        self.indata[self.indata > 10] = 10

Unfortunately, self.indata[:, k_x_idx] - means_ densifies the matrix, which is likely why you're running out of RAM with large datasets.
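
You can reproduce the densification outside CellTypist; a minimal sketch with made-up sizes (nothing here is CellTypist-specific):

    import numpy as np
    from scipy import sparse

    # Made-up cell-by-gene matrix and per-gene scaling factors.
    X = sparse.random(1_000, 2_000, density=0.05, format="csr", dtype=np.float32)
    means = np.random.rand(2_000).astype(np.float32)
    sds = np.random.rand(2_000).astype(np.float32) + 0.5

    print(type(X), X.data.nbytes)            # sparse: stores only ~nnz values

    # Centring assigns every previously-zero entry the value -mean/sd, so the
    # result cannot stay sparse and is materialised as a dense array.
    X_scaled = (X.toarray() - means) / sds
    print(type(X_scaled), X_scaled.nbytes)   # dense: n_cells * n_genes values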

chbeltz commented 2 years ago

Ah, I wasn't aware that sparse matrices were densified upon subtraction of a vector. That's unfortunate.

Thanks!

ChuanXu1 commented 2 years ago

@chbeltz, adding to this point, during training, scaling will also densify the matrix. You can, however, skip the feature selection step, which operates on all genes present in your data: subset your data to a set of informative genes (e.g., HVGs) beforehand, and disable the feature selection and expression check options during training. This also reduces RAM consumption; see the sketch below.
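
Roughly, that workflow could look like this (sketch only; the file name and label column are placeholders, and feature_selection / check_expression are the training options mentioned above):

    import scanpy as sc
    import celltypist

    # Hypothetical input with normalised, log1p-transformed expression and a
    # "cell_type" column in .obs.
    adata = sc.read_h5ad("my_data.h5ad")

    # Subset to informative genes (e.g. HVGs) up front ...
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    adata_hvg = adata[:, adata.var.highly_variable].copy()

    # ... then skip feature selection and the expression check during training.
    model = celltypist.train(
        adata_hvg,
        labels="cell_type",
        feature_selection=False,
        check_expression=False,
    )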

chbeltz commented 2 years ago

Would you consider implementing a use_sparse switch that preserves sparsity by skipping the subtraction of the mean during scaling? Same principle as the with_mean option of sklearn.preprocessing.StandardScaler.
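
For reference, a small sketch of the sklearn behaviour I mean (random sparse data, nothing CellTypist-specific):

    from scipy import sparse
    from sklearn.preprocessing import StandardScaler

    X = sparse.random(1_000, 2_000, density=0.05, format="csr")

    # with_mean=False scales by the standard deviation only, so the input stays
    # sparse; with_mean=True refuses sparse input, since centring would densify it.
    X_scaled = StandardScaler(with_mean=False).fit_transform(X)
    print(type(X_scaled))  # still a sparse matrix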

ChuanXu1 commented 2 years ago

@chbeltz, for SGD logistic regression, the sklearn documentation says:

"For best results using the default learning rate schedule, the data should have zero mean and unit variance."

Skipping the mean subtraction may therefore not be good practice, wdyt?

chbeltz commented 2 years ago

@ChuanXu1 I have not been able to find much empirical data on how non-zero-centered input distributions affect SGD performance, so I'm having a hard time weighing the pros and cons. However, if the alternative is that people with limited computing resources decide not to use the software at all, I think it's preferable to provide an option that may give less than optimal results, but results nonetheless.

ChuanXu1 commented 2 years ago

@chbeltz, that sounds reasonable. I added these changes (a with_mean parameter in celltypist.train) to optimize the RAM usage during training, at a possible cost of reduced performance: dfb11e05e3e95fb4906a24ae4f890988eba13031

This parameter will be available in the next version of CellTypist. Thx!
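
Roughly, usage could then look like this (sketch only; the file name and label column are placeholders, and with_mean is the parameter added in the commit above):

    import scanpy as sc
    import celltypist

    # Hypothetical input, already subset to informative genes (e.g. HVGs) as
    # discussed earlier in this thread.
    adata = sc.read_h5ad("my_hvg_data.h5ad")

    model = celltypist.train(
        adata,
        labels="cell_type",
        with_mean=False,           # skip mean-centring so the data can stay sparse
        feature_selection=False,
        check_expression=False,
    )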

chbeltz commented 2 years ago

Much appreciated, thank you!!