Hi @chbeltz, thank you for using CellTypist!
I think the issue you're seeing has to do with the scaling technique that's used. Have a look at `classifier.py` > `celltype`, particularly at this:

```python
logger.info(f"⚖️ Scaling input data")
means_ = self.model.scaler.mean_[lr_idx]
sds_ = self.model.scaler.scale_[lr_idx]
self.indata = (self.indata[:, k_x_idx] - means_) / sds_
self.indata[self.indata > 10] = 10
```
Unfortunately, `self.indata[:, k_x_idx] - means_` densifies the matrix, which can cause you to run out of RAM with large datasets.
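A minimal sketch (using only scipy/numpy, not CellTypist's actual code) showing why this happens: subtracting a dense vector from a sparse matrix forces a dense result, since the zeros become non-zero after centering.

```python
import numpy as np
from scipy import sparse

# A sparse matrix with ~1% non-zero entries (stand-in for expression data)
X = sparse.random(1000, 500, density=0.01, format="csr", random_state=0)
means = np.asarray(X.mean(axis=0)).ravel()

# Subtracting the per-column means broadcasts over all entries,
# so the result is dense even though X was sparse
result = X - means

print(sparse.issparse(X))       # True
print(sparse.issparse(result))  # False -- densified
```

At 1% density, the dense result here needs roughly 100x the memory of the sparse input, which is why the scaling step dominates RAM usage on large datasets.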
Ah, I wasn't aware that sparse matrices were densified upon subtraction of a vector. That's unfortunate.
Thanks!
@chbeltz, adding to this point: during training, scaling will also densify the matrix. You can skip the feature selection step, which otherwise uses all genes present in your data. That is, subset your data to a set of informative genes (e.g., HVGs), and disable feature selection and the expression check during training, which also reduces RAM consumption.
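A minimal sketch of the pre-subsetting idea, using only scipy/numpy rather than CellTypist or scanpy (in practice you would use your toolkit's HVG selection, e.g. scanpy's `highly_variable_genes`): per-gene variance can be computed without densifying, and training then only touches the reduced matrix.

```python
import numpy as np
from scipy import sparse

X = sparse.random(2000, 5000, density=0.02, format="csr", random_state=0)

# Var[x] = E[x^2] - E[x]^2, computed column-wise without densifying X
mean = np.asarray(X.mean(axis=0)).ravel()
mean_sq = np.asarray(X.multiply(X).mean(axis=0)).ravel()
variance = mean_sq - mean**2

# Keep the k most variable genes (columns)
k = 500
top_genes = np.argsort(variance)[-k:]
X_small = X[:, top_genes]  # still sparse, 10x fewer columns

print(X_small.shape)  # (2000, 500)
```

The densifying scaling step then operates on a matrix 10x narrower, with a proportional reduction in peak RAM.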
Could you imagine implementing a `use_sparse` switch that preserves sparsity by skipping the subtraction of the mean during scaling? Same principle as the `with_mean` option of `sklearn.preprocessing.StandardScaler`.
@chbeltz, for SGD logistic regression, according to the sklearn documentation:

> For best results using the default learning rate schedule, the data should have zero mean and unit variance.

Skipping the subtraction of the mean may not be good practice, wdyt?
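For reference, this is the behaviour of the sklearn option being discussed: `StandardScaler(with_mean=False)` divides by the standard deviation only, so sparse input stays sparse (whereas `with_mean=True` refuses sparse input outright, precisely because centering would densify it).

```python
from scipy import sparse
from sklearn.preprocessing import StandardScaler

X = sparse.random(1000, 200, density=0.05, format="csr", random_state=0)

# No centering -> sparsity is preserved
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X)

print(sparse.issparse(X_scaled))  # True
```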
@ChuanXu1 I have not been able to find much empirical data on how non-zero-centered input distributions affect SGD performance, so I'm having a hard time weighing the pros and cons. However, if the alternative is that people with limited computing resources decide not to use the software at all, it may be preferable to offer an option that yields less than optimal results, but results nonetheless.
@chbeltz, it sounds reasonable. I added these changes (a `with_mean` parameter in `celltypist.train`) to optimize RAM usage during training, at the possible cost of reduced performance: dfb11e05e3e95fb4906a24ae4f890988eba13031
This parameter will be available in the next version of CellTypist. Thx!
Much appreciated, thank you!!
Is there a reason why the input data is converted to an `np.array` rather than accepting sparse matrices when running `.train`? Skimming the remainder of the code, I cannot seem to find anything that would not also work with sparse matrices. The reason I am asking is that this conversion to an array seems to be the reason I find myself running out of RAM quite frequently when working with larger datasets.
Thanks