KlugerLab / ALRA

Imputation method for scRNA-seq based on low-rank approximation
MIT License

Large matrix errors (more than 2^31-1 non-zero entries) [on large datasets] #29

Open ghost opened 4 months ago

ghost commented 4 months ago

I'm working with a matrix of over 200,000 cells and 36,000 genes.

I first tried the 'RunALRA' function in Seurat. Then I extracted the expression table, converted it into a matrix, and tried ALRA directly (including alra.low.memory), but encountered the following error:

"Attempting to construct a sparseMatrix with at least 2^31-1 non-zero entries."

It appears that the dgCMatrix conversion fails because the matrix exceeds the limit on non-zero entries. Modifying the ALRA function code to use a general (dense) matrix format instead of dgCMatrix is possible, but running it is impractical because the matrix is nearly 100 GB.
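For scale, the arithmetic behind the error can be sketched as follows. The cell and gene counts are the lower bounds quoted above. Two things are assumptions: that the imputed result is essentially fully dense (the worst case), and that entries are stored at 8 bytes each.

```python
# Rough size arithmetic for the matrix described above.
# These are lower bounds: the real dataset has more than 200,000 cells.
INT32_MAX = 2**31 - 1          # 2,147,483,647, the limit named in the error message

n_cells = 200_000
n_genes = 36_000
n_entries = n_cells * n_genes  # number of non-zeros if every entry is non-zero

# Assumption: dense storage at 8 bytes per entry (double precision).
dense_gb = n_entries * 8 / 1e9

# Largest number of cells whose full cells x genes block stays strictly below the limit.
max_cells_per_block = (INT32_MAX - 1) // n_genes

print(f"entries: {n_entries:,} (limit {INT32_MAX:,})")
print(f"dense size: {dense_gb:.1f} GB")
print(f"max cells per block: {max_cells_per_block:,}")
```

Under these assumptions the entry count is more than three times the limit, and any subset of roughly 59,000 cells or fewer would stay under it even if fully dense.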

If there is a function or method that addresses this issue for large datasets like mine, I would appreciate any suggestions. I am currently considering the following three alternatives and would be grateful for your opinions on them as well:

Alternative A: Impute each sample separately, then integrate the results into one dataset. However, based on the experiences other users have reported in the issues here, normalizing and imputing the integrated data seems to yield more accurate results.

Alternative B: Normalize the integrated data, then impute on a reduced set of genes. However, the results may show different trends than imputation on the full gene set.

Alternative C (if cell type information is known): Normalize the integrated data first, then subset it by cell type, impute each cell type separately, and merge the results again. I think this alternative has the advantage of allowing all genes to be used. In addition, some genes are not expressed at all, or are expressed only in certain cell types, and I hope that imputing per cell type preserves this biological pattern. Since normalization was performed on the entire cell population, I believe expression levels can still be compared between cell types in the merged data after running ALRA on each cell type. If you have any suggestions for revising this reasoning, I would appreciate hearing them.
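As a rough illustration of Alternative C, the sketch below imputes each cell type separately and writes the results back into one matrix. It is not ALRA itself: `lowrank_impute` is a hypothetical, simplified stand-in (a rank-k SVD reconstruction followed by a per-gene threshold at the magnitude of the most negative reconstructed value), and the names `impute_by_group`, `labels`, and `k` are assumptions made for the example. The input is assumed to be an already normalized cells x genes array.

```python
import numpy as np

def lowrank_impute(X, k):
    """Simplified stand-in for a low-rank imputation step (not the ALRA implementation)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = min(k, len(s))
    X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]          # rank-k reconstruction
    # Per gene: treat anything no larger than |most negative value| as noise and zero it.
    thresh = np.abs(np.minimum(X_k.min(axis=0), 0.0))
    X_k[X_k <= thresh] = 0.0
    return X_k

def impute_by_group(X, labels, k):
    """Run the imputation separately on the cells of each group, then reassemble."""
    labels = np.asarray(labels)
    out = np.zeros(X.shape, dtype=float)
    for group in np.unique(labels):
        idx = np.flatnonzero(labels == group)
        out[idx, :] = lowrank_impute(X[idx, :], k)
    return out

# Toy example: 40 cells, 10 genes, two cell types; gene 0 is absent in type "A".
rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(40, 10)).astype(float)
labels = np.array(["A"] * 20 + ["B"] * 20)
X[:20, 0] = 0.0
imputed = impute_by_group(X, labels, k=3)
```

In this toy setup, a gene that is entirely zero within one cell type stays at zero (up to floating-point error) for that type, because the low-rank fit for that type never sees cells from the other types. Each per-type block is also smaller, which is what keeps it under the entry limit.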

ghost commented 4 months ago

In my attempts, Alternative C seems to give more biologically meaningful results than Alternative B. In my data, for genes that are expressed or not expressed in particular cell types (and likewise in the comparison between Normal and Tumor), the expression levels and the proportions of expressing cells are more accurate under Alternative C. Alternative B, by contrast, tends to exaggerate the expression levels or the proportion of expressing cells in my data.

ghost commented 4 months ago

@JunZhao1990 @linqiaozhi Firstly, I want to express my gratitude for developing ALRA. It's an incredibly useful tool. However, while analyzing large datasets in R, I've encountered issues related to large matrices. Is there by any chance a way to utilize ALRA within a Python-based environment like Scanpy?

P.S. At the following link, I found a Python port of ALRA from 6 years ago. Given the updates to ALRA since then, do you think this port can still be used effectively? https://github.com/pavlin-policar/ALRA