littlemingone opened this issue 2 months ago
Hi @littlemingone, do you think it's possible your computer was running out of memory for the tests that resulted in `std::bad_alloc` errors? Each matrix you're requesting would take around 16GB of RAM for the final object (at 8 bytes per numeric entry, roughly 2 × 10^9 entries comes to about 16GB), but BPCells will use several times that amount in intermediate steps -- on my own laptop I observed ~50GB of RAM usage before I had to force-quit. If it worked on someone else's computer, I wonder if it's just a matter of them having a lot more RAM available.
Do you have a reason you need to make such large dense matrices from BPCells objects? This isn't really a use-case I've optimized for.
As for functions to read large csv files, that's not currently on the roadmap for BPCells (though we're always open to look at pull requests if others have new features they have written and want to contribute). I might suggest trying the `data.table` or `vroom` libraries for fast reading of CSVs. Those libraries often also have an option to read a subset of the rows of a csv (e.g. via `skip` and `n_max` in vroom), which you might be able to use to read the file in several chunks to reduce your memory usage. It looks like the dataset you linked would take 38GB of RAM as a dense matrix, so you'd probably benefit from splitting it up into 5-10 chunks.
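Here's a rough, untested sketch of that chunked approach. The file path, chunk size, and column layout are placeholders you'd need to adapt to your data (it assumes one row per cell with names in the first column):

```r
library(vroom)
library(Matrix)

path       <- "counts.csv"   # placeholder path to the large csv
chunk_rows <- 20000          # rows per chunk; tune this to your available RAM

# Read just the header once so later chunks keep the right column names
header <- names(vroom(path, n_max = 0, show_col_types = FALSE))

chunks <- list()
offset <- 0
repeat {
  chunk <- vroom(path, skip = offset + 1, n_max = chunk_rows,
                 col_names = header, show_col_types = FALSE)
  if (nrow(chunk) == 0) break
  # Assumes the first column holds row names (e.g. cell barcodes)
  m <- Matrix(as.matrix(chunk[, -1]), sparse = TRUE)  # -> dgCMatrix
  rownames(m) <- chunk[[1]]
  chunks[[length(chunks) + 1]] <- m
  offset <- offset + nrow(chunk)
  if (nrow(chunk) < chunk_rows) break  # last (partial) chunk reached
}

mat <- do.call(rbind, chunks)  # sparse matrix assembled chunk by chunk
```

That way you only need memory for `chunk_rows` rows of dense data at a time, and the assembled object stays sparse.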
I'd also double-check your free RAM with e.g. Task Manager on Windows, since spilling out of RAM into swap space is also a potential cause of extreme slowness.
-Ben
In fact, I noticed this problem through what I think is a bug in `Seurat::RunPCA()`.
As I said, I was trying to handle a huge count matrix with more than 120K cells, and of course I used BPCells to accelerate the analysis. But at the `RunPCA()` step, I got an error about the dgCMatrix exceeding 2^31 values:
* dgCMatrix objects cannot hold more than 2^31 non-zero entries
* Input matrix has 3365020071 entries
The `RunPCA()` step uses the `scale.data` layer, so most of the values are non-zero. I scaled all genes when generating `scale.data`, in case I need some of them later to draw a heatmap. I ran `RunPCA()` with no `features` parameter, so it should use only the 2000 HVGs, and 2000 * 120,000 shouldn't exceed the 2^31 limit. But the error says the matrix has 3365020071 entries, which is exactly the size of the whole `scale.data` matrix. So `Seurat::RunPCA()` must be converting the whole `scale.data` matrix instead of only the part it needs, and with data at the 100K-cell level the complete `scale.data` will always exceed the dgCMatrix limit.
After the 2^31 error, I tried to convert `scale.data` into a dense matrix to work around the problem, and that's when I got the `std::bad_alloc` error. In the end, I fixed it by running `ScaleData()` with only the 2000 HVGs first, then `RunPCA()`. After generating the PCA data, I ran `ScaleData()` again with all genes.
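In code, the workaround looked roughly like this (`obj` stands for my Seurat object backed by BPCells; the exact arguments are just for illustration):

```r
library(Seurat)

obj <- FindVariableFeatures(obj, nfeatures = 2000)

# Scale only the variable features first so RunPCA() sees a small scale.data
obj <- ScaleData(obj, features = VariableFeatures(obj))
obj <- RunPCA(obj)

# Then re-scale all genes for downstream plots such as heatmaps
obj <- ScaleData(obj, features = rownames(obj))
```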
Maybe handling a huge dense matrix is not the use case you expected, but with the current version of Seurat, BPCells will run into this situation.
And about the RAM cost, the [R 4.4.1 docs](https://search.r-project.org/R/refmans/base/html/Memory-limits.html) say all objects are held in virtual memory. I had raised my virtual memory to 150GB before the test, and my physical RAM is 64GB, so I think it should have been OK. But there were some other tasks running on my computer at that time, so I am not sure. However, when I tried the same thing on a small subset of the data, I didn't get the `std::bad_alloc` error. I will try again later.
Ah, I see. It sounds like you are running into a Seurat issue, where `RunPCA()` inadvertently converts BPCells matrices to full in-memory matrices inside Seurat's internal `PrepDR5` function. I submitted a fix to Seurat in this pull request, which is now incorporated in the `develop` branch of Seurat but has yet to make it to the main branch or CRAN.

I've also described a workaround in this comment, which should avoid the issue even without re-installing Seurat from the `develop` branch.
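For reference, until the fix reaches CRAN, one way to get the `develop` branch is via the `remotes` package (a generic GitHub-install sketch, so double-check it against Seurat's own installation notes):

```r
# install.packages("remotes")
remotes::install_github("satijalab/seurat", ref = "develop")
```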
Hope that helps fix your issue
-Ben
I tried to use `as.matrix()` to convert an `IterableMatrix` into a normal matrix or a `dgCMatrix`. The matrix is huge but has fewer than 2^31 entries, so it shouldn't exceed the dgCMatrix limit, but I still got an error. The matrix is Seurat scale data, so there are no zeros in it.

matrix info

session info
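The conversions I tried look like this (`mat_disk` is a placeholder for the on-disk `IterableMatrix`):

```r
library(BPCells)

dense  <- as.matrix(mat_disk)        # IterableMatrix -> base dense matrix
sparse <- as(mat_disk, "dgCMatrix")  # IterableMatrix -> sparse dgCMatrix
```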
But I asked someone else to try the same thing, and it worked fine for them. I also tried a freshly installed R 4.4.1 (CRAN installer) with `install.packages("BPCells", repos = c("https://bnprks.r-universe.dev", "https://cloud.r-project.org"))`, read the on-disk data created before, and got the same error.

And, by the way, I think we might need a function that can read csv or tsv files as a `dgCMatrix` or some other low-memory object like the BPCells on-disk data, so we can read a huge csv without first reading it as a data.frame and then converting it to a matrix and a dgCMatrix. Some data are only distributed as csv files, for example this, containing more than 120K cells in a csv file. To analyze it, I had to read the whole file as a data.frame and then change its class with `as.matrix()` and `as('dgCMatrix')`. It was slow and a torment.
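For reference, the workflow I ended up with looked roughly like this (paths are placeholders, and `write_matrix_dir()`/`open_matrix_dir()` are the BPCells on-disk writer/reader as I understand them, so double-check the arguments against your installed version):

```r
library(Matrix)
library(BPCells)

# Slow part: read the whole csv as a data.frame, then make it sparse
df  <- read.csv("counts.csv", row.names = 1, check.names = FALSE)
mat <- as(as.matrix(df), "dgCMatrix")

# Write to BPCells' on-disk format once, so later sessions can just open it
write_matrix_dir(mat, dir = "counts_bpcells")
mat_disk <- open_matrix_dir("counts_bpcells")
```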