biosurf / cyCombine

Robust Integration of Single-Cell Cytometry Datasets

"memory not mapped" #52

Open franpcozar opened 3 months ago

franpcozar commented 3 months ago

I am using R version 4.3.3 on an x86_64-pc-linux-gnu (64-bit) system with 2TB RAM.

My dataset consists of 311 files with a total of 55,188,271 cells and 40 markers.

When I attempt to correct for the batch effect in all cells of my dataset, my R session crashes, specifically when running the function batch_correct(). All the previous steps worked nicely. I followed the pipeline described in https://biosurf.org/cyCombine_CyTOF_1panel.html#Checking_for_batch_effects

Here is a screenshot of the error: [screenshot: R session crash with "memory not mapped"]

It seems to be a memory problem, but I don't know why as I have 2TB of RAM.

Also, I conducted a test by downsampling to 10,000 cells per file (3,110,000 cells in total), and the pipeline worked perfectly.

I have updated all my packages (including the necessary dependencies), and I also tried running it in R instead of RStudio.

I am looking forward to your response :)

shdam commented 3 months ago

Hey there,

Thank you for using cyCombine!

It looks like the kohonen::som clustering method has a hard time allocating memory for such a big dataset, unfortunately.

I am working on a bit of an overhaul of cyCombine that should improve memory performance significantly, but it is nowhere near complete. In the meantime, I have made a minor update in the dev branch that you can install. It exposes the mode argument from kohonen::som, in the hope that its batch mode is more memory efficient. After installing the development version of cyCombine, try running batch_correct() with mode = "batch" to see if that solves the issue (see the sketch below).
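A minimal sketch of what that could look like, assuming the dev branch installs via remotes and that `uncorrected` and `markers` come from the earlier pipeline steps (adjust `covar` to your own metadata):

```r
# install.packages("remotes")
remotes::install_github("biosurf/cyCombine", ref = "dev")  # dev branch with the new mode argument

library(cyCombine)
corrected <- batch_correct(
  uncorrected,
  markers = markers,
  covar   = "condition",  # assumption: replace with your own covariate column, if any
  mode    = "batch"       # forwarded to kohonen::som; batch mode may allocate less memory
)
```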

Otherwise, you will have to find an alternative clustering method to kohonen::som. FlowSOM, for example, works directly on a flowSet, which might be more efficient (it requires converting your data frame to a matrix/flowFrame first - be mindful of matrix orientation and included markers).
You could normalize your uncorrected set with cyCombine::normalize(), cluster with FlowSOM (or another algorithm), and then run batch_correct on the unnormalized data with label set to the clustering labels for each cell. Better yet, you can then split the data into its clusters and run batch_correct on each subset individually, setting label to that cluster's number. This will significantly improve memory usage, with the only bottleneck being the clustering step (a sketch of this workflow follows below).
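A rough sketch of that workflow, assuming `uncorrected` and `markers` are defined as before and that batch_correct() accepts a per-cell label vector as described above; FlowSOM::SOM() works on a plain numeric matrix, so no flowFrame is strictly needed here:

```r
library(cyCombine)
library(FlowSOM)

# 1. Normalize a copy of the uncorrected data (used only for clustering)
normalized <- normalize(uncorrected, markers = markers, norm_method = "scale")

# 2. Cluster with FlowSOM's SOM, which accepts a numeric matrix directly
som_out <- FlowSOM::SOM(as.matrix(normalized[, markers]), xdim = 8, ydim = 8)
labels  <- som_out$mapping[, 1]  # winning SOM node per cell

# 3. Correct the *unnormalized* data, passing the precomputed labels
corrected <- batch_correct(uncorrected, markers = markers, label = labels)

# 4. Alternatively, split by cluster and correct each subset separately,
#    capping memory at the size of the largest cluster
corrected_list <- lapply(split(seq_len(nrow(uncorrected)), labels), function(idx) {
  batch_correct(uncorrected[idx, ], markers = markers, label = labels[idx])
})
corrected <- dplyr::bind_rows(corrected_list)
```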

These are the principles behind the future overhaul, but it will take me some time to finish the implementation.

Please let me know if either of the two approaches solves the memory challenge!

Best regards, Søren

franpcozar commented 3 months ago

Thanks for your answer! I installed the dev branch of cyCombine and tried running batch_correct with mode = "batch". However, I got the error:

Error in match.args(mode) : could not find function "match.args".

I wondered if you meant to use the function match.arg() instead of match.args(), or maybe I forgot to install some dependencies.
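For reference, match.arg() (singular) is the base R function for validating an argument against a set of choices; a minimal illustration of its usual use:

```r
f <- function(mode = c("online", "batch")) {
  mode <- match.arg(mode)  # matches the supplied value against the allowed choices
  mode
}
f("batch")
#> [1] "batch"
```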

shdam commented 3 months ago

Whoops, I was a bit quick in the implementation. That should be fixed now - thank you for pointing it out :)

Best regards, Søren