HelenaLC / CATALYST

Cytometry dATa anALYsis Tools

Clustering >15 million cells with CATALYST cluster #352

Closed dstueckm closed 1 year ago

dstueckm commented 1 year ago

Hello,

I have been attempting to analyze a large CyTOF dataset using CATALYST and ran into an issue during the "cluster" function step. The dataset consists of ~20 million cells, and I am clustering the data using 10 features. When I downsample to 10 million cells, I experience no issues, but when I attempt to cluster >15 million cells I encounter the following error:

integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'

Any advice would be appreciated,
Daniel

HelenaLC commented 1 year ago

After a quick Google search, it looks to me like this is not CATALYST-related but a more general limitation (see e.g. here). Reposting from Hervé:

Long Vector derivatives are not supported at the moment. More precisely: the length of any Vector derivative must be <= .Machine$integer.max (i.e. <= 2^31 - 1). This includes CharacterList, IntegerList, IRanges, GRanges, GPos, GRangesList, DNAString, DNAStringSet, SummarizedExperiment, and many more Vector derivatives.

So, it might be worth doing a traceback() to see exactly which function causes the overflow (maybe something in FlowSOM?).
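For reference, the message itself can be reproduced in plain R without CATALYST or FlowSOM. The following is only a minimal sketch of the underlying mechanism (32-bit integer arithmetic in cumsum()), not a reproduction of whatever happens inside the clustering call; the variable names are made up for illustration.

```r
# R integers are 32-bit, so a running total cannot exceed .Machine$integer.max (2^31 - 1).
x <- c(.Machine$integer.max, 1L)

# cumsum() on an integer vector signals a condition once the total overflows;
# capture its message instead of letting it print.
overflow_msg <- tryCatch(
  {
    cumsum(x)
    NA_character_
  },
  warning = function(w) conditionMessage(w)
)

# The overflowing element comes back as NA when the warning is merely suppressed.
overflowed <- suppressWarnings(cumsum(x))

# Converting to double first, as the message suggests, avoids the overflow.
safe <- cumsum(as.numeric(x))

print(overflow_msg)
print(overflowed)
print(safe)
```

In base R this condition is raised as a warning rather than an error, so if traceback() shows nothing useful, setting options(warn = 2) before re-running turns warnings into errors and should make the offending call visible in the stack.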

Otherwise, I unfortunately cannot help resolve this, as it's a general R thing, I guess... Two possible workarounds: