Closed hechth closed 2 years ago
@hechth - this is an issue I never really tried to tackle, but would love to have a solution for. I *was* hitting issues for some time with `.Machine$integer.max`: since `sqrt(.Machine$integer.max)` is roughly 46341, a square matrix for a feature set larger than that would be problematic - I was assuming this was the issue rather than `ff` specifically. I would certainly be open to any fix you might suggest. While in the past this wasn't terribly limiting for me, with instrument developments toward increased sensitivity, selectivity, dynamic range, and resolution, I can imagine this is going to become quite limiting.
Thanks for helping to tackle this!
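For concreteness, the arithmetic behind that ceiling can be checked directly in R (a quick sketch, not specific to RAMClustR):

```r
## Largest n for which an n x n matrix still has <= .Machine$integer.max entries
limit <- .Machine$integer.max       # 2147483647 on standard R builds
n_max <- floor(sqrt(limit))         # 46340

n_max^2 <= limit                    # TRUE:  46340^2 = 2147395600
(n_max + 1)^2 <= limit              # FALSE: 46341^2 = 2147488281
```

So the last safe size is 46340 features; at 46341 the square matrix already exceeds the integer indexing range.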
Thanks for the quick response! The command listed above,

`ffmat <- ff::ff(vmode = "double", dim = c(n, n), initdata = 0) ## reset to 1 if necessary`

is called with `n` being the number of features. The `ff` function is limited to `.Machine$integer.max` entries, which, as you say, becomes problematic with more than 46k features (seems like my estimation skills above aren't that great, so 46k instead of 55k).
As far as I can see, the matrix is used to store the correlations between the features, which form a symmetric matrix - you already compute only the upper triangle, I think, and use a block-wise procedure for efficiency. I will have to take a more detailed look, but I think allocating the large matrix can eventually be circumvented, since I don't think the `ff` package will be fixed.
We will need some time to implement tests to ensure that the program still behaves the same, but we can come up with a fix afterwards. We can post it as a PR to this main repo to make those developments accessible to everyone, and we can discuss implementation details etc. in the PR to find a solution that works for everyone :)
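To illustrate what "never storing the full matrix" could look like: a symmetric matrix with an implicit diagonal only needs its n*(n-1)/2 triangle values, which can live in a flat vector - this is the same layout `stats::dist` uses. A sketch with a hypothetical indexing helper (`condensed_index` is just an illustration, not anything in RAMClustR):

```r
## Condensed storage for a symmetric n x n matrix (diagonal implicit):
## keep only the n*(n-1)/2 lower-triangle values in a flat vector,
## using the same column-wise ordering as stats::dist.

## Map (i, j), i != j, to a 1-based position in the condensed vector.
condensed_index <- function(i, j, n) {
  lo <- pmin(i, j); hi <- pmax(i, j)
  n * (lo - 1) - lo * (lo - 1) / 2 + hi - lo
}

n <- 5
full <- matrix(0, n, n)
full[lower.tri(full)] <- seq_len(n * (n - 1) / 2)
full <- full + t(full)                    # make it symmetric

condensed <- full[lower.tri(full)]        # flat storage, column-wise

## Look-ups agree with the full matrix:
condensed[condensed_index(4, 2, n)] == full[4, 2]   # TRUE
```

Since n*(n-1)/2 for n = 65536 still fits under `.Machine$integer.max`, condensed storage alone would roughly push the single-object ceiling from ~46k to ~65k features; beyond that, a block-wise or on-disk layout would still be needed.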
I believe the issue occurs when coming out of `ff` and into the distance-matrix format necessary for hierarchical clustering. If I recall, the distance-matrix input forced a square matrix. My memory, however, is fallible, and I could be misremembering this. I think that I explored sparse-form distance matrices and didn't find a solution. At one point, @meowcat forked RAMClustR to try implementing a more memory-efficient approach. If I recall, https://github.com/meowcat/fastliclust was used instead of the native fastcluster algorithm. I can't remember where this ended, though.
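On the square-matrix question: at least `stats::hclust` (and, as far as I can tell, `fastcluster::hclust`, which mirrors its interface) consumes the condensed `dist` representation of length n*(n-1)/2 rather than a full square matrix, so the n x n form may never be strictly required. A minimal sketch on toy data:

```r
## hclust takes a condensed "dist" object, not a square matrix,
## so the full n x n form is never materialized here.
set.seed(1)
x <- matrix(rnorm(20), nrow = 10)    # 10 toy "features"

d <- dist(x)                         # length 10*9/2 = 45, class "dist"
length(d)                            # 45

hc <- hclust(d, method = "average")  # clustering straight from condensed form

## A plain vector can be wrapped the same way, square matrix never built:
d2 <- structure(as.vector(d), Size = 10L, class = "dist",
                Diag = FALSE, Upper = FALSE, method = "euclidean")
identical(hclust(d2, method = "average")$merge, hc$merge)   # TRUE
```

Whether RAMClustR's block-wise correlation step can be made to fill such a vector directly (instead of the `ff` matrix) is exactly the part that would need the detailed look mentioned above.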
@cbroeckl I used a debugger in VS Code and traced the failure in my case back to the `ff` function - it could also be that more problems will come up afterwards. Thank you very much for the hints; maybe @meowcat has some more?
When supplying a feature table with more than 55k entries, RAMClustR fails due to this issue in the `ff` package (ref). I doubt this issue in `ff` will be fixed.
Since this allocated matrix is symmetric (I assume, at least), and only the upper triangle is computed anyway, I think this computation could be optimized so that the actual full matrix never has to be stored in memory.
@cbroeckl if you are currently busy and don't have the time to address this issue, I'd be happy to support, and we will come up with an implementation to solve it.
https://github.com/cbroeckl/RAMClustR/blob/351243d9bb98da7ae684580c8e0fe2f911482190/R/ramclustR.R#L667