alexeckert / parallelDist

R Package: Parallel Distance Matrix Computation using Multiple Threads
GNU General Public License v2.0

Any ability to convert dist object to a matrix for calculations? #4

Open kbannarm opened 5 years ago

kbannarm commented 5 years ago

The package produced a very large dist object of length 1479489606, which is fantastic from a computing perspective. However, because it is not a matrix (and as.matrix doesn't work on something this big), I am having issues doing basic matrix math. Any ideas for how to work around this? Thanks!

alexeckert commented 5 years ago

Hi @kbannarm,

this is a general problem when working with distance matrices; the dist object only stores a lower/upper triangular matrix for exactly that reason. If we assume that each value of the dist object takes 8 bytes, we end up with approximately 11.8 GB. If you call as.matrix, R tries to create a new object that is roughly twice that size (the full n x n matrix), so you would need approx. 36 GB of RAM (or at least a large enough swap partition), since the old dist object still exists while the new matrix is built.
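A back-of-the-envelope sketch of this arithmetic in R (a rough illustration only; `len` is the dist length reported above):

```r
# Approximate memory footprint of the dist object and of the full matrix.
len <- 1479489606                 # length(d) reported above
dist_gb <- len * 8 / 1e9          # ~11.8 GB for the triangular vector of doubles

# The number of observations n solves n * (n - 1) / 2 = len
n <- (1 + sqrt(1 + 8 * len)) / 2  # ~54397 observations
full_gb <- n^2 * 8 / 1e9          # ~23.7 GB for the full n x n matrix

# While as.matrix() runs, the dist object and the new matrix coexist in RAM:
dist_gb + full_gb                 # ~35.5 GB peak
```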

First, I would save the dist object to disk to avoid recomputation. If your machine doesn't have enough RAM to work with a matrix which takes up ~24 GB, you probably need to work with the dist object itself, which essentially is a vector of the triangular matrix plus some metadata. Converting the dist object to a matrix object might not be straightforward if you have to do it in place. Another alternative is the bigmemory package(s), where memory-mapped files can be used to swap matrix data to disk (needed for large problems if your RAM is too small).
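A minimal sketch of that approach, assuming `d` is the dist object returned by parDist (object and file names are illustrative): persist it with saveRDS and index the triangular vector directly instead of converting it. The index formula below is the one documented in ?dist for the lower triangle stored by columns.

```r
# saveRDS(d, "pardist_result.rds")   # persist once; reload later with readRDS()

# Distance between observations i and j, read straight from the dist vector
# without ever building the full matrix (formula from ?dist, for i < j <= n).
dist_ij <- function(d, i, j) {
  n <- attr(d, "Size")
  if (i == j) return(0)
  if (i > j) { tmp <- i; i <- j; j <- tmp }   # the matrix is symmetric
  d[n * (i - 1) - i * (i - 1) / 2 + j - i]
}

# Example: all distances from observation i, i.e. one "row" of the matrix.
dist_row <- function(d, i) {
  vapply(seq_len(attr(d, "Size")), function(j) dist_ij(d, i, j), numeric(1))
}
```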

Possible additions to parallelDist could be an option to return a matrix as an alternative return type, or an option to write results directly to disk (maybe with some support for bigmemory).
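If bigmemory were used (whether built into the package or done by hand), one possible shape is a file-backed matrix filled from the dist vector, so the full n x n matrix lives on disk rather than in RAM. The sketch below is only an illustration: it reuses the hypothetical dist_ij() helper from above, the file names are made up, and the fill loop is deliberately simple (and slow).

```r
library(bigmemory)

d <- readRDS("pardist_result.rds")   # dist object saved earlier
n <- attr(d, "Size")

# Memory-mapped n x n matrix backed by a file on disk.
bm <- filebacked.big.matrix(n, n, type = "double",
                            backingfile = "distmat.bin",
                            descriptorfile = "distmat.desc")

# Fill column by column from the triangular dist vector.
for (j in seq_len(n)) {
  bm[, j] <- vapply(seq_len(n), function(i) dist_ij(d, i, j), numeric(1))
}

# Later sessions can reattach the file-backed matrix without recomputation:
# bm <- attach.big.matrix("distmat.desc")
```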

JimTD commented 4 years ago

Hello @kbannarm and @alexeckert, I am not sure if this thread is still open, but I am having a similar issue with a 6.9 GB dist object produced by parDist. My hope was to save this object to disk, but it seems that R cannot allocate a vector of this size, either when passing it to a successive function (hclust) or when saving it to disk.

Given that you have already dealt with this question, can you please provide some guidance on how to remove the very large dist object from RAM while still being able to save (or use) the information?

Warm regards