emmanuelparadis / ape

analysis of phylogenetics and evolution
http://ape-package.ird.fr/
GNU General Public License v2.0

Tree too large for cophenetic #92

Closed osidaksjdmw closed 1 year ago

osidaksjdmw commented 1 year ago

Hi,

I am facing a problem with a phylogenetic tree that I have. The tree has 26,980 tips and 26,978 internal nodes. I suspect it might be too large for my laptop, which has 16 GB of RAM. Whenever I run cophenetic(tree) in R, the program crashes. I have also attempted to run it as a job on my company's HPC cluster, but it fails with the following error message:

*** caught segfault *** address 0x2aa91fc53950, cause 'memory not mapped'

I noticed that in issue #64, you suggested computing the distances on a pairwise basis. However, I'm not sure how to apply this approach to the entire tree. Could you please explain how I can address either of these issues?

Thanks in advance!

emmanuelparadis commented 1 year ago

Hi,

I suggest you install ape from the present GitHub repository: it has improved internal C code, and I removed the check with the error message "tree too big", so the limit is now set by the system when allocating memory for the matrix. The error message you obtained from the HPC cluster suggests there is a bug in the old code (still in the current CRAN version of ape).
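For reference, one common way to install a package from GitHub is via the remotes package (a sketch only; the thread does not say which method was used, and building ape from source also requires a working C toolchain):

```r
# Sketch: installing the development version of ape from GitHub.
# Assumes the 'remotes' package and a C compiler are available.
install.packages("remotes")                     # if not already installed
remotes::install_github("emmanuelparadis/ape")  # build from the GitHub sources
packageVersion("ape")                           # confirm the installed version
```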

You can estimate (approximately) the quantity of memory needed to store the intermediate matrix computed by dist.nodes() with (in GB):

R> n <- 26980
R> m <- 26978
R> 8 * (n + m)^2 / 1e9
[1] 23.29173

I suggest you multiply this number by 4 to evaluate the minimum free RAM required to complete your computations. Once the intermediate matrix is built, the final matrix will be smaller:

R> 8 * n^2 / 1e9
[1] 5.823363

You can release the above 23 GB by calling gc() (if needed).
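The estimates above can be wrapped in a small helper function (a sketch; est_ram_gb is a hypothetical name, not part of ape):

```r
# Hypothetical helper (not part of ape): rough RAM estimates, in GB, for
# cophenetic() on a tree with n tips and m internal nodes. dist.nodes()
# internally stores an (n + m) x (n + m) matrix of 8-byte doubles.
est_ram_gb <- function(n, m = n - 2, safety = 4) {
  intermediate <- 8 * (n + m)^2 / 1e9   # matrix built by dist.nodes()
  final <- 8 * n^2 / 1e9                # tip-to-tip matrix returned
  c(intermediate = intermediate,
    final = final,
    suggested_free_ram = safety * intermediate)
}

est_ram_gb(26980, 26978)
# intermediate ~ 23.29 GB, final ~ 5.82 GB, suggested ~ 93.2 GB
```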

I don't think you want to try the approach sketched in #64 with your full tree, as it might take a very long time.
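For completeness, the pairwise idea looks roughly like this (an illustrative sketch only, not the code from #64, and as noted above it would be slow on a full 26,980-tip tree): compute the patristic distance between two tips by summing edge lengths along the path between them, without building the full matrix.

```r
library(ape)

# Sketch (hypothetical helper): patristic distance between tips i and j,
# computed from the node path rather than the full cophenetic matrix.
pair_dist <- function(tree, i, j) {
  p <- nodepath(tree, i, j)           # node numbers along the path i -> j
  key <- paste(tree$edge[, 1], tree$edge[, 2])
  a <- p[-length(p)]; b <- p[-1]      # consecutive node pairs on the path
  # an edge may be traversed parent -> child or child -> parent
  idx <- match(paste(a, b), key)
  idx[is.na(idx)] <- match(paste(b, a), key)[is.na(idx)]
  sum(tree$edge.length[idx])
}

tr <- read.tree(text = "((A:1,B:2):1,C:3);")
pair_dist(tr, 1, 2)   # 3, matching cophenetic(tr)["A", "B"]
```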

Cheers,

osidaksjdmw commented 1 year ago

Thanks for the advice. I just checked, and I can confirm that I am using ape installed from the GitHub repository; sessionInfo() reports ape_5.7-1.1.

Here is the code I'm trying to run:

library(ape)
sessionInfo()

tree <- read.tree('bw_tree.nwk')
asv <- as.matrix(read.csv('bw.csv', row.names = 1))

x <- cophenetic(tree)
gc()

results <- picante::ses.mpd(samp = asv, dis = x, null.model = 'taxa.labels',
                            runs = 999, iterations = 1000)
results$NRI <- -results$mpd.obs.z
write.csv(results, 'BW_NRI.csv')

and the longer error message (if it helps):

*** caught segfault ***
address 0x2aaa24047950, cause 'memory not mapped'

Traceback:
 1: dist.nodes(x)
 2: cophenetic.phylo(tree)
 3: cophenetic(tree)
An irrecoverable exception occurred. R is aborting now ...

and some of the lines from my log file:

Exit Status: 139
NCPUs Requested: 64 
NCPUs Used: 64
Memory Requested: None
Memory Used: 82700kb
Vmem Used: 8921868kb
CPU Time Used: 00:00:18 
emmanuelparadis commented 1 year ago

The issue is that your tree is big enough that the matrix created by dist.nodes() has more than 2.1 billion elements (the exact threshold is given by .Machine$integer.max). This situation requires adapting the C code (in particular, indices into the matrix have to be long integers). I've just pushed a version which hopefully fixes this problem (Version: 5.7-1.2).
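The overflow can be seen from plain R, mirroring what happens with 32-bit indices at the C level (a sketch using the tip and node counts from this tree):

```r
# The dist.nodes() matrix for this tree has (n + m)^2 elements, which
# exceeds the 32-bit integer limit that the old C indexing code relied on.
n <- 26980L + 26978L     # tips + internal nodes = 53958
.Machine$integer.max     # 2147483647, the 32-bit signed integer limit
n * n                    # 32-bit integer overflow in R: NA, with a warning
as.numeric(n) * n        # 2911465764 elements, well past the limit
```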

Besides, I see that your script then calls picante, which will permute the rows and columns of this big matrix, therefore creating a copy (inside the function). I suggest you request at least 100 GB for this job. As for the number of CPUs, I don't think it matters here: it seems better to use a single CPU with lots of RAM.

Cheers,

osidaksjdmw commented 1 year ago

I'm having problems installing the latest version of ape; a screenshot of the error is attached.

I've also tried installing it on my own computer, but that doesn't work either.

emmanuelparadis commented 1 year ago

I've corrected that. You can try again.

osidaksjdmw commented 1 year ago

Works like a charm now, thanks!