daijiang / phyr

Functions for phylogenetic analyses
https://daijiang.github.io/phyr/
GNU General Public License v3.0

Memory efficient version of phylogenetic diversity metrics #4

Open rdinnager opened 6 years ago

rdinnager commented 6 years ago

It would be great to have versions of PSV/PSE/PSC/etc. that can handle really big phylogenies. In the past I have tried to calculate PSV on a phylogeny with several hundred thousand tips, but R gives 'cannot allocate vector of size 150 GB', or some other ridiculously large value (presumably because it is trying to allocate a huge phylogenetic covariance matrix). Data like this are not so unusual anymore, with large metagenomics datasets, so I think a memory-efficient version would be really useful. I was thinking it could be done using the bigmemory and bigalgebra packages?

daijiang commented 6 years ago

Thanks @rdinnager for the issue!

I will update PSV later with C++; hopefully C++ will manage memory better. After that, I will test it with a large phylogeny and see what we need to handle such large trees.

rdinnager commented 6 years ago

Okay, that sounds like a good plan.

daijiang commented 6 years ago

Hi @rdinnager, I updated psv with C++. It is now faster than picante::psv. But I am not sure whether it can handle several hundred thousand tips (probably not). The main bottleneck is the memory needed to store the species-by-species phylogenetic var-cov matrix for so many tips...
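One possible way around the matrix bottleneck, sketched here under stated assumptions and not taken from phyr or picante: for an ultrametric tree, the two quantities PSV needs (the trace and the grand sum of the var-cov matrix) can be accumulated branch by branch, because a branch of length l with t community tips below it contributes l * t to the trace and l * t^2 to the sum. The function name `psv_from_tree` and the parent-array tree layout are hypothetical choices made for illustration. Memory use is linear in the number of nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of phyr or picante.
// Assumed tree layout: node 0 is the root, parent[i] < i for every other
// node, and length[i] is the branch joining node i to parent[i]
// (length[0] is ignored). present[i] is 1 for tips that occur in the
// community and 0 otherwise (including all internal nodes).
// Assumes an ultrametric tree, so correlation = covariance / tree height.
double psv_from_tree(const std::vector<int>& parent,
                     const std::vector<double>& length,
                     const std::vector<int>& present) {
    const std::size_t n_nodes = parent.size();
    assert(length.size() == n_nodes && present.size() == n_nodes);

    // below[i] = number of present tips in the subtree rooted at node i.
    std::vector<double> below(present.begin(), present.end());
    double trace = 0.0;  // sum of root-to-tip distances of present tips
    double total = 0.0;  // sum of all var-cov entries among present tips

    // Children have larger indices than their parents, so a reverse pass
    // completes each subtree count before adding it to the parent.
    for (std::size_t i = n_nodes; i-- > 1;) {
        assert(parent[i] >= 0 && static_cast<std::size_t>(parent[i]) < i);
        trace += length[i] * below[i];
        total += length[i] * below[i] * below[i];
        below[parent[i]] += below[i];
    }

    const double n = below.empty() ? 0.0 : below[0];  // species in community
    assert(n >= 2.0 && trace > 0.0);
    // PSV = (n * tr(C) - sum(C)) / (n * (n - 1)) with C = V / height,
    // where height = trace / n for an ultrametric tree.
    return (n * trace - total) / (trace * (n - 1.0));
}
```

For the three-tip tree ((A:1,B:1):1,C:2), stored as `parent = {-1, 0, 1, 1, 0}` and `length = {0, 1, 1, 1, 2}`, this gives 5/6 with all tips present, which matches the matrix formula applied to the 3 by 3 correlation matrix. Non-ultrametric trees would need a different approach, since the correlation scaling no longer reduces to a single height.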

lucasnell commented 6 years ago

Hey @daijiang @rdinnager, I'd recommend bigmemory since it's pretty simple to interface with using Rcpp (see here) and because it allows you to store matrices on disk. The latter is pretty important because even a direct C++ implementation with no copying of such large matrices will deplete RAM on most computers. I've played around with it, and it seemed pretty intuitive.

daijiang commented 6 years ago

Thanks @lucasnell. I will take a look at it later. Currently, the C++ version can handle a 20k by 20k matrix on my laptop. That is probably enough for most ecological studies. bigmemory is definitely useful beyond that size.
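To put the sizes in this thread in perspective, here is a rough back-of-the-envelope helper. It assumes a dense square matrix of 8-byte doubles (how R stores numeric matrices) and ignores any temporary copies; the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for a dense n x n matrix of 8-byte doubles.
std::uint64_t dense_matrix_bytes(std::uint64_t n_tips) {
    // Guard against 64-bit overflow: (2^30)^2 * 8 = 2^63 still fits.
    assert(n_tips < (1ULL << 30));
    return n_tips * n_tips * 8u;
}

// The same size expressed in GiB (1024^3 bytes).
double dense_matrix_gib(std::uint64_t n_tips) {
    return static_cast<double>(dense_matrix_bytes(n_tips)) /
           (1024.0 * 1024.0 * 1024.0);
}
```

By this estimate a 20k by 20k matrix takes 3.2e9 bytes, a little under 3 GiB, while 300,000 tips would need roughly 670 GiB. An allocation of about 150 GB corresponds to something like 140,000 tips.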