KlausVigo / phangorn

Phylogenetic analysis in R
http://klausvigo.github.io/phangorn/

Problem when plotting network out of big data set #15

Open jorgeamaya opened 7 years ago

jorgeamaya commented 7 years ago

I have a fasta file with 844 sequences of length 14844 bp. My attempts to plot a simple network have been unsuccessful due to the following error.

library(phangorn)
alignment <- read.phyDat(file="myalignment.fasta",format="fasta",type="DNA")
dm <- dist.hamming(alignment)
nnet <- neighborNet(dm)
Error in numeric(max(p)) : vector size cannot be NA/NaN
In addition: Warning message:
In splits2design(x) :
  integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'

I have tried on both Windows and Linux with an up-to-date version of R. Is there a known explanation for this error? Is my data set too big?
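For reference, the warning in the trace is R's generic integer-overflow behaviour and can be reproduced without phangorn. The sketch below uses made-up counts, not the actual values inside `splits2design()`. It only illustrates how a 32-bit integer `cumsum` turns into `NA` and how an `NA` length then makes vector allocation fail.

```r
# Illustration only: these counts are invented, not taken from phangorn.
counts <- c(.Machine$integer.max, 1L)

# Integer cumsum overflows past .Machine$integer.max and yields NA
# (R also emits the "integer overflow in 'cumsum'" warning here).
overflowed <- suppressWarnings(cumsum(counts))
print(overflowed)

# Converting to double first, as the warning suggests, avoids the overflow.
safe <- cumsum(as.numeric(counts))
print(safe)

# An NA maximum used as a vector length makes numeric() fail.
msg <- tryCatch(numeric(max(overflowed)), error = function(e) conditionMessage(e))
print(msg)
```

This suggests the error is a consequence of the data set size rather than of the operating system or R version.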

On the other hand, are there plans to include a function that calculates Median Joining Networks as described in Bandelt et al. (1999)?

KlausVigo commented 7 years ago

Hi @jorgeamaya, so far neighborNet does not work for large datasets; you have to use SplitsTree for now. I am currently rewriting some of the functions to make neighborNet work for larger networks. I will let you know when I have something working. I have not yet looked into Median Joining Networks; maybe I will give it a try if it seems easy to implement. Cheers, Klaus

ErnestoHuicochea commented 6 years ago

Hello, I'm having the same problem as jorgeamaya: my nexus file has 1016 taxa by 324 bp.

Fortunately, I found this post, and I wonder whether a fix for big datasets already exists. If not, how many taxa can the neighborNet() function handle?

Thanks in advance.

taprs commented 1 year ago

Bump! This would be super cool.

KlausVigo commented 1 year ago

Hi all, so far the neighborNet algorithm is a naive O(n⁴) implementation. This probably explains why it works up to around 100 tips and runs out of memory soon after. I am working over the summer to get the memory consumption down to O(n³) or a bit below, so that it scales up to 1000 taxa.
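To get a feel for what O(n⁴) means at the data set sizes mentioned in this thread, here is a rough back-of-the-envelope sketch. It assumes a dense design matrix with one row per taxon pair and one column per split of a circular ordering, each numbering n(n-1)/2. That assumption is for illustration only; `design_cells()` is a hypothetical helper, not a phangorn function, and the actual implementation may store things differently.

```r
# Hypothetical helper (not part of phangorn): number of cells in a dense
# pairs-by-splits design matrix for n taxa.
design_cells <- function(n) {
  n <- as.numeric(n)            # doubles avoid integer overflow
  pairs  <- n * (n - 1) / 2     # number of pairwise distances
  splits <- n * (n - 1) / 2     # maximum number of splits in a circular ordering
  pairs * splits
}

n_taxa <- c(100, 844, 1016)     # sizes mentioned in this thread
cells <- design_cells(n_taxa)

print(data.frame(
  taxa = n_taxa,
  cells = cells,
  gib_as_double = cells * 8 / 2^30,               # memory if stored as doubles
  exceeds_int_max = cells > .Machine$integer.max  # beyond 32-bit integer range
))
```

Under this assumption, 100 taxa stay within 32-bit integer range, while 844 and 1016 taxa exceed it by a wide margin, which would be consistent with the overflow warning reported above.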