emmanuelparadis / pegas

Population and Evolutionary Genetics Analysis System
GNU General Public License v2.0
28 stars 10 forks source link

rmst on a list of distance matrices #85

Closed naurasd closed 8 months ago

naurasd commented 8 months ago

Hi,

I am trying to create a list of random minimum spanning trees for a list of distance matrices using the rmst() function.

So I created a list of DNAbin objects called seqs_list. Then I extract the haplotypes for all the DNAbin objects using the haplotype() function and create a list of distance matrices called d for all the haplotypes objects. I then want to continue to apply the rmst() function to the list of distance matrices, but I fail with the error shown below:

h<-lapply(seqs_list, function(i) haplotype(i,labels=labels(i)))
d <- lapply(h, function(i) dist.dna(i, "N"))
nt <- lapply(d, function(i) rmst(i, quiet = TRUE))

Error in apply(mat, 1, sort) : dim(X) must have a positive length

I tried to use a for loop just to try an alternative, but it gave the same error:

nt<-list()
for(i in 1:length(d)) { 
  nt[[i]]<-rmst(d[[i]], quiet = TRUE)
}

Error in apply(mat, 1, sort) : dim(X) must have a positive length

I know that this is not a pegas issue, but I am trying to understand what the issue is. Because when I access one specific distance matrix from d, then this works, so my distance matrices in d are of the correct format:

 nt<-rmst(d[[1]])

Iteration: 25   Number of links: 134

Do you have an idea why lapply doesn't work with rmst() but it works with haplotype() and dist.dna?

Thank you Nauras

emmanuelparadis commented 8 months ago

Hi @naurasd,

Here's a script simulating your analyses:

library(pegas)
data(woodmouse)

N <- 4L # number of data sets
DATA <- replicate(N, woodmouse[sample.int(15), ], FALSE)

H <- lapply(DATA, haplotype)
D <- lapply(DATA, dist.dna, model = "N")
R <- lapply(D, rmst, quiet = TRUE)

layout(matrix(1:N, 2, 2))
lapply(R, plot)

It works fine for me. Did you upgrade pegas? Version 1.3 was released recently. Cheers, Emmanuel

naurasd commented 8 months ago

Hi @emmanuelparadis

I appreciate the quick reply, as always.

After checking again and again, I found the issue. When I am running rmst, it actually works but fails at a certain point:

> nt <- lapply(d, rmst)
Iteration: 29   Number of links: 135
Iteration: 5   Number of links: 13
Less than 6 observations: doing complete enumeration.
Iteration: 6
Iteration: 4   Number of links: 5
Iteration: 6   Number of links: 16
Iteration: 6   Number of links: 9
Iteration: 5   Number of links: 6
Less than 6 observations: doing complete enumeration.
Iteration: 120
Iteration: 7   Number of links: 11
Less than 6 observations: doing complete enumeration.
Iteration: 120
Less than 6 observations: doing complete enumeration.
Iteration: 6
Iteration: 4   Number of links: 7
Less than 6 observations: doing complete enumeration.
Iteration: 6
Iteration: 6   Number of links: 16
Less than 6 observations: doing complete enumeration.
Iteration: 120
Iteration: 6   Number of links: 14
Iteration: 7   Number of links: 10
Iteration: 5   Number of links: 6
Less than 6 observations: doing complete enumeration.
Iteration: 6
Less than 6 observations: doing complete enumeration.
Iteration: 24
Iteration: 4   Number of links: 5
Iteration: 7   Number of links: 8
Less than 6 observations: doing complete enumeration.
Iteration: 1Error in apply(mat, 1, sort) : dim(X) must have a positive length

So rmst is running properly, but it throws an error at the first instance where I only have two haplotypes, as is the case for MOTU1154:

>str(d)
List of 143
...
$ MOTU1143: 'dist' num [1:28] 3 4 3 4 5 15 14 1 2 1 ...
  ..- attr(*, "Size")= int 8
  ..- attr(*, "Labels")= chr [1:8] "I" "II" "III" "IV" ...
  ..- attr(*, "Upper")= logi FALSE
  ..- attr(*, "Diag")= logi FALSE
  ..- attr(*, "call")= language FUN(x = X[[i]], model = "N")
  ..- attr(*, "method")= chr "N"
 $ MOTU1154: 'dist' num 12
  ..- attr(*, "Size")= int 2
  ..- attr(*, "Labels")= chr [1:2] "I" "II"
  ..- attr(*, "Upper")= logi FALSE
  ..- attr(*, "Diag")= logi FALSE
  ..- attr(*, "call")= language FUN(x = X[[i]], model = "N")
  ..- attr(*, "method")= chr "N"

As you can see, MOTU1154 only has two haplotypes, and therefore the dimensions get messed up. This is where rmst fails. Does this mean I cannot handle scenarios where I only have two haplotypes with pegas?

Second small question: You are running dist.dna on DATA, not on the haplotypes in H, which is how it is detailed in your documentation. Doesn't seem to be a problem? What's the difference, other than labels beeing different during plotting?

Merci Nauras

emmanuelparadis commented 8 months ago

Great that you were able to diagnose this! pegas can handle data with 2 haplotypes: mst() does it and I'll fix rmst() to return the MST when n=2.

In the meantime, you can add something like if (length(d) == 1) THEN call mst() ELSE call rmst(). Maybe also add a check if n=1 (if needed) since mst() would throw an error.

Second small question: You are running dist.dna on DATA, not on the haplotypes in H, which is how it is detailed in your documentation. Doesn't seem to be a problem?

My wrong: I meant D <- lapply(H, dist.dna, model = "N")

What's the difference, other than labels beeing different during plotting?

None with these data because each sequence of the woodmouse dataset defines (unambiguously) a distinct haplotype. Cheers, Emmanuel

naurasd commented 8 months ago

Thanks, appreciate the hard work. Hope these small issues will help making the package more efficient!

let me know once the fix has been made, will download the new version then!

emmanuelparadis commented 8 months ago

It's fixed and pushed here.

naurasd commented 8 months ago

You are the best, Merci!