Closed NeuKon closed 2 years ago
Hi @NeuKon,
thanks for opening the issue and your suggestions. In your case you want to use site.pattern=FALSE
or even del.colgapsonly
from ape
, which does exactly what you want.
I added an extra example and hopefully improved the man pages in the development version. phyDat
usually stores identical columns of an alignment only once and keeps an index of the original positions. This saves memory and especially computations as these are usually need to be done only once for each site pattern.
I used subset
initially internally, where you often subset site patterns, and added [.phyDat
later on which has a more intuitive behaviour, like getting the first 100 columns of an alignment. I agree it would be good to make both interface agree, but I need to check if this will cause problems in my code and also in dependent packages, which can be quite a nightmare. So this will likely happen not before version 3.0.
Kind regards, Klaus
Hi,
Issue
subset.PhyDat
has the defaultsite.pattern=TRUE
whilst subsetting a phyDat object with[
has the defaultsite.pattern=FALSE
. (see https://github.com/KlausVigo/phangorn/blob/master/R/phyDat.R around line 536)Consequences
I create a bolean index by calculating the proportion of missing data for each site and apply a cut-off (see below) . Trimming an alignment using the
subset
and retains fewer sites than using[, index]
. I could not find a documentation of this behaviour and am not sure this inconsistency is wanted. The documentation ofsite.pattern
says "select site pattern or sites" so the meaning ofTRUE
orFALSE
is not immedieatly clear. Admittedly, there are examples that illustrate a difference between setting the argument toTRUE
orFALSE
.Calling
glance
on an object that was trimmed usingsubset
will determine a large number of "parsimony_informative_sites" (218 parsimony informative sites, 265 nchar, 261 unique sitte patterns). Calling glance on an object that was trimmed with the same criterion but using[
determines 0 parsimony informative sites, 261 nchar and 261 unique site patterns. Should the two objects not bear nearly the same, as they only differ in their dimensions by 4 sites? I don't understand how one object has many parsimony informative sites and the other has none.Suggestion
I suggest 1) changing the default behaviour of
[
to match that ofsubset
, and 2) adding a few words about the use of thesite.pattern
argument to the documentation as it makes subsetting phyDat data a bit counte-intuitive.Pseudo Example
# msa.raw created using read.phyDat from a FASTA alignment of amino acids # creates matrix of characters msa.char <- as.AAbin(msa.raw) |> as.character() # calculate proportion missing for each column/site missing.cols <- rep(NA, ncol(msa.char)) for (i in 1:ncol(msa.char)) { missing.cols[i] <- sum(msa.char[, i] == "-") / nrow(msa.char) } msa.select <- subset(msa.raw, select = missing.cols < 0.85) msa.select2 <- msa.raw[, missing.cols < 0.85] # not the same
Best wishes, Konrad
EDIT: typo, formatting and suggestion & pseudo-code
Hi Klaus,
Thank you very much for your reply, and for this amazing package. Now I understand how the difference between the square bracket notation and subset came to be.
ape::del.colgapsonly
was my first choice, but it only works for the class DNA.bin. Unfortunately, I am working with amino acid sequences.
Kind regards, Konrad
Hi @NeuKon
I give a workshop with Emmanuel at the moment. He introduced the amino acid models later to ape, so ape::del.colgapsonly
probably skipped through.
Kind regards,
Klaus
Hi,
Issue
subset.PhyDat
has the defaultsite.pattern=TRUE
whilst subsetting a phyDat object with[
has the defaultsite.pattern=FALSE
. (see https://github.com/KlausVigo/phangorn/blob/master/R/phyDat.R around line 536)Consequences
I create a bolean index by calculating the proportion of missing data for each site and apply a cut-off (see below) . Trimming an alignment using the
subset
and retains fewer sites than using[, index]
. I could not find a documentation of this behaviour and am not sure this inconsistency is wanted. The documentation ofsite.pattern
says "select site pattern or sites" so the meaning ofTRUE
orFALSE
is not immedieatly clear. Admittedly, there are examples that illustrate a difference between setting the argument toTRUE
orFALSE
.Calling
glance
on an object that was trimmed usingsubset
will determine a large number of "parsimony_informative_sites" (218 parsimony informative sites, 265 nchar, 261 unique sitte patterns). Calling glance on an object that was trimmed with the same criterion but using[
determines 0 parsimony informative sites, 261 nchar and 261 unique site patterns. Should the two objects not bear nearly the same, as they only differ in their dimensions by 4 sites? I don't understand how one object has many parsimony informative sites and the other has none.Suggestion
I suggest 1) changing the default behaviour of
[
to match that ofsubset
, and 2) adding a few words about the use of thesite.pattern
argument to the documentation as it makes subsetting phyDat data a bit counte-intuitive.Pseudo Example
Best wishes, Konrad
EDIT: typo, formatting and suggestion & pseudo-code