Inconsistent behaviour between subset.phyDat and "[.phyDat" with consequences for glance

NeuKon commented 2 years ago

Hi,

Issue

subset.PhyDat has the default site.pattern=TRUE whilst subsetting a phyDat object with [ has the default site.pattern=FALSE. (see https://github.com/KlausVigo/phangorn/blob/master/R/phyDat.R around line 536)

Consequences

I create a bolean index by calculating the proportion of missing data for each site and apply a cut-off (see below) . Trimming an alignment using the subset and retains fewer sites than using [, index]. I could not find a documentation of this behaviour and am not sure this inconsistency is wanted. The documentation of site.pattern says "select site pattern or sites" so the meaning of TRUE or FALSE is not immedieatly clear. Admittedly, there are examples that illustrate a difference between setting the argument to TRUE or FALSE.

Calling glance on an object that was trimmed using subset will determine a large number of "parsimony_informative_sites" (218 parsimony informative sites, 265 nchar, 261 unique sitte patterns). Calling glance on an object that was trimmed with the same criterion but using [ determines 0 parsimony informative sites, 261 nchar and 261 unique site patterns. Should the two objects not bear nearly the same, as they only differ in their dimensions by 4 sites? I don't understand how one object has many parsimony informative sites and the other has none.

Suggestion

I suggest 1) changing the default behaviour of [ to match that of subset, and 2) adding a few words about the use of the site.pattern argument to the documentation as it makes subsetting phyDat data a bit counte-intuitive.

Pseudo Example

# msa.raw created using read.phyDat from a FASTA alignment of amino acids
# creates matrix of characters
msa.char <- as.AAbin(msa.raw) |> as.character()

# calculate proportion missing for each column/site
missing.cols <- rep(NA, ncol(msa.char))
for (i in 1:ncol(msa.char)) {
  missing.cols[i] <- sum(msa.char[, i] == "-") / nrow(msa.char)
}

msa.select <- subset(msa.raw, select = missing.cols < 0.85)
msa.select2 <- msa.raw[, missing.cols < 0.85] # not the same

Best wishes, Konrad

EDIT: typo, formatting and suggestion & pseudo-code

KlausVigo commented 2 years ago

Hi @NeuKon,

thanks for opening the issue and your suggestions. In your case you want to use site.pattern=FALSE or even del.colgapsonly from ape, which does exactly what you want. I added an extra example and hopefully improved the man pages in the development version. phyDat usually stores identical columns of an alignment only once and keeps an index of the original positions. This saves memory and especially computations as these are usually need to be done only once for each site pattern.
I used subset initially internally, where you often subset site patterns, and added [.phyDat later on which has a more intuitive behaviour, like getting the first 100 columns of an alignment. I agree it would be good to make both interface agree, but I need to check if this will cause problems in my code and also in dependent packages, which can be quite a nightmare. So this will likely happen not before version 3.0.

Kind regards, Klaus

Hi,

Issue

subset.PhyDat has the default site.pattern=TRUE whilst subsetting a phyDat object with [ has the default site.pattern=FALSE. (see https://github.com/KlausVigo/phangorn/blob/master/R/phyDat.R around line 536)

Consequences

I create a bolean index by calculating the proportion of missing data for each site and apply a cut-off (see below) . Trimming an alignment using the subset and retains fewer sites than using [, index]. I could not find a documentation of this behaviour and am not sure this inconsistency is wanted. The documentation of site.pattern says "select site pattern or sites" so the meaning of TRUE or FALSE is not immedieatly clear. Admittedly, there are examples that illustrate a difference between setting the argument to TRUE or FALSE.

Calling glance on an object that was trimmed using subset will determine a large number of "parsimony_informative_sites" (218 parsimony informative sites, 265 nchar, 261 unique sitte patterns). Calling glance on an object that was trimmed with the same criterion but using [ determines 0 parsimony informative sites, 261 nchar and 261 unique site patterns. Should the two objects not bear nearly the same, as they only differ in their dimensions by 4 sites? I don't understand how one object has many parsimony informative sites and the other has none.

Suggestion

I suggest 1) changing the default behaviour of [ to match that of subset, and 2) adding a few words about the use of the site.pattern argument to the documentation as it makes subsetting phyDat data a bit counte-intuitive.

Pseudo Example
# msa.raw created using read.phyDat from a FASTA alignment of amino acids
# creates matrix of characters
msa.char <- as.AAbin(msa.raw) |> as.character()

# calculate proportion missing for each column/site
missing.cols <- rep(NA, ncol(msa.char))
for (i in 1:ncol(msa.char)) {
  missing.cols[i] <- sum(msa.char[, i] == "-") / nrow(msa.char)
}

msa.select <- subset(msa.raw, select = missing.cols < 0.85)
msa.select2 <- msa.raw[, missing.cols < 0.85] # not the same
Best wishes, Konrad

EDIT: typo, formatting and suggestion & pseudo-code

NeuKon commented 2 years ago

Hi Klaus,

Thank you very much for your reply, and for this amazing package. Now I understand how the difference between the square bracket notation and subset came to be.

ape::del.colgapsonly was my first choice, but it only works for the class DNA.bin. Unfortunately, I am working with amino acid sequences.

Kind regards, Konrad

KlausVigo commented 2 years ago

Hi @NeuKon I give a workshop with Emmanuel at the moment. He introduced the amino acid models later to ape, so ape::del.colgapsonly probably skipped through. Kind regards, Klaus

KlausVigo / phangorn