Closed Andreas-Bio closed 6 years ago
This is fixed in Biostrings 2.47.2 (by commit https://github.com/Bioconductor/Biostrings/commit/4b43c7de3ac053336f3096e4385e4172e4bf3511). Note that substitution matrices are expected to have non-positive values outside their diagonal. Otherwise stringDist()
can still return negative distances (e.g. with a substitution matrix like your subst
).
Oh cool, thanks. My next point on the wish-list would be to be able to use asymetric matrices. For example if one wants to develop a primer: If the primer has a N and the sequence a T it is no "mismatch". If the sequence has an N and the primer a T that would be a 3/4 mismatch. But I guess that is another story.
Well, I wouldn't have expected that the same person who (rightly) complained about a negative distance was also going to be the person who suggests the idea of an asymmetric distance ;-)
FWIW pairwiseAlignment()
accepts asymmetric substitution matrices and seems to do the right thing on them:
library(Biostrings)
subst <- nucleotideSubstitutionMatrix(match=0, mismatch=-1)
subst["N", "T"]
# [1] -0.75
subst["T", "N"]
# [1] -0.75
subst["N", "T"] <- 0
pattern <- DNAString("CCNCC")
subject <- DNAString("CCTCC")
pairwiseAlignment(pattern, subject, substitutionMatrix=subst)
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: CCNCC
# subject: CCTCC
# score: 0
pairwiseAlignment(pattern, subject, substitutionMatrix=t(subst))
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: CCNCC
# subject: CCTCC
# score: -0.75
Since strindDist()
is basically a wrapper around pairwiseAlignment()
, it should not be too hard to adapt it to also accept asymmetric substitution matrices. I'll put this on the long-term TODO list but won't be able to get to this anytime soon. You're welcome to work on a patch and send me a pull request if you feel like to. Any follow-up on this should preferably happen in a new issue. Cheers.
For anyone interested, here's the code to generate the asymmetric substitution matrix that @andzandz11 describes:
subst <- nucleotideSubstitutionMatrix(match=0, mismatch=-1)
subst["N", "A"] <- 0
subst["N", "T"] <- 0
subst["N", "G"] <- 0
subst["N", "C"] <- 0
subst["M", "A"] <- 0
subst["M", "C"] <- 0
subst["R", "A"] <- 0
subst["R", "G"] <- 0
subst["W", "A"] <- 0
subst["W", "T"] <- 0
subst["S", "C"] <- 0
subst["S", "G"] <- 0
subst["Y", "C"] <- 0
subst["Y", "T"] <- 0
subst["K", "G"] <- 0
subst["K", "T"] <- 0
subst["V", "A"] <- 0
subst["V", "C"] <- 0
subst["V", "G"] <- 0
subst["H", "A"] <- 0
subst["H", "C"] <- 0
subst["H", "T"] <- 0
subst["D", "A"] <- 0
subst["D", "G"] <- 0
subst["D", "T"] <- 0
subst["B", "C"] <- 0
subst["B", "G"] <- 0
subst["B", "T"] <- 0
So, if your pattern
has ambiguity, your subject
score will not be penalized for a "mismatch".
pat <- DNAString("MRVHD")
sub <- DNAString("AAAAA")
pairwiseAlignment(pattern = pat, subject = sub, substitutionMatrix = subst)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: MRVHD
subject: AAAAA
score: 0
Note that this does not handle asymmetric cases where both pattern
and subject
have ambiguous letters. i.e. pat = DNAString("W"); sub = DNAString("D")
should give a different score than pat = DNAString("D"); sub = DNAString("W")
And here's a function to create a full asymmetric nucleotide substitution matrix, including cases where both pattern
and subject
have ambiguous IUPAC encodings:
asym_NSM <- function(match = 1, mismatch = 0, baseOnly = FALSE, type = "DNA") {
"%safemult%" <- function(x, y) ifelse(is.infinite(x) & y == 0, 0, x * y)
type <- match.arg(type, c("DNA", "RNA"))
if (!isSingleNumber(match) || !isSingleNumber(mismatch)) {
stop("'match' and 'mismatch' must be non-missing numbers")
}
if (baseOnly) {
letters <- IUPAC_CODE_MAP[DNA_BASES]
}
else {
letters <- IUPAC_CODE_MAP
}
if (type == "RNA") {
names(letters) <- chartr("T", "U", names(letters))
}
nLetters <- length(letters)
splitLetters <- strsplit(letters,split="")
submat <- matrix(0, nrow = nLetters, ncol = nLetters, dimnames = list(names(letters),
names(letters)))
for (i in 1:nLetters) {
for (j in i:nLetters) {
submat[i,j] <- mean(outer(splitLetters[[i]], splitLetters[[j]], "%in%"))
submat[j,i] <- mean(outer(splitLetters[[j]], splitLetters[[i]], "%in%"))
}
}
abs(match) * submat - abs(mismatch) %safemult% (1 - submat)
}
This function is based on nucleotideSubstitutionMatrix()
.
For more context, see my question and answer on Bioinformatics.SE.
subst <- matrix(nrow=4,ncol=4,dimnames=list(c("A","C","T","G"),c("A","C","T","G")),data=1) diag(subst) <- 0 subst["A","G"] <- 0.75 subst["T","C"] <- 0.75 subst["G","A"] <- 0.75 subst["C","T"] <- 0.75
stringDist(DNAStringSet(c("CGTCTTA", "GACA" )), method = "substitutionMatrix", type = "global", substitutionMatrix = subst, diag = T, upper = T, gapOpening = 1)
1 0.0 -0.5 2 -0.5 0.0
Why is there even such a thing as a negative distance? Makes no sense to me.