Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings
57 stars 16 forks source link

stringDist returns negative values #3

Closed Andreas-Bio closed 6 years ago

Andreas-Bio commented 6 years ago

subst <- matrix(nrow=4,ncol=4,dimnames=list(c("A","C","T","G"),c("A","C","T","G")),data=1) diag(subst) <- 0 subst["A","G"] <- 0.75 subst["T","C"] <- 0.75 subst["G","A"] <- 0.75 subst["C","T"] <- 0.75

stringDist(DNAStringSet(c("CGTCTTA", "GACA" )), method = "substitutionMatrix", type = "global", substitutionMatrix = subst, diag = T, upper = T, gapOpening = 1)

 1    2

1 0.0 -0.5 2 -0.5 0.0

Why is there even such a thing as a negative distance? Makes no sense to me.

hpages commented 6 years ago

This is fixed in Biostrings 2.47.2 (by commit https://github.com/Bioconductor/Biostrings/commit/4b43c7de3ac053336f3096e4385e4172e4bf3511). Note that substitution matrices are expected to have non-positive values outside their diagonal. Otherwise stringDist() can still return negative distances (e.g. with a substitution matrix like your subst).

Andreas-Bio commented 6 years ago

Oh cool, thanks. My next point on the wish-list would be to be able to use asymetric matrices. For example if one wants to develop a primer: If the primer has a N and the sequence a T it is no "mismatch". If the sequence has an N and the primer a T that would be a 3/4 mismatch. But I guess that is another story.

hpages commented 6 years ago

Well, I wouldn't have expected that the same person who (rightly) complained about a negative distance was also going to be the person who suggests the idea of an asymmetric distance ;-)

FWIW pairwiseAlignment() accepts asymmetric substitution matrices and seems to do the right thing on them:

library(Biostrings)
subst <- nucleotideSubstitutionMatrix(match=0, mismatch=-1)
subst["N", "T"]
# [1] -0.75
subst["T", "N"]
# [1] -0.75
subst["N", "T"] <- 0
pattern <- DNAString("CCNCC")
subject <- DNAString("CCTCC")
pairwiseAlignment(pattern, subject, substitutionMatrix=subst)
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: CCNCC 
# subject: CCTCC 
# score: 0 
pairwiseAlignment(pattern, subject, substitutionMatrix=t(subst))
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: CCNCC 
# subject: CCTCC 
# score: -0.75 

Since strindDist() is basically a wrapper around pairwiseAlignment(), it should not be too hard to adapt it to also accept asymmetric substitution matrices. I'll put this on the long-term TODO list but won't be able to get to this anytime soon. You're welcome to work on a patch and send me a pull request if you feel like to. Any follow-up on this should preferably happen in a new issue. Cheers.

acvill commented 2 years ago

For anyone interested, here's the code to generate the asymmetric substitution matrix that @andzandz11 describes:

subst <- nucleotideSubstitutionMatrix(match=0, mismatch=-1)
subst["N", "A"] <- 0
subst["N", "T"] <- 0
subst["N", "G"] <- 0
subst["N", "C"] <- 0
subst["M", "A"] <- 0
subst["M", "C"] <- 0
subst["R", "A"] <- 0
subst["R", "G"] <- 0
subst["W", "A"] <- 0
subst["W", "T"] <- 0
subst["S", "C"] <- 0
subst["S", "G"] <- 0
subst["Y", "C"] <- 0
subst["Y", "T"] <- 0
subst["K", "G"] <- 0
subst["K", "T"] <- 0
subst["V", "A"] <- 0
subst["V", "C"] <- 0
subst["V", "G"] <- 0
subst["H", "A"] <- 0
subst["H", "C"] <- 0
subst["H", "T"] <- 0
subst["D", "A"] <- 0
subst["D", "G"] <- 0
subst["D", "T"] <- 0
subst["B", "C"] <- 0
subst["B", "G"] <- 0
subst["B", "T"] <- 0

So, if your pattern has ambiguity, your subject score will not be penalized for a "mismatch".

pat <- DNAString("MRVHD")
sub <- DNAString("AAAAA")
pairwiseAlignment(pattern = pat, subject = sub, substitutionMatrix = subst)

Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: MRVHD
subject: AAAAA
score: 0 

Note that this does not handle asymmetric cases where both pattern and subject have ambiguous letters. i.e. pat = DNAString("W"); sub = DNAString("D") should give a different score than pat = DNAString("D"); sub = DNAString("W")

acvill commented 2 years ago

And here's a function to create a full asymmetric nucleotide substitution matrix, including cases where both pattern and subject have ambiguous IUPAC encodings:

asym_NSM <- function(match = 1, mismatch = 0, baseOnly = FALSE, type = "DNA") {
  "%safemult%" <- function(x, y) ifelse(is.infinite(x) & y == 0, 0, x * y)
  type <- match.arg(type, c("DNA", "RNA"))
  if (!isSingleNumber(match) || !isSingleNumber(mismatch)) {
    stop("'match' and 'mismatch' must be non-missing numbers")
  }
  if (baseOnly) {
    letters <- IUPAC_CODE_MAP[DNA_BASES]
  }
  else {
    letters <- IUPAC_CODE_MAP
  }
  if (type == "RNA") {
    names(letters) <- chartr("T", "U", names(letters))
  }
  nLetters <- length(letters)
  splitLetters <- strsplit(letters,split="")
  submat <- matrix(0, nrow = nLetters, ncol = nLetters, dimnames = list(names(letters), 
  names(letters)))
  for (i in 1:nLetters) {
    for (j in i:nLetters) {
      submat[i,j] <- mean(outer(splitLetters[[i]], splitLetters[[j]], "%in%"))
      submat[j,i] <- mean(outer(splitLetters[[j]], splitLetters[[i]], "%in%"))
    }
  }
  abs(match) * submat - abs(mismatch) %safemult% (1 - submat)
}

This function is based on nucleotideSubstitutionMatrix().

https://github.com/Bioconductor/Biostrings/blob/4f9031a1bc0e102148060cd39aa441d812d75ce7/R/pairwiseAlignment.R#L16-L35

For more context, see my question and answer on Bioinformatics.SE.