Consensus sequence when aligning to 'flexible' nucleotides (Y, K, R, D, ...)

francoiskroll commented 3 years ago

Thanks for a really helpful msa R package!

Say I align to a reference that has 'flexible' nucleotides. For example Y can be C or T (all codes here https://www.bioinformatics.org/sms/iupac.html). Example:

reference <- 'GGY' # my reference with flexible nucleotide Y
query <- 'GGT' # my query

# alignment using ClustalOmega
alignment <- msa(c(reference, query), type='dna', 'ClustalOmega') # ! you probably do NOT want to use this, this was an error; see my comment below

# looking at the consensus sequence
msaConsensusSequence(alignment)

So aligning GGT to GGY returns GG? as consensus sequence. However, I would like it to return GGY. Is there any way I can do that? I cannot find an option in the documentation.

Thanks for your help!

francoiskroll commented 3 years ago

Here's a dirty fix below, probably quite specific to my case. Do let me know if there's an option I missed somewhere!

Example

reference <- 'GGY'
query <- 'GGT'

align <- msa(c(reference, query), type='dna', 'ClustalOmega')

msaConsensusSequence(align)

[1] "GG?"

Now using function msaConsensusFlexible (below) msaConsensusFlexible(align)

[1] "GGY"

Code

# build look-up table for flexible nucleotides
fnu <- c('R', 'Y', 'S', 'W', 'K', 'M', 'B', 'D', 'H', 'V', 'N') # flexible nucleotides
lut <- as.data.frame(matrix(nrow=4, ncol=length(fnu)))
colnames(lut) <- fnu
lut$R[1:2] <- c('A', 'G')
lut$Y[1:2] <- c('C', 'T')
lut$S[1:2] <- c('G', 'C')
lut$W[1:2] <- c('A', 'T')
lut$K[1:2] <- c('G', 'T')
lut$M[1:2] <- c('A', 'C')
lut$B[1:3] <- c('C', 'G', 'T')
lut$D[1:3] <- c('A', 'G', 'T')
lut$H[1:3] <- c('A', 'C', 'T')
lut$V[1:3] <- c('A', 'C', 'G')
lut$N[1:4] <- c('A', 'C', 'G', 'T')

# function msaConsensusFlexible
  # use instead of msaConsensusSequence, will return the consensus sequence of an alignment=
  # will replace ? in the consensus sequence which were thrown because of the presence of a flexible nucleotide, where the alignment was fine (eg. Y vs T)
msaConsensusFlexible <- function (alignment){
  conm <- consensusMatrix(alignment)
  cons <- msaConsensusSequence(alignment)

  sapply(1:ncol(conm), function(pos) { # for each position of the alignment
    al <- conm[,pos] # get the alignment matrix at that position
    # count number of flexible nucleotides at that position
    nfnu <- sum(al[which(names(al) %in% fnu)])

    if (nfnu==0) { # if no flexible nucleotide at that position, do nothing
      return(NULL)
    }

    if (nfnu>0) { # if any flexible nucleotide at that position
      # get the flexible nucleotide
      f <- names(which(al[which(names(al) %in% fnu)] > 0))
      # get the standard nucleotide
      s <- names(which(al[which(!names(al) %in% fnu)] > 0))

      # is the alignment valid? Check in the look-up table; eg. T vs Y is ok
      if (!s %in% lut[,f]) { # if not -- do NOT modify the consensus sequence at that position (should leave ?)
        return(NULL)
      }

      else if (s %in% lut[,f]) { # if yes -- modify the consensus sequence at that position (should put the flexible nucleotide instead of the ?)
        substr(cons, pos, pos) <<- f
      }
    }

  })

  return(cons)

}

UBod commented 3 years ago

Thanks for your comments and your neat solution, @francoiskroll! Actually, the package includes this already. msaConsensusSequence() is just a unified frontend to two functions:

to the consensusString() function from the Biostringspackage,
to a custom function implemented inside the msa package.

The Biostrings frontend is actually chosen by default, and this function actually implements the functionality you are looking for. I admit it's hidden in the arguments that are forwarded to the consensusString() function, but the following code works on my system:

library(msa)
reference <- 'GGY' # my reference with flexible nucleotide Y
query <- 'GGT' # my query

# alignment using ClustalOmega
alignment <- msa(c(reference, query), type='dna', 'ClustalOmega')

# looking at the consensus sequence
msaConsensusSequence(alignment, ambiguityMap=IUPAC_CODE_MAP, threshold=0.25)

francoiskroll commented 3 years ago

Ah brilliant! Thanks a lot. Works here as well.

Could you link me to an explanation of that threshold? It does not look like the same upperlower threshold described in the documentation?

Disclaimer!

For anyone reading this thread; you probably do not want to align DNA sequences with ClustalOmega as I am doing here! This was an error as it uses an amino acid subsitution matrix. See https://github.com/UBod/msa/issues/13.

msa(c(reference, query), type='dna')
msa(c(reference, query), type='dna', method='ClustalW')
msa(c(reference, query), type='dna', substitutionMatrix='clustalw')
msa(c(reference, query), type='dna', substitutionMatrix='iub')

etc.

are all OK for DNA sequences (with different methods/substitution matrices).

UBod commented 3 years ago

The threshold argument is from the consensusString() function that is provided by the Biostrings package. Let me quote the corresponding help text (?consensusString):

"threshold: The minimum probability threshold for an agreement to be declared. When ambiguityMap is a single character, threshold is a single number in (0, 1]. When ambiguityMap is a named character vector (e.g. IUPAC_CODE_MAP), threshold is a single number in (0, 1/sum(nchar(ambiguityMap) == 1)]."

UBod / msa

Consensus sequence when aligning to 'flexible' nucleotides (Y, K, R, D, ...) #12