Closed francoiskroll closed 3 years ago
Here's a dirty fix below, probably quite specific to my case. Do let me know if there's an option I missed somewhere!
Example
reference <- 'GGY'
query <- 'GGT'
align <- msa(c(reference, query), type='dna', 'ClustalOmega')
msaConsensusSequence(align)
[1] "GG?"
Now using function msaConsensusFlexible
(below)
msaConsensusFlexible(align)
[1] "GGY"
Code
# build look-up table for flexible nucleotides
fnu <- c('R', 'Y', 'S', 'W', 'K', 'M', 'B', 'D', 'H', 'V', 'N') # flexible nucleotides
lut <- as.data.frame(matrix(nrow=4, ncol=length(fnu)))
colnames(lut) <- fnu
lut$R[1:2] <- c('A', 'G')
lut$Y[1:2] <- c('C', 'T')
lut$S[1:2] <- c('G', 'C')
lut$W[1:2] <- c('A', 'T')
lut$K[1:2] <- c('G', 'T')
lut$M[1:2] <- c('A', 'C')
lut$B[1:3] <- c('C', 'G', 'T')
lut$D[1:3] <- c('A', 'G', 'T')
lut$H[1:3] <- c('A', 'C', 'T')
lut$V[1:3] <- c('A', 'C', 'G')
lut$N[1:4] <- c('A', 'C', 'G', 'T')
# function msaConsensusFlexible
# use instead of msaConsensusSequence, will return the consensus sequence of an alignment=
# will replace ? in the consensus sequence which were thrown because of the presence of a flexible nucleotide, where the alignment was fine (eg. Y vs T)
msaConsensusFlexible <- function (alignment){
conm <- consensusMatrix(alignment)
cons <- msaConsensusSequence(alignment)
sapply(1:ncol(conm), function(pos) { # for each position of the alignment
al <- conm[,pos] # get the alignment matrix at that position
# count number of flexible nucleotides at that position
nfnu <- sum(al[which(names(al) %in% fnu)])
if (nfnu==0) { # if no flexible nucleotide at that position, do nothing
return(NULL)
}
if (nfnu>0) { # if any flexible nucleotide at that position
# get the flexible nucleotide
f <- names(which(al[which(names(al) %in% fnu)] > 0))
# get the standard nucleotide
s <- names(which(al[which(!names(al) %in% fnu)] > 0))
# is the alignment valid? Check in the look-up table; eg. T vs Y is ok
if (!s %in% lut[,f]) { # if not -- do NOT modify the consensus sequence at that position (should leave ?)
return(NULL)
}
else if (s %in% lut[,f]) { # if yes -- modify the consensus sequence at that position (should put the flexible nucleotide instead of the ?)
substr(cons, pos, pos) <<- f
}
}
})
return(cons)
}
Thanks for your comments and your neat solution, @francoiskroll! Actually, the package includes this already. msaConsensusSequence()
is just a unified frontend to two functions:
consensusString()
function from the Biostrings
package,msa
package.The Biostrings
frontend is actually chosen by default, and this function actually implements the functionality you are looking for. I admit it's hidden in the arguments that are forwarded to the consensusString()
function, but the following code works on my system:
library(msa)
reference <- 'GGY' # my reference with flexible nucleotide Y
query <- 'GGT' # my query
# alignment using ClustalOmega
alignment <- msa(c(reference, query), type='dna', 'ClustalOmega')
# looking at the consensus sequence
msaConsensusSequence(alignment, ambiguityMap=IUPAC_CODE_MAP, threshold=0.25)
Ah brilliant! Thanks a lot. Works here as well.
Could you link me to an explanation of that threshold
? It does not look like the same upperlower threshold described in the documentation?
Disclaimer!
For anyone reading this thread; you probably do not want to align DNA sequences with ClustalOmega as I am doing here! This was an error as it uses an amino acid subsitution matrix. See https://github.com/UBod/msa/issues/13.
msa(c(reference, query), type='dna')
msa(c(reference, query), type='dna', method='ClustalW')
msa(c(reference, query), type='dna', substitutionMatrix='clustalw')
msa(c(reference, query), type='dna', substitutionMatrix='iub')
etc.
are all OK for DNA sequences (with different methods/substitution matrices).
The threshold
argument is from the consensusString()
function that is provided by the Biostrings
package. Let me quote the corresponding help text (?consensusString
):
"threshold
: The minimum probability threshold for an agreement to be declared. When
ambiguityMap
is a single character, threshold
is a single number in (0, 1].
When ambiguityMap
is a named character vector (e.g. IUPAC_CODE_MAP
),
threshold
is a single number in (0, 1/sum(nchar(ambiguityMap) == 1)
]."
Thanks for a really helpful msa R package!
Say I align to a reference that has 'flexible' nucleotides. For example Y can be C or T (all codes here https://www.bioinformatics.org/sms/iupac.html). Example:
So aligning
GGT
toGGY
returnsGG?
as consensus sequence. However, I would like it to returnGGY
. Is there any way I can do that? I cannot find an option in the documentation.Thanks for your help!