Open jayoung opened 5 years ago
PS no need to hurry to implement this for my benefit - I made my own functions that (I think) do what I want. Not elegant, and I'm sure not robust to the user doing unintended things, but they work at least on my own alignments.
Still might be a useful thing to implement in Biostrings (although maybe I should be using some other package for this?)
if you're curious:
library(Biostrings)
## myAln is a DNAStringSet with a gapped, in-frame alignment
getCodons <- function(myAln) {
    len <- width(myAln)[1]
    ## splitting into codons only makes sense if the length is a multiple of 3
    stopifnot(len %% 3 == 0)
    starts <- seq(from=1, to=len, by=3)
    ends <- starts + 2
    myViews <- lapply(myAln, function(x) {
        Views(x, starts, ends)
    })
    myCodons <- lapply(myViews, function(x) {
        as.character(DNAStringSet(x))
    })
    myCodons
}
## myCodons is a simple character vector, each item is a codon (like one of the items in a list generated by getCodons)
translateCodons <- function(myCodons, unknownCodonTranslatesTo="-") {
    ## make new genetic code: the standard code plus "---", which translates to "-"
    gapCodon <- "-"
    names(gapCodon) <- "---"
    my_GENETIC_CODE <- c(GENETIC_CODE, gapCodon)
    ## translate the codons (codons missing from the code come back as NA)
    pep <- my_GENETIC_CODE[myCodons]
    ## check for codons that were not possible to translate, e.g. frameshift codons
    if (anyNA(pep)) {
        warning("there were codons I could not translate. Using this character: ",
                unknownCodonTranslatesTo, call.=FALSE)
        pep[is.na(pep)] <- unknownCodonTranslatesTo
    }
    ## prep for output
    pep <- paste(pep, collapse="")
    return(pep)
}
## wrap those functions together into one:
translateGappedAln <- function(myAln, unknownCodonTranslatesTo="-") {
    myCodons <- getCodons(myAln)
    myAAaln <- AAStringSet(unlist(lapply(myCodons, translateCodons, unknownCodonTranslatesTo=unknownCodonTranslatesTo)))
    return(myAAaln)
}
## tests:
test1 <- DNAStringSet( c("ATGGCTGCGCGGGGC", "ATGGCTGGGCGGGGC") )
test2 <- DNAStringSet( c("ATGGCTGCGCGGGGC", "ATGGCTG-GCGGGGC") )
translateGappedAln(test1)
translateGappedAln(test2)
translateGappedAln(test2, unknownCodonTranslatesTo="X")
I now see that there's another class that I could have used for my alignments (see ?DNAMultipleAlignment), but I don't think it helps me with codons/translation.
Hi Janet, I'll look at this after the next BioC release (next week). Thanks! H.
Any update on getting codons and translations from sequences with gaps? I'm having the same issue.
hi @sdalin,
I don't know if Herve had time to add anything to Biostrings, but in the meantime here is an ad hoc solution I used.
I only accounted for the 64 codons of the standard genetic code plus "---", so if you have others present (for example, codons containing ambiguity codes or partial gaps) you might need to modify it, depending on what you want the output to look like. "---" translates to "-", and you can choose what unknown codons translate to (default is "-").
Janet
library(Biostrings)
######### some functions to translate gapped alignments:
## getCodons - a function to split sequences into codons.
# input (myAln) is a DNAStringSet with a gapped, in-frame alignment
# output is a simple list, one element for each sequence. Each list element is a character vector of codons
getCodons <- function(myAln) {
    len <- width(myAln)[1]
    ## splitting into codons only makes sense if the length is a multiple of 3
    stopifnot(len %% 3 == 0)
    starts <- seq(from=1, to=len, by=3)
    ends <- starts + 2
    myViews <- lapply(myAln, function(x) {
        Views(x, starts, ends)
    })
    myCodons <- lapply(myViews, function(x) {
        as.character(DNAStringSet(x))
    })
    myCodons
}
## translateCodons - takes a character vector of codons as input, outputs the corresponding amino acids
translateCodons <- function(myCodons, unknownCodonTranslatesTo="-") {
    ## make new genetic code: the standard code plus "---", which translates to "-"
    gapCodon <- "-"
    names(gapCodon) <- "---"
    my_GENETIC_CODE <- c(GENETIC_CODE, gapCodon)
    ## translate the codons (codons missing from the code come back as NA)
    pep <- my_GENETIC_CODE[myCodons]
    ## check for codons that were not possible to translate, e.g. frameshift codons
    if (anyNA(pep)) {
        warning("there were codons I could not translate. Using this character: ",
                unknownCodonTranslatesTo, call.=FALSE)
        pep[is.na(pep)] <- unknownCodonTranslatesTo
    }
    ## prep for output
    pep <- paste(pep, collapse="")
    return(pep)
}
## wrap the getCodons and translateCodons functions together into one:
translateGappedAln <- function(myAln, unknownCodonTranslatesTo="-") {
    myCodons <- getCodons(myAln)
    myAAaln <- AAStringSet(unlist(lapply(myCodons, translateCodons, unknownCodonTranslatesTo=unknownCodonTranslatesTo)))
    return(myAAaln)
}
## test those functions:
test1 <- DNAStringSet( c("ATGGCTGCGCGGGGC", "ATGGCTGGGCGGGGC") )
test2 <- DNAStringSet( c("ATGGCTGCGCGGGGC", "ATGGCTG-GCGGGGC") )
translateGappedAln(test1)
translateGappedAln(test2)
translateGappedAln(test2, unknownCodonTranslatesTo="X")
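For anyone who wants the same idea without the Views step, here is a minimal sketch that works on plain character strings. The function name `translateGappedChar` and its arguments are made up for illustration and are not part of Biostrings. It assumes every sequence is in frame (length a multiple of 3), and it only uses `GENETIC_CODE` from Biostrings when no `code` table is passed in.

```r
# Sketch only: translate gapped, in-frame coding sequences given as strings.
# `code` is a named character vector mapping codons to amino acids. The default
# (GENETIC_CODE from Biostrings) is evaluated only if `code` is not supplied.
translateGappedChar <- function(seqs,
                                code = Biostrings::GENETIC_CODE,
                                unknown = "-") {
  stopifnot(is.character(seqs), nchar(unknown) == 1L)
  if (any(nchar(seqs) %% 3L != 0L)) {
    stop("all sequences must have a length that is a multiple of 3")
  }
  # a codon made only of gaps becomes a single gap residue
  code <- c(code, "---" = "-")
  vapply(seqs, function(s) {
    n <- nchar(s)
    if (n == 0L) return("")
    starts <- seq(1L, n, by = 3L)
    codons <- substring(s, starts, starts + 2L)
    aa <- unname(code[codons])   # codons missing from the table give NA
    aa[is.na(aa)] <- unknown     # e.g. partial-gap or frameshift codons
    paste(aa, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}
```

With Biostrings loaded, `translateGappedChar(as.character(test2), unknown="X")` should give the same peptides as `translateGappedAln(test2, unknownCodonTranslatesTo="X")`, just as a character vector rather than an AAStringSet.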
This issue has been open a while, but I do have an update: I'm currently working on resolving this, and I hope to have a final solution by the end of next week. I have a working first implementation of translate() at https://github.com/ahl27/Biostrings/tree/AllowGaps30, and I'm working on codons(...) next.
The current implementation converts --- to -; anything else with a gap throws an error.
> translate(DNAString("ATGATG"))
2-letter AAString object
seq: MM
> translate(DNAString("ATG---ATG"))
3-letter AAString object
seq: M-M
> translate(DNAString("ATG---ATG"), if.fuzzy.codon='solve')
3-letter AAString object
seq: M-M
> translate(DNAString("ATGA-AATG"))
Error in .Call2("DNAStringSet_translate", x, skip_code, gap_code, dna_codes[codon_alphabet], :
unable to resolve gap codon at pos 4-6
Adding an option to convert partial matches to something else (e.g. translate(DNAString("ATGA-AATG")) returning M.M or MXM) would be fairly simple. I'll add that after the first pass is done.
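To make the two behaviours concrete, here is a rough base-R sketch of the decision being discussed. It is not the code in the branch above: the helper `resolveGapCodons` and its `partial` and `replacement` arguments are hypothetical names chosen for illustration, and the error text simply mirrors the message shown in the transcript.

```r
# Hypothetical helper (not part of Biostrings): decide what each codon of one
# in-frame sequence becomes before normal translation.
#   "---"                    -> "-"
#   codon with some gaps     -> error, or `replacement` if partial = "replace"
#   codon without gaps       -> NA, meaning "translate as usual"
resolveGapCodons <- function(codons, partial = c("error", "replace"),
                             replacement = "X") {
  partial <- match.arg(partial)
  hasGap <- grepl("-", codons, fixed = TRUE)
  allGap <- codons == "---"
  isPartial <- hasGap & !allGap
  if (partial == "error" && any(isPartial)) {
    # report the nucleotide positions of the first offending codon
    pos <- (which(isPartial)[1] - 1L) * 3L + 1L
    stop(sprintf("unable to resolve gap codon at pos %d-%d", pos, pos + 2L))
  }
  out <- rep(NA_character_, length(codons))
  out[allGap] <- "-"
  out[isPartial] <- replacement
  out
}
```

For the codons `c("ATG", "A-A", "ATG")`, the default stops with an error for positions 4-6, while `partial = "replace"` marks the middle codon as "X" and leaves the other two to be translated normally.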
hi there,
I've been reading in some multiple sequence alignments as DNAStringSet objects. They're nucleotide alignments that encode proteins. I've been playing with 'translate', but it looks like it's not set up to deal with gap characters.
I might be missing some nice alternative way to do this, but if not, I'd like to suggest an enhancement to better handle in-frame nucleotide alignments of coding sequences. I think the code below will show you what I mean, but if it's not clear please let me know.
thanks!
Janet
Dr. Janet Young
Malik lab http://research.fhcrc.org/malik/en.html
Division of Basic Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Avenue N., A2-025, P.O. Box 19024, Seattle, WA 98109-1024, USA.
tel: (206) 667 4512 email: jayoung ...at... fredhutch.org