Closed ammaraziz closed 2 years ago
Hi @ammaraziz ,
Such an old issue! I missed it, sorry.
Remember that translate()
is vectorized i.e. it can take a DNAStringSet object of arbitrary length and it will always return an AAStringSet object parallel to the input object, that is, an AAStringSet with the same length as the input and where the i-th string is associated with the i-th string in the input.
The translate
's contract is to solve each ambiguous codon by replacing it with a single letter. If we changed that to do something like you propose, the output would no longer be parallel to the input when the latter is a DNAStringSet object. Also, as you've noticed, doing this would quickly lead to a combinatorial explosion when the DNA sequence contains several ambiguous codons.
If you really want this kind of disambiguation, it's actually not too hard to do it upfront, that is, to do it on the DNAString object itself before passing it to translate()
. But again, this will be manageable only if the number of ambiguities is relatively small:
library(Biostrings)
## Take a DNAString object and return a DNAStringSet object.
disambiguateDNA <- function(dna)
{
input_is_DNAString <- is(dna, "DNAString")
if (input_is_DNAString) {
dna <- as.character(dna)
} else {
stopifnot(isSingleString(dna))
}
split_dna <- strsplit(dna, split="")[[1]]
ans <- ""
for (i in seq_along(split_dna)) {
letter <- split_dna[[i]]
disambiguated_letters <- strsplit(IUPAC_CODE_MAP[[letter]], split="")[[1]]
n <- length(disambiguated_letters)
ans <- paste0(rep(ans, each=n), disambiguated_letters)
}
if (input_is_DNAString)
ans <- DNAStringSet(ans)
ans
}
Use like this:
ambiguous_dna <- DNAString("CAACKTBCCATATGRAGAGYG")
disambiguated_dna <- disambiguateDNA(ambiguous_dna)
disambiguated_dna
# DNAStringSet object of length 24:
# width seq
# [1] 21 CAACGTCCCATATGAAGAGCG
# [2] 21 CAACGTCCCATATGAAGAGTG
# [3] 21 CAACGTCCCATATGGAGAGCG
# [4] 21 CAACGTCCCATATGGAGAGTG
# [5] 21 CAACGTGCCATATGAAGAGCG
# ... ... ...
# [20] 21 CAACTTGCCATATGGAGAGTG
# [21] 21 CAACTTTCCATATGAAGAGCG
# [22] 21 CAACTTTCCATATGAAGAGTG
# [23] 21 CAACTTTCCATATGGAGAGCG
# [24] 21 CAACTTTCCATATGGAGAGTG
translate(disambiguated_dna)
# AAStringSet object of length 24:
# width seq
# [1] 7 QRPI*RA
# [2] 7 QRPI*RV
# [3] 7 QRPIWRA
# [4] 7 QRPIWRV
# [5] 7 QRAI*RA
# ... ... ...
# [20] 7 QLAIWRV
# [21] 7 QLSI*RA
# [22] 7 QLSI*RV
# [23] 7 QLSIWRA
# [24] 7 QLSIWRV
Hope this helps.
Hi,
Thanks for Biostrings, it's a fantastic R package that I use frequently.
The
translate
function works well to fuzzy translate ambiguous nucleotides as demonstrated in the help page of the function:This works exactly as documented. I'd like to suggest an enhancement if it's easy to add. The function has translated
TGR
to MIXPL but there are two possible translation solution a stop `or
W`.The expected output would be:
It works well for short sequences but I can see this growing with very quickly with n > 3 ambiguous bases.
If this is outside the scope of this package or the function, please feel free to close the issue.
Thanks again,
Ammar