Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings
57 stars 16 forks source link

Suggestion for Biostrings::translate - option to disambiguate #55

Closed ammaraziz closed 2 years ago

ammaraziz commented 3 years ago

Hi,

Thanks for Biostrings, it's a fantastic R package that I use frequently.

The translate function works well to fuzzy translate ambiguous nucleotides as demonstrated in the help page of the function:

dna2 <- DNAString("HTG ATH **TGR** CCC YTR TRA")
translate(dna2, if.fuzzy.codon="solve")
     seq: MIXPL*

This works exactly as documented. I'd like to suggest an enhancement if it's easy to add. The function has translated TGR to MIXPL but there are two possible translation solution a stop `orW`.

The expected output would be:

translate(dna2, if.fuzzy.codon="solve")
      seq: MI*PL*
      seq: MIWPL*

It works well for short sequences but I can see this growing with very quickly with n > 3 ambiguous bases.

If this is outside the scope of this package or the function, please feel free to close the issue.

Thanks again,

Ammar

hpages commented 2 years ago

Hi @ammaraziz ,

Such an old issue! I missed it, sorry.

Remember that translate() is vectorized i.e. it can take a DNAStringSet object of arbitrary length and it will always return an AAStringSet object parallel to the input object, that is, an AAStringSet with the same length as the input and where the i-th string is associated with the i-th string in the input.

The translate's contract is to solve each ambiguous codon by replacing it with a single letter. If we changed that to do something like you propose, the output would no longer be parallel to the input when the latter is a DNAStringSet object. Also, as you've noticed, doing this would quickly lead to a combinatorial explosion when the DNA sequence contains several ambiguous codons.

If you really want this kind of disambiguation, it's actually not too hard to do it upfront, that is, to do it on the DNAString object itself before passing it to translate(). But again, this will be manageable only if the number of ambiguities is relatively small:

library(Biostrings)

## Take a DNAString object and return a DNAStringSet object.
disambiguateDNA <- function(dna)
{
    input_is_DNAString <- is(dna, "DNAString")
    if (input_is_DNAString) {
        dna <- as.character(dna)
    } else {
        stopifnot(isSingleString(dna))
    }
    split_dna <- strsplit(dna, split="")[[1]]
    ans <- ""
    for (i in seq_along(split_dna)) {
        letter <- split_dna[[i]]
        disambiguated_letters <- strsplit(IUPAC_CODE_MAP[[letter]], split="")[[1]]
        n <- length(disambiguated_letters)
        ans <- paste0(rep(ans, each=n), disambiguated_letters)
    }
    if (input_is_DNAString)
        ans <- DNAStringSet(ans)
    ans
}

Use like this:

ambiguous_dna <- DNAString("CAACKTBCCATATGRAGAGYG")

disambiguated_dna <- disambiguateDNA(ambiguous_dna)

disambiguated_dna
# DNAStringSet object of length 24:
#      width seq
#  [1]    21 CAACGTCCCATATGAAGAGCG
#  [2]    21 CAACGTCCCATATGAAGAGTG
#  [3]    21 CAACGTCCCATATGGAGAGCG
#  [4]    21 CAACGTCCCATATGGAGAGTG
#  [5]    21 CAACGTGCCATATGAAGAGCG
#  ...   ... ...
# [20]    21 CAACTTGCCATATGGAGAGTG
# [21]    21 CAACTTTCCATATGAAGAGCG
# [22]    21 CAACTTTCCATATGAAGAGTG
# [23]    21 CAACTTTCCATATGGAGAGCG
# [24]    21 CAACTTTCCATATGGAGAGTG

translate(disambiguated_dna)
# AAStringSet object of length 24:
#      width seq
#  [1]     7 QRPI*RA
#  [2]     7 QRPI*RV
#  [3]     7 QRPIWRA
#  [4]     7 QRPIWRV
#  [5]     7 QRAI*RA
#  ...   ... ...
# [20]     7 QLAIWRV
# [21]     7 QLSI*RA
# [22]     7 QLSI*RV
# [23]     7 QLSIWRA
# [24]     7 QLSIWRV

Hope this helps.