Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings

translate uses alternative start codons by default #24

Closed FelixErnst closed 5 years ago

FelixErnst commented 5 years ago

I am not sure that this behavior is intended:

library(Biostrings)
translate(DNAStringSet(c("TTG","CTG","TAT")))
#>   A AAStringSet instance of length 3
#>     width seq
#> [1]     1 M
#> [2]     1 M
#> [3]     1 Y
GENETIC_CODE
#> TTT TTC TTA TTG TCT TCC TCA TCG TAT TAC TAA TAG TGT TGC TGA TGG CTT CTC 
#> "F" "F" "L" "L" "S" "S" "S" "S" "Y" "Y" "*" "*" "C" "C" "*" "W" "L" "L" 
#> CTA CTG CCT CCC CCA CCG CAT CAC CAA CAG CGT CGC CGA CGG ATT ATC ATA ATG 
#> "L" "L" "P" "P" "P" "P" "H" "H" "Q" "Q" "R" "R" "R" "R" "I" "I" "I" "M" 
#> ACT ACC ACA ACG AAT AAC AAA AAG AGT AGC AGA AGG GTT GTC GTA GTG GCT GCC 
#> "T" "T" "T" "T" "N" "N" "K" "K" "S" "S" "R" "R" "V" "V" "V" "V" "A" "A" 
#> GCA GCG GAT GAC GAA GAG GGT GGC GGA GGG 
#> "A" "A" "D" "D" "E" "E" "G" "G" "G" "G" 
#> attr(,"alt_init_codons")
#> [1] "TTG" "CTG"

In my opinion TTG and CTG should return L, shouldn't they?

FelixErnst commented 5 years ago

Ah, sorry, my bad… note to self: read the manual.

ababaian commented 4 years ago

Good afternoon,

I recently tracked down an error in my workflow to some non-intuitive behavior in the Biostrings::translate() function, hacked a work-around, and then saw you had already updated the fix. I wrote this up and thought I'd leave it here in case someone else runs into this.

In essence, when I was translating a short cDNA fragment CTGACGCGAGCAGCCAAG, it was reading the CTG as a non-standard initiation site, so the resulting peptide was MTRAAK.

I was matching this against a trypsin digested peptide library for mass-spec, the trypsin fragment was LTRAAK, and not MTRAAK.

The standard GENETIC_CODE has the attribute alt_init_codons == "TTG" "CTG"

So I could hack around alternative initiation by setting

# From
#  GENETIC_CODE_TABLE$Starts[1] = "---M---------------M---------------M----------------------------"
#  attr(GENETIC_CODE, "alt_init_codons") = c("TTG", "CTG")

# To
  GENETIC_CODE_TABLE$Starts[1] = "-----------------------------------M----------------------------"
  attr(GENETIC_CODE, "alt_init_codons") = "ATG"

and running translate with the modified GENETIC_CODE table

translate(cDNA.fragment, genetic.code = GENETIC_CODE)
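A minimal sketch of the same work-around that leaves both GENETIC_CODE and GENETIC_CODE_TABLE untouched by changing only the attribute on a copy. The name `my_code` is hypothetical, and setting the attribute to "ATG" simply follows the approach described above; this is an illustration under those assumptions, not the package's recommended method.

```r
library(Biostrings)

# Work on a copy so the package's GENETIC_CODE object stays as shipped.
my_code <- GENETIC_CODE

# Replace the alternative initiation codons (TTG, CTG) with ATG only,
# as in the work-around above. ATG already codes for M.
attr(my_code, "alt_init_codons") <- "ATG"

# The fragment from this thread, translated with the modified copy.
cDNA.fragment <- DNAString("CTGACGCGAGCAGCCAAG")
pep <- translate(cDNA.fragment, genetic.code = my_code)
as.character(pep)
```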

I believe in the updated version it should simply be

translate(cDNA.fragment, no.init.codon = TRUE)

FelixErnst commented 4 years ago

Hi Artem @ababaian,

Please have a look at the manual. In the current version

translate(cDNA.fragment, no.init.codon = FALSE)

Behaves like you want it to. I would avoid touching GENETIC_CODE_TABLE

Felix

hpages commented 4 years ago

@FelixErnst I think @ababaian wants to use no.init.codon=TRUE here:

> library(Biostrings)
> translate(DNAString("CTGACGCGAGCAGCCAAG"), no.init.codon=TRUE)
  6-letter "AAString" instance
seq: LTRAAK

@ababaian Looks like you figured this out already. Didn't you?

FelixErnst commented 4 years ago

@hpages @ababaian ah sorry for the mix up.

I meant no.init.codon=TRUE, which was the solution for my initial "problem", and which works out of the box without any modifications of GENETIC_CODE_TABLE.

ababaian commented 4 years ago

I'm good yes, my version of Biostrings just doesn't have the no.init.codon flag so I have the work-around. =D

hpages commented 4 years ago

mmhh... no.init.codon was introduced in Bioconductor 3.8. So you are using a version of Bioconductor that is old and not supported! I would strongly recommend that you update to the most recent version (3.10) released this week! See https://bioconductor.org/news/bioc_3_10_release/
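One way to check whether an installation is recent enough is to compare the Bioconductor version against 3.8, the release named above. This is a rough sketch that assumes the BiocManager package is installed; the variable names are hypothetical.

```r
# Assumes BiocManager is installed; it reports the Bioconductor release in use.
bioc_version <- BiocManager::version()

# Per the comment above, no.init.codon was introduced in Bioconductor 3.8.
has_flag <- bioc_version >= "3.8"

# Also show the installed Biostrings version for reference.
print(packageVersion("Biostrings"))

if (has_flag) {
  message("Bioconductor ", bioc_version, ": no.init.codon should be available")
} else {
  message("Bioconductor ", bioc_version, ": consider updating")
}
```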

singing-scientist commented 4 years ago

Thanks very much for these insights! I do think most people would never dream that start-of-DNAString 'CTG' and 'TTG' triplets would be translated as alternative START codons by default... I wonder if you'd consider changing the default behavior to no.init.codon=TRUE?

hpages commented 4 years ago

Thanks for the feedback.

The best default value for no.init.codon really depends on your use case: do your DNA sequences represent full CDS sequences or CDS chunks? Today your use case is the latter so you complain loudly about the inadequacy of the default behavior. Surely, if the default behavior was no.init.codon=TRUE, it would be users with the former use case who would now complain.
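To make the two use cases concrete, here is a small sketch using the fragment discussed earlier in this thread. The variable names are hypothetical, and it assumes a Biostrings version that has the no.init.codon argument (Bioconductor 3.8 or later).

```r
library(Biostrings)

# The fragment from this thread; it starts with CTG, one of the
# alternative initiation codons listed in GENETIC_CODE.
fragment <- DNAString("CTGACGCGAGCAGCCAAG")

# Full-CDS use case (the default, no.init.codon = FALSE): the first codon
# is treated as an initiation codon, so the leading CTG becomes M.
as_cds <- translate(fragment)

# CDS-chunk use case: the first codon is translated like any other,
# so the leading CTG becomes L.
as_chunk <- translate(fragment, no.init.codon = TRUE)

as.character(as_cds)    # "MTRAAK", as reported above
as.character(as_chunk)  # "LTRAAK", as shown above
```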

I'm not inclined to change the default behavior because:

I like to think that the reason people almost never complained about the current default behavior is because they didn't miss that example.

singing-scientist commented 4 years ago

Thanks very much for your help! It is certainly my fault that I did not read the example — I hope you'll accept my apology for my carelessness. Your reasons for leaving it as-is make good sense, and it's certainly possible that most people know that this is how it works. Thanks again!