Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings
57 stars 16 forks source link

allow gap characters in translate and codons functions? #30

Open jayoung opened 5 years ago

jayoung commented 5 years ago

hi there,

I've been reading in some multiple sequence alignments, as DNAStringSet objects. They're nucleotide alignments that encode proteins. I've been playing with using 'translate', but it looks like it's not set up to deal with gap characters.

I might be missing some nice alternative way to do this, but if not, I guess I'd like to suggest enhancement to better deal with in-frame nucleotide alignments of coding sequences. I think the code below will show you what I mean, but if it's not clear please let me know.

thanks!

Janet


Dr. Janet Young

Malik lab http://research.fhcrc.org/malik/en.html

Division of Basic Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Avenue N., A2-025, P.O. Box 19024, Seattle, WA 98109-1024, USA.

tel: (206) 667 4512 email: jayoung ...at... fredhutch.org


library(Biostrings)

####### example data
### make a multiple sequence alignment of coding sequences
aln <- DNAStringSet(c( "ATGGCATCTACTTTGTATGACTATTGCAGAGTGCCCATGGGTGACATCTGTAAGAAAGATGGGGATAAGCGCTGTAAGCTT", "ATGGCATCTACTTCGTATGACTATTGCAGAGTGCCCATG---------------GAAGACGGGGATAAGCGCTGTAAGCTT", "ATGGCATCTACTTTGTATGACTATTGCAGAGTGCCCATGGGTGACATCTGTAAGAAAGATGGGGATAAGCGCTGTAAGCTT"))
names(aln) <- c("seq1","seq2","seq3")

### get ungapped seqs
ungapped <- DNAStringSet(gsub("-","",aln))

####### codons function
## codons only works if there are no gaps. How about adding an option on codons that the user can set to allow gap characters? 

codons(aln[[1]])
   # works

codons(aln[[2]])
# Error in .XString.codons(x) : 
#   some trinucleotides in 'x' contain non-base letters

## Might also be useful to allow N characters (the translate function enables this with the if.fuzzy.codon option, but the codons function is more strict):
codons(DNAString("ATGNCA"))
# Error in .XString.codons(x) : 
#   some trinucleotides in 'x' contain non-base letters

translate(DNAString("ATGNCA"), if.fuzzy.codon="solve")
#   2-letter "AAString" instance
# seq: MX

codons(DNAString("ATGNCA"), if.fuzzy.codon="solve")
# Error in codons(DNAString("ATGNCA"), if.fuzzy.codon = "solve") : 
#   unused argument (if.fuzzy.codon = "solve")

####### translation function

# works fine on the ungapped sequences
translate(ungapped)
#  A AAStringSet instance of length 3
#    width seq                                               names               
#[1]    27 MASTLYDYCRVPMGDICKKDGDKRCKL                       seq1
#[2]    22 MASTSYDYCRVPMEDGDKRCKL                            seq2
#[3]    27 MASTLYDYCRVPMGDICKKDGDKRCKL                       seq3

### but translation doesn't work when there are gaps (whether it's a single sequence or an alignment)
translate(aln)
# Error in .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet],  : 
#   in 'x[[2]]': not a base at pos 40

translate(aln[[2]])
# Error in .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet],  : 
#   not a base at pos 40

translate(aln[[1]])
#   27-letter "AAString" instance
# seq: MASTLYDYCRVPMGDICKKDGDKRCKL

### I thought I might be able to define my own genetic code to deal with this, but I can't. Would it be possible to add an option to allow genetic codes to include additional codon types?

gapCodon <- "-"
names(gapCodon) <- "---"
my_GENETIC_CODE <- c(GENETIC_CODE, gapCodon)

translate(aln[[1]], genetic.code=GENETIC_CODE)
#   27-letter "AAString" instance
# seq: MASTLYDYCRVPMGDICKKDGDKRCKL

translate(aln[[1]], genetic.code=my_GENETIC_CODE)
# Error in .normarg_genetic.code(genetic.code) : 
#   'genetic.code' must have the same names as predefined constant GENETIC_CODE

### if you do implement translation of gapped sequences, it'll be necessary to deal with cases like the following, where the alignment gap only spans part of a codon. Depending on what my downstream usage is, I usually either mark this in translated sequences as a frameshift (i.e. translate to X or to ! character) or I include a gap at that position in the translated sequence. 
translate(DNAString("ATG-CA"))

####### session info

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS/LAPACK: /app/easybuild/software/OpenBLAS/0.2.18-GCC-5.4.0-2.26-LAPACK-3.6.1/lib/libopenblas_prescottp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] Biostrings_2.52.0   XVector_0.24.0      IRanges_2.18.1     
[4] S4Vectors_0.22.0    BiocGenerics_0.30.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.30.0 compiler_3.6.1 
jayoung commented 5 years ago

PS no need to hurry to implement this for my benefit - I made my own functions that (I think) do what I want. Not elegant, and I'm sure not robust to the user doing unintended things, but they work at least on my own alignments.

Still might be a useful thing to implement in Biostrings (although maybe I should be using some other package for this?)

if you're curious:

## myAln is a DNAStringSet with a gapped alignment
getCodons <- function(myAln) {
    seqs <- as.character(myAln)
    len <- width(myAln)[1]
    starts <- seq(from=1, to=len, by=3)
    ends <- starts + 2
    myViews <- lapply(myAln, function(x) { 
        Views(x, starts, ends)
    })
    myCodons <- lapply(myViews, function(x) {
        as.character(DNAStringSet(x))
    })
    myCodons
}

## myCodons is a simple character vector, each item is a codon (like one of the items in a list generated by getCodons)
translateCodons <- function(myCodons, unknownCodonTranslatesTo="-") {
    ## make new genetic code
    gapCodon <- "-"
    names(gapCodon) <- "---"
    my_GENETIC_CODE <- c(GENETIC_CODE, gapCodon)

    ## translate the codons
    pep <- my_GENETIC_CODE[myCodons]

    ## check for codons that were not possible to translate, e.g. frameshift codons
    if (sum(is.na(pep))>0) {
        cat("\nwarning - there were codons I could not translate. Using this character", unknownCodonTranslatesTo, "\n\n")
        pep[ which(is.na(pep)) ] <- unknownCodonTranslatesTo
    }

    ## prep for output
    pep <- paste(pep, collapse="")
    return(pep)
}

## wrap those functions together into one:
translateGappedAln <- function(myAln, unknownCodonTranslatesTo="-") {
    myCodons <- getCodons(myAln)
    myAAaln <- AAStringSet(unlist(lapply(myCodons, translateCodons, unknownCodonTranslatesTo=unknownCodonTranslatesTo)))
    return(myAAaln)
}

## tests:
test1 <- DNAStringSet( c("ATGGCTGCGCGGGGC", "ATGGCTGGGCGGGGC") )
test2 <- DNAStringSet( c("ATGGCTGCGCGGGGC", "ATGGCTG-GCGGGGC") )
translateGappedAln(test1)
translateGappedAln(test2)
translateGappedAln(test2, unknownCodonTranslatesTo="X")
jayoung commented 5 years ago

I now see that there's another class that I could have use for my alignments: ?DNAMultipleAlignment but I don't think it helps me with codons/translation.

hpages commented 5 years ago

Hi Janet, I'll look at this after the next BioC release (next week). Thanks! H.

sdalin commented 4 years ago

Any update on getting codons and translations from sequences with gaps? I'm having the same issue.

jayoung commented 4 years ago

hi @sdalin,

I don't know if Herve had time to add anything to Biostrings, but in the meantime here is an ad hoc solution I used.

I only accounted for the usual 20 codons plus "---", so if you have others present you might need to modify, depending on what you want the output to look like. "---" translates to "-", and you can choose what unknown codons translate to (default is "-").

Janet


library(Biostrings)

######### some functions to translate gapped alignments:

## getCodons - a function to split sequences into codons.
# input (myAln) is a DNAStringSet with a gapped alignment
# output is a simple list, one element for each sequence. Each list element is a character vector of each codon
getCodons <- function(myAln) {
    seqs <- as.character(myAln)
    len <- width(myAln)[1]
    starts <- seq(from=1, to=len, by=3)
    ends <- starts + 2
    myViews <- lapply(myAln, function(x) { 
        Views(x, starts, ends)
    })
    myCodons <- lapply(myViews, function(x) {
        as.character(DNAStringSet(x))
    })
    myCodons
}

## translateCodons - takes a character vector of codons as input, outputs the corresponding amino acids
translateCodons <- function(myCodons, unknownCodonTranslatesTo="-") {
    ## make new genetic code
    gapCodon <- "-"
    names(gapCodon) <- "---"
    my_GENETIC_CODE <- c(GENETIC_CODE, gapCodon)

    ## translate the codons
    pep <- my_GENETIC_CODE[myCodons]

    ## check for codons that were not possible to translate, e.g. frameshift codons
    if (sum(is.na(pep))>0) {
        cat("\nwarning - there were codons I could not translate. Using this character", unknownCodonTranslatesTo, "\n\n")
        pep[ which(is.na(pep)) ] <- unknownCodonTranslatesTo
    }

    ## prep for output
    pep <- paste(pep, collapse="")
    return(pep)
}

## wrap the getCodons and translateCodons functions together into one:
translateGappedAln <- function(myAln, unknownCodonTranslatesTo="-") {
    myCodons <- getCodons(myAln)
    myAAaln <- AAStringSet(unlist(lapply(myCodons, translateCodons, unknownCodonTranslatesTo=unknownCodonTranslatesTo)))
    return(myAAaln)
}

## test those functions:
test1 <- DNAStringSet( c("ATGGCTGCGCGGGGC", "ATGGCTGGGCGGGGC") )
test2 <- DNAStringSet( c("ATGGCTGCGCGGGGC", "ATGGCTG-GCGGGGC") )
translateGappedAln(test1)
translateGappedAln(test2)
translateGappedAln(test2, unknownCodonTranslatesTo="X")
ahl27 commented 1 year ago

This issue has been open a while, but I do have an update: I'm currently working on resolving this, hoping to have a final solution by end of next week. Currently have a working first implementation for translate() at https://github.com/ahl27/Biostrings/tree/AllowGaps30. Working on codons(...) next.

Current implementation converts --- to -, anything else with a gap throws an error.

> translate(DNAString("ATGATG")
2-letter AAString object
seq: MM

> translate(DNAString("ATG---ATG"))
3-letter AAString object
seq: M-M

> translate(DNAString("ATG---ATG"), if.fuzzy.codon='solve')
3-letter AAString object
seq: M-M

> translate(DNAString("ATGA-AATG"))
Error in .Call2("DNAStringSet_translate", x, skip_code, gap_code, dna_codes[codon_alphabet],  : 
  unable to resolve gap codon at pos 4-6

Adding in an option to convert partial matches something else (ex. translate(DNAString("ATGA-AATG")) returning M.M or MXM) would be fairly simple, I'll add that after the first pass is done.