predictCoding() and changes in start/stop codons?

jayoung commented 2 years ago

Hi there,

I have a SNP in the first codon of a gene (ATG->ATA). This gets annotated by predictCoding() as a nonsynonymous change. As a biologist, I see this start codon loss as a much more interesting/extreme change than an ATG->ATA change that would occur in the middle of the ORF.

It's been a while since I used other tools that annotated SNP consequences (Ensembl VEP) but I think they call it something like 'start_lost'.

What would you think about having predictCoding do something similar? For the moment I have a hack to do it myself:

mcols(y)$CONSEQUENCE <- as.character(mcols(y)$CONSEQUENCE)
testLostStart <- mcols(y)$CONSEQUENCE=="nonsynonymous" & 
    sapply(mcols(y)$PROTEINLOC, function(x) { 1 %in% x } )
if(sum(testLostStart) > 0) {
    mcols(y)[which(testLostStart),"CONSEQUENCE"] <- "start_lost"
}

The same question applies to stop codons - we biologists are very interested in SNPs that cause loss of a stop codon, but I am guessing those won't be detected by predictCoding() (my guess is just based on the possible values of CONSEQUENCE).

What would you think about also checking for stop_lost mutations?

thanks for thinking about it! all the best,

Janet

vjcitn commented 2 years ago

Thank you for this question. We might be able to introduce these new consequence tags. Because the maintainer of VariantAnnotation has left the project we are very conservative in making changes. We do have a package that interfaces to ensembleVEP -- have you tried it, and does it meet your needs/interests? If not, would you develop this issue a little further with example inputs that can be used to test the new code if we do introduce these enhancements to VariantAnnotation?

vjcitn commented 2 years ago

Should I close this issue @jayoung ?

jayoung commented 2 years ago

Hi Vince,

sorry to drop the ball on this! These days my attention is scattered between many projects. Is it Valerie who was the maintainer who left? I knew her a little bit in the Fred Hutch days: sad for us that she's moved on! Hope she's doing something fun.

As well as start-loss mutations, I just noticed that non-frameshifting insertions don't appear at at all in the output of predictCoding() - that is quite a problem: I just realised I lost some variants I should have been looking at!

Could I suggest adding a warning to ?predictCoding to let the user know that predictCoding() is pretty bare-bones, and it might be better to use ensemblVEP etc, as some functional variation might be missing from the output? I almost didn't notice I'd been missing some indels.

I have used ensemblVEP in the past for human SNPs. My current project uses a reference genome that Ensembl doesn't natively support (herpes virus), hence the appeal of a ready-to-go function like predictCoding(). Digging in a bit, I think I can prep my gff files in a way that'll let me use ensemblVEP (https://uswest.ensembl.org/info/docs/tools/vep/script/vep_custom.html) after a few steps. I could also try snpeff or other tools if that doesn't work out.

I'll go ahead and see if I can get ensemblVEP/whatever running for my current project, but if anyone on your end has the bandwidth to work on this I would love to provide some examples. As you can see from this issue I might not move very fast on it.

thanks,

Janet

jayoung commented 2 years ago

hi @vjcitn,

I did go ahead and explore some reproducible examples. There are a few other inconsistencies that make me quite reluctant to use predictCoding() for the moment - I'll stick to ensemblVEP instead. If 'predictCoding()` is going to stick around it seems worth exploring upgrading it to deal with these types of variants, although I don't know how much of a pain that would be.

The only non-reproducible part of my example is that you'll have to fiddle with the paramVEP <- VEPFlags line to get it working on your system.

thanks very much,

Janet

library("ensemblVEP")
library("VariantAnnotation")

## load yeast txdb and genome for a small reproducible example
library("TxDb.Scerevisiae.UCSC.sacCer3.sgdGene")
library("BSgenome.Scerevisiae.UCSC.sacCer3")

## set up params for ensemblVEP
paramVEP <- VEPFlags( version=104,
                      scriptPath="/Users/jayoung/bin/ensembl-vep-release-104.3/vep",
                      flags=list(species="saccharomyces_cerevisiae", 
                                 distance=0, # so that we don't report upstream/downstream consequences
                                 database=FALSE, cache=TRUE )  )
txdb <- TxDb.Scerevisiae.UCSC.sacCer3.sgdGene

##### make some fake SNPs. I'll put them in a + strand gene so they're easier to check. 
# I chose the YAL003W gene:
chrIgenes <- genes(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene)[which(
    seqnames(genes(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene))=="chrI" & 
        strand(genes(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene))=="+" )]
head(chrIgenes,2)
# YAL002W     chrI 143707-147531      + |     YAL002W
# YAL003W     chrI 142174-143160      + |     YAL003W

## 1. a start lost mutation is classified as nonsynonymous. start_lost is biologically way more impactful - seems misleading to call it simply nonsynonymous.
snp1 <- VCF(rowRanges = GRanges(seqnames="chrI", ranges=IRanges(start=142174,width=1)))
ref(snp1) <- DNAStringSet(c("A"))
alt(snp1) <- DNAStringSetList("G")
snp1result <- predictCoding(snp1, txdb, BSgenome.Scerevisiae.UCSC.sacCer3)
snp1resultVEP <- ensemblVEP(writeVcf(snp1, "/Users/jayoung/Desktop/snp1.vcf", index=FALSE), paramVEP)
snp1result$CONSEQUENCE
# [1] nonsynonymous
snp1resultVEP$Consequence
# [1] "start_lost"

## 2. a 3bp deletion gets called a synonymous change (sometimes gets called nonsynonymous - depends on the context) - again, can lead to potentially high impact change being ignored. ensemblVEP calls it an inframe_deletion
snp2 <- VCF(rowRanges = GRanges(seqnames="chrI", ranges=IRanges(start=142177,width=1), 
                                expectedResult="nonsynonymous"))
ref(snp2) <- DNAStringSet(c("GCAT"))
alt(snp2) <- DNAStringSetList("G")
snp2result <- predictCoding(snp2, txdb, BSgenome.Scerevisiae.UCSC.sacCer3)
snp2resultVEP <- ensemblVEP(writeVcf(snp2, "/Users/jayoung/Desktop/snp2.vcf", index=FALSE), paramVEP)
snp2result$CONSEQUENCE
# [1] synonymous
snp2resultVEP$Consequence
# [1] "inframe_deletion"

## 3. a 3bp insertion gets called nonsynonymous - it kind of is, but perhaps that's downplaying its functional impact. ensemblVEP calls it inframe_insertion
snp3 <- VCF(rowRanges = GRanges(seqnames="chrI", ranges=IRanges(start=142180,width=1), 
                                expectedResult="insertion"))
ref(snp3) <- DNAStringSet(c("T"))
alt(snp3) <- DNAStringSetList("TCGA")
snp3result <- predictCoding(snp3, txdb, BSgenome.Scerevisiae.UCSC.sacCer3)
snp3resultVEP <- ensemblVEP(writeVcf(snp3, "/Users/jayoung/Desktop/snp3.vcf", index=FALSE), paramVEP)
snp3result$CONSEQUENCE
# [1] nonsynonymous
snp3resultVEP$Consequence
# [1] "inframe_insertion"

## 4. stop-lost mutation is classified as nonsynonymous. stop_lost is biologically way more impactful - seems misleading to call it simply nonsynonymous.
snp4 <- VCF(rowRanges = GRanges(seqnames="chrI", ranges=IRanges(start=143158,width=1), 
                                expectedResult="stopLost"))
ref(snp4) <- DNAStringSet(c("T"))
alt(snp4) <- DNAStringSetList("G")
snp4result <- predictCoding(snp4, txdb, BSgenome.Scerevisiae.UCSC.sacCer3)
   # (get a warning message from predictCoding about out-of-bound ranges - not sure why)
snp4resultVEP <- ensemblVEP(writeVcf(snp4, "/Users/jayoung/Desktop/snp4.vcf", index=FALSE), paramVEP)
snp4result$CONSEQUENCE
# [1] nonsynonymous
snp4resultVEP$Consequence
# [1] "stop_lost"

sessionInfo()
# R version 4.1.2 (2021-11-01)
# Platform: x86_64-apple-darwin17.0 (64-bit)
# Running under: macOS Monterey 12.1
# 
# Matrix products: default
# LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
# 
# locale:
#     [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# attached base packages:
#     [1] stats4    stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
#     [1] BSgenome.Scerevisiae.UCSC.sacCer3_1.4.0     BSgenome_1.62.0                            
# [3] rtracklayer_1.54.0                          TxDb.Scerevisiae.UCSC.sacCer3.sgdGene_3.2.2
# [5] GenomicFeatures_1.46.1                      AnnotationDbi_1.56.2                       
# [7] ensemblVEP_1.36.0                           VariantAnnotation_1.40.0                   
# [9] Rsamtools_2.10.0                            Biostrings_2.62.0                          
# [11] XVector_0.34.0                              SummarizedExperiment_1.24.0                
# [13] Biobase_2.54.0                              MatrixGenerics_1.6.0                       
# [15] matrixStats_0.61.0                          GenomicRanges_1.46.1                       
# [17] GenomeInfoDb_1.30.0                         IRanges_2.28.0                             
# [19] S4Vectors_0.32.3                            BiocGenerics_0.40.0                        
# [21] BiocManager_1.30.16                        
# 
# loaded via a namespace (and not attached):
#     [1] bitops_1.0-7             bit64_4.0.5              filelock_1.0.2           RColorBrewer_1.1-2      
# [5] progress_1.2.2           httr_1.4.2               tools_4.1.2              backports_1.4.1         
# [9] utf8_1.2.2               R6_2.5.1                 rpart_4.1-15             Hmisc_4.6-0             
# [13] DBI_1.1.2                colorspace_2.0-2         nnet_7.3-16              prettyunits_1.1.1       
# [17] tidyselect_1.1.1         gridExtra_2.3            curl_4.3.2               bit_4.0.4               
# [21] compiler_4.1.2           htmlTable_2.3.0          xml2_1.3.3               DelayedArray_0.20.0     
# [25] scales_1.1.1             checkmate_2.0.0          rappdirs_0.3.3           stringr_1.4.0           
# [29] digest_0.6.29            foreign_0.8-81           rmarkdown_2.11           base64enc_0.1-3         
# [33] jpeg_0.1-9               pkgconfig_2.0.3          htmltools_0.5.2          dbplyr_2.1.1            
# [37] fastmap_1.1.0            htmlwidgets_1.5.4        rlang_0.4.12             rstudioapi_0.13         
# [41] RSQLite_2.2.9            BiocIO_1.4.0             generics_0.1.1           BiocParallel_1.28.3     
# [45] dplyr_1.0.7              RCurl_1.98-1.5           magrittr_2.0.1           GenomeInfoDbData_1.2.7  
# [49] Formula_1.2-4            Matrix_1.4-0             Rcpp_1.0.7               munsell_0.5.0           
# [53] fansi_0.5.0              lifecycle_1.0.1          stringi_1.7.6            yaml_2.2.1              
# [57] zlibbioc_1.40.0          BiocFileCache_2.2.0      plyr_1.8.6               grid_4.1.2              
# [61] blob_1.2.2               parallel_4.1.2           forcats_0.5.1            crayon_1.4.2            
# [65] lattice_0.20-45          splines_4.1.2            hms_1.1.1                KEGGREST_1.34.0         
# [69] knitr_1.37               pillar_1.6.4             rjson_0.2.20             biomaRt_2.50.1          
# [73] XML_3.99-0.8             glue_1.6.0               evaluate_0.14            latticeExtra_0.6-29     
# [77] data.table_1.14.2        png_0.1-7                vctrs_0.3.8              gtable_0.3.0            
# [81] purrr_0.3.4              assertthat_0.2.1         cachem_1.0.6             ggplot2_3.3.5           
# [85] xfun_0.29                restfulr_0.0.13          survival_3.2-13          tibble_3.1.6            
# [89] GenomicAlignments_1.30.0 memoise_2.0.1            cluster_2.1.2            ellipsis_0.3.2

vjcitn commented 2 years ago

Thanks a lot for this. I will try to get back to this in coming weeks.

Bioconductor / VariantAnnotation

predictCoding() and changes in start/stop codons? #53