KarchinLab / open-cravat

A modular annotation tool for genomic variants
MIT License
113 stars 27 forks source link

Discrepancies in variant classification and variant effect prediction score values when using openCRAVAT and ANNOVAR #65

Closed Jasonmbg closed 3 years ago

Jasonmbg commented 3 years ago

Dear openCRAVAT team,

initially congratulations for your very nice effort and toolkit for the integrative prioritization and annotation of diverse variants !! initially, regarding my post I would like to pinpoint two putative "issues" using openCRAVAT, which require further clarification; briefly, in a WES project concerning somatic point mutations for colorectal cancer, we have performed a pipeline to acquire vcf files annotated with ANNOVAR with gencode and hg19 reference genome, including as "functional" variants namely exonic non-synonymous, stop-gain, stop-loss and splice-site variants. After uploading a test vcf from one patient, and using specific annotators in the webserver version of openCRAVAT, based on the results I noticed the following discrepancies:

1) While in the original vcf file all the variants where classified as exonic and/or splicing, I noticed that in the results of openCRAVAT, around 3 variants have an empty value in the Coding column; for example, in UID 44 and gene C8orf34, is classified with openCRAVAT as an intron whereas in the original vcf as exonic: this might be explained due to the normalization, and the liftover process from hg19 to hg38 ? and due to the putative change of coordinates, exon-intron boundaries might slight differ in specific variants and based on sequence ontology might be changes ?

2) My second important question concerns the coverage and uniformity of variant effect predictor scores, such as VEST4-before using openCRAVAT, I have used the latest version of ANNOVAR with dbnsfp41a-I noticed, that while with ANNOVAR only around 3 of the 116 variants were missing VEST4 numeric score values, in the relative results of openCRAVAT, around 78 variants out of 116 had VEST4 scores-additionally, there were slight variations in the scores for the same variants, but I suppose this might be of different versions or types of exact scores used ? Do you have in mind any putative issues that might explain this difference ? For example, older versions or normalized versus "raw" scores ?

3) Finally, regarding the liftover process and the output results when clicking to include hg19 coordinates-if I have understood well, chrom, REF and ALT bases remain the same, and essentially the position might vary after the liftover process, and that is why columns Q and R are reported ? and of course this might affect the variant classification, sequence ontology and relative protein changes ? but anyhow I can safely match any information from my original vcf to the results, in order to match the input vcf variants with the returned ones ?

For further utilization, I have attached in the following zip folder the 3 respective files:

Whereas the file "test.patient01.9columns.vcf" is related to the original vcf input file, the "ANNOVAR.ACCC.merged.score.hg19_multianno" is the output of scoring process with ANNOVAR and the file with name "test.patient01.9columns.vcf" refers to the output of openCRAVAT-

VCF_cut.zip

With Kind Regards,

Efstathios

rkimoakbioinformatics commented 3 years ago

@Jasonmbg Thanks for this report. My comments are below.

  1. The reason was the way OpenCRAVAT chose a "representative" transcript for a variant. In column M ("All Mappings"), which shows variant consequences in all the transcripts the variant mapped to, the transcript where the variant causes missense exists. OpenCRAVAT chooses one transcript from column M to populate columns I to L. Default is to use MANE. For C8orf34, the MANE transcript is ENST00000518698, in which the variant mapped to an intron region. If you do not want to use MANE transcripts, then there is a workaround, albeit a bit cumbersome for now, shown below.

    touch nomane
    oc run [input file path and other options] --primary-transcript nomane

    This will bypass using MANE and the next block of logic for choosing representative transcripts will be applied, which is the most severe consequence and then the longest transcript. This will promote the missense consequence to columns I to L.

  2. Thanks for letting us know. VEST module 4.3.3 has a fix for it (caused by Ensembl transcript version mismatch), but was not installed on the public server. I have just installed the newest version. I tested it with some variants which did not have VEST score in your run and they now get VEST score.

  3. OpenCRAVAT normalizes input variants (for example, changing chr1:10000:ATT>A to chr1:10001:TTdel) and then performs liftover. The hg19 coordinate columns show the hg19-coordinate positions after the normalization. Thus, hg19 position column may not match original input positions exactly (due to the normalization). This was pointed out recently, and we are considering showing original positions (before normalization and before liftover) so that the hg19 position column can be more convenient for finding variants. We'll write here when such change is made. It'll be soon.

Jasonmbg commented 3 years ago

Dear @rkimoakbioinformatics,

thank you very much for your quick and comprehensive answer !! Just three quick comments in order to fully understand your notion:

1) Concerning question 1 and the different transcripts-I fully agree that this is quite complex and it is not a direct answer, as for example ANNOVAR for annotation outputs all the possible variant isoform transcripts, whereas VEP and other tools utilize canonical or the aformentioned "MANE" transcripts-the approach you suggest, is based on annotating the vcf files through python and command line correct ?

2) Thank you also for your confirmation and for taking care of this-I will re-run the initial vcf along with the rest of the files-I would suppose this update is also immediately accessible through the web-server ?

3) Regarding the final 3rd question, just a crucial clarification: as my major goal, is to fetch any important annotation returned from openCRAVAT based on variant effect prediction scores and cancer annotation databases, any changes that you pinpointed might affect the genomic position, and perhaps the sequence classification (such as the intron-exon classification), but not any other information, correct ? And the variants returned, match the exact order of the input variants, right ? My notion for this, is regardless of the putative slight changes after hg38 liftover, to match and fetch any information to the original input variants, and perform some downstream filtering; thus, as the actual alteration and the chromosome (i.e. C>A) is not affected, thus any direct matching would be robust, correct ?

rkimoakbioinformatics commented 3 years ago

@Jasonmbg No problem. My comments are below.

As far as I know, VEP either show all mapped transcripts or selected ones, based on options (link) and ANNOVAR shows one randomly chosen transcript or all transcripts based on options. OpenCRAVAT shows all transcripts in "All Transcripts" column and one chosen transcript in "Transcript" to "Protein Change" columns. This behavior of OpenCRAVAT is the same whether it is used through Python or command-line. Let me know if I did not answer your question.

Yes, the update is in run.opencravat.org also.

any changes that you pinpointed might affect the genomic position, and perhaps the sequence classification (such as the intron-exon classification), but not any other information, correct ?

Yes (Just to make sure, OpenCRAVAT's normalization does not create any "wrong" change in variants).

And the variants returned, match the exact order of the input variants, right ? 

Yes.

thus, as the actual alteration and the chromosome (i.e. C>A) is not affected, thus any direct matching would be robust, correct ?

Actually, if your input is in VCF format, your result file should have "extra vcf info" column group with "vcf position", "vcf ref", and "vcf alt" columns. They will have original input position, ref, and alt bases before normalization and liftover. You can use these columns to find original input variants. Let me know if it was not your question.

Jasonmbg commented 3 years ago

Dear @rkimoakbioinformatics ,

thank you one more time for your consideration on this matter !! Based on your answers some final comments:

  1. Yes thank you for pointing out, this is crystal clear-actually the "major difference" is that with ANNOVAR you can't choose one "representative transcript", but rather by default it outputs all possible transcripts; thus, in our case, we have to consider the limitation of various scores that do not assign a single score, but like VEST4 that outputs all possible scores for isoform transcript variants (I mentioned ANNOVAR as before using openCRAVAT, it has been used in our pipeline just for exonic variant annotation, but not for any further scoring or clinical annotation)

2, 3. Thank you for your verification and please excuse me for any misunderstanding, I did not meant that openCRAVAT might return any "putatively" wrong variants, just only for the aformentioned differences concerning the liftover process

  1. Just for a final but very important verification: I also downloaded the vcf option, but as far as I tried through R I'm mostly working (tidyverse) and also command line, it is rather complex or challenging to extract all the annotation information from the INFO column, as it has additionally previous included columns, and it is a bit chaotic; additionally, in the INFO column I could not find-from a first glance-any COSMIC or CIVIC annotations, except the VEST4 scores:

test.patient01.9columns.vcf.vcf.zip

thus, in your opinion as I would like to utilize all the available returned information (preferably through the excel format option from openCRAVAT), to create a "unified" score for each variant in each patient---as the order of the variants is the same as the input, could I just simply take the columns/annotators of interest in the variant section results, and just match them directly with my original input variants ? This would suffice correct ? As chromosome, REF and ALT remain the same, and only the position and some downstream consequences might be affected, as already discussed ?

Thank you in advance,

Efstathios

rkimoakbioinformatics commented 3 years ago

@Jasonmbg There is an RData-format reporter for OpenCRAVAT. I have installed it on the run.opencravat.org server. You can generate and download RData files for your jobs at the server. All the annotation columns from OpenCRAVAT analysis are available as a dataframe with RData-format files. Please let me know if this format would work for your purpose.

Jasonmbg commented 3 years ago

Dear @rkimoakbioinformatics ,

thanks a million for your consideration and effort spent for this very helpful availability-at a first glance I checked an example and everything works, I will provide additional feedback or create an additional post for any extra questions related -I also saw that also there are more created VEST4 scores returned after the update you have mentioned :)

Thank you very much for your overall help and suggestions !!

rkimoakbioinformatics commented 3 years ago

@Jasonmbg Great. Just one thing with the RData file: you can get table-in-table data as follows.

>d<-get(load("input.vcf.variant.RData"))
>library(jsonlite)
>fromJSON(d[13, "cosmic__variant_count_tissue"])
     [,1]             [,2]   
[1,] "tissue"         "count"
[2,] "stomach"        "3"    
[3,] "salivary gland" "1" 
Jasonmbg commented 3 years ago

Dear @rkimoakbioinformatics,

thank you for pointing this useful command-I suppose that this is also applicable for "similar" constructed columns, such as Cancer Types for Cancer Hotspots ? and possibly I suppose this is more helpful to create two subcolumns for better readability of these results;

Finally, two last "naive" comment-I noticed from some of the results:

1) in some variants, while the Ref base is G and the Alt is A, the cDNA change is mentioned as c.350C>T: this is something related with positive/negative strand direction ?

2) In the relative log file of most of the uploaded vcf files, I noticed the following message part like the following: 2021/04/21 10:11:27 cravat.converter Traceback (most recent call last): File "/usr/local/lib/python3.6/site-packages/cravat/cravat_convert.py", line 424, in run converter.addl_operation_for_unique_variant(wdict, no_unique_var) File "/usr/local/lib/python3.6/site-packages/cravat/modules/converters/vcf-converter/vcf-converter.py", line 297, in addl_operation_for_unique_variant info_desc = self._reader.infos[info_name] KeyError: 'ACGTNacgtnPLUS' 2021/04/21 10:11:28 cravat.converter error lines: 0

This KeyError illustrates a specific issue of importance ?

Best,

Efstathios

rkimoakbioinformatics commented 3 years ago

@Jasonmbg Yes, it'll be the same for other columns with table-like data. Can you explain a bit what you mean by two sub-columns?

  1. Yes, if a transcript is on the minus strand and the input is on the plus strand, the bases will be changed.

  2. The error was caused because INFO field had data for "ACGTNacgtnPLUS", but this INFO field key was not defined in the header of the input file. As a result, the whole INFO field was not processed properly. You'll see that "extra vcf info" fields are all empty. If you add something like below in the header of your input file, the result will probably have the INFO field extracted:

    ##INFO=<ID=ATCGNatcgnPLUS,Number=10,Type=Integer,Description="ATGCNatcgnPLUS">
    ##INFO=<ID=ATCGNatcgnMINUS,Number=10,Type=Integer,Description="ATGCNatcgnMINUS">
Jasonmbg commented 3 years ago

Dear @rkimoakbioinformatics,

please apologies for the late reply, sometimes different timezones are not flexible :)

For my question regarding the two sub-columns: in order to be more accurate, I would like from the specific columns like "cosmic__variant_count_tissue", and cancer_hotspots_samples to divide them into two or three sub-columns; for example, like in the column cosmic count tissue to have one column like cosmic_tissue and one additional cosmic_count, etc-but I would suppose this is a bit complicated, as one line might have different tissues with different variant counts and/or empty scores;

For the additional questions:

  1. Thank you for the prompt explanation
  2. Thank you in addition for pointing this out-I suppose this is not something "important", and I can ignore the relative part as far as I have included the crucial information that openCRAVAT utilizes from my respective vcf file, correct ?

Best Regards,

Efstathios

rkimoakbioinformatics commented 3 years ago

Yes, OpenCRAVAT's output will still be in report files even if the INFO field was not processed successfully.

Regarding sub-columns in R, for now one way is to use jsonlite library, as shown in https://github.com/KarchinLab/open-cravat/issues/65#issuecomment-823598323.

Jasonmbg commented 3 years ago

Dear @rkimoakbioinformatics ,

sorry to return again regarding this specific post, but I noticed something weird:

when for the same patient (on your above example) I used the same commands:

fromJSON(d[13, "cosmic__variant_count_tissue"])
Error: lexical error: invalid string in json text.
                                       tissue  count 0         stomach
                     (right here) ------^

I noticed the above error;

A similar error happens also in another line that has various entries in the respective COSMIC column:

fromJSON(d[67, "cosmic__variant_count_tissue"])
Error: lexical error: invalid string in json text.
                                       tissue  count 0                
                     (right here) ------^

You thing that is a package related version issue ? From my sessionInfo:

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] jsonlite_1.7.2

loaded via a namespace (and not attached):
  [1] colorspace_2.0-0              ellipsis_0.3.1               
  [3] rprojroot_2.0.2               XVector_0.30.0               
  [5] GenomicRanges_1.42.0          fs_1.5.0                     
  [7] remotes_2.2.0                 ggrepel_0.9.1                
  [9] bit64_4.0.5                   interactiveDisplayBase_1.28.0
 [11] AnnotationDbi_1.52.0          fansi_0.4.2                  
 [13] splines_4.0.3                 cachem_1.0.4                 
 [15] knitr_1.31                    maftools_2.6.05              
 [17] pkgload_1.2.0                 bcellViper_1.26.0            
 [19] annotate_1.68.0               dbplyr_2.1.0                 
 [21] shiny_1.6.0                   data.tree_1.0.0              
 [23] BiocManager_1.30.10           readr_1.4.0                  
 [25] compiler_4.0.3                httr_1.4.2                   
 [27] assertthat_0.2.1              Matrix_1.3-2                 
 [29] fastmap_1.1.0                 limma_3.46.0                 
 [31] cli_2.3.1                     later_1.1.0.1                
 [33] htmltools_0.5.1.1             prettyunits_1.1.1            
 [35] tools_4.0.3                   gtable_0.3.0                 
 [37] glue_1.4.2                    GenomeInfoDbData_1.2.4       
 [39] dplyr_1.0.5                   rappdirs_0.3.3               
 [41] tinytex_0.30                  Rcpp_1.0.6                   
 [43] limSolve_1.5.6                Biobase_2.50.0               
 [45] cellranger_1.1.0              vctrs_0.3.7                  
 [47] RaggedExperiment_1.14.1       preprocessCore_1.52.1        
 [49] nlme_3.1-152                  xfun_0.22                    
 [51] stringr_1.4.0                 ps_1.6.0                     
 [53] testthat_3.0.2                mime_0.10                    
 [55] lpSolve_5.6.15                lifecycle_1.0.0              
 [57] devtools_2.3.2                XML_4.0-0                    
 [59] AnnotationHub_2.22.0          edgeR_3.32.1                 
 [61] scales_1.1.1                  MASS_7.3-53.1                
 [63] zlibbioc_1.36.0               vroom_1.4.0                  
 [65] promises_1.2.0.1              hms_1.0.0                    
 [67] MatrixGenerics_1.2.1          parallel_4.0.3               
 [69] SummarizedExperiment_1.20.0   RColorBrewer_1.1-2           
 [71] yaml_2.2.1                    curl_4.3                     
 [73] gridExtra_2.3                 memoise_2.0.0                
 [75] ggplot2_3.3.3                 TCGAmutations_0.3.0          
 [77] EPIC_1.1.5                    dorothea_1.2.1               
 [79] stringi_1.5.3                 RSQLite_2.2.3                
 [81] BiocVersion_3.12.0            immunedeconv_2.0.3           
 [83] genefilter_1.72.1             S4Vectors_0.28.1             
 [85] desc_1.3.0                    BiocGenerics_0.36.0          
 [87] pkgbuild_1.2.0                BiocParallel_1.24.1          
 [89] testit_0.12                   GenomeInfoDb_1.26.4          
 [91] rlang_0.4.10                  pkgconfig_2.0.3              
 [93] matrixStats_0.58.0            bitops_1.0-6                 
 [95] evaluate_0.14                 lattice_0.20-41              
 [97] purrr_0.3.4                   bit_4.0.4                    
 [99] processx_3.4.5                tidyselect_1.1.0             
[101] plyr_1.8.6                    magrittr_2.0.1               
[103] R6_2.5.0                      IRanges_2.24.1               
[105] generics_0.1.0                DelayedArray_0.16.2          
[107] DBI_1.1.1                     pillar_1.5.1                 
[109] withr_2.4.1                   mgcv_1.8-34                  
[111] survival_3.2-7                RCurl_1.98-1.3               
[113] tibble_3.1.0                  crayon_1.4.1                 
[115] utf8_1.1.4                    BiocFileCache_1.14.0         
[117] rmarkdown_2.7                 usethis_2.0.1                
[119] locfit_1.5-9.4                grid_4.0.3                   
[121] readxl_1.3.1                  sva_3.38.0                   
[123] data.table_1.14.0             blob_1.2.1                   
[125] callr_3.5.1                   digest_0.6.27                
[127] progeny_1.12.0                xtable_1.8-4                 
[129] tidyr_1.1.3                   httpuv_1.5.5                 
[131] munsell_0.5.0                 stats4_4.0.3                 
[133] sessioninfo_1.1.1             quadprog_1.5-8     

The only thing I managed with another way and initial manipulation through tidyverse (not entirely correct):

d[13,"cosmic__variant_count_tissue"] %>% read_tsv() %>% separate(col = "tissue  count",into = c("num", "tissue", "dunno"))
# A tibble: 2 x 3
  num   tissue   dunno
  <chr> <chr>    <chr>
1 0     stomach  3    
2 1     salivary gland
Warning message:
Expected 3 pieces. Additional pieces discarded in 1 rows [2].

Thank you in advance,

Efstathios

rkimoakbioinformatics commented 3 years ago

@Jasonmbg rdatareporter has been updated to v1.2.0. It can be used as follows:

>oc report example_input.sqlite -t rdata
>R
>data <- get(load("example_input.variant.RData"))
>library(jsonlite)
>data[23, 18]
[1] "[{\"tissue\": \"liver\", \"count\": 3}, {\"tissue\": \"breast\", \"count\": 1}]"
>fromJSON(data[23, 18])
  tissue count
1  liver     3
2 breast     1
>fromJSON(d[23, 18])["tissue"]
  tissue
1  liver
2 breast

Let me know if this would work for your purpose.

Jasonmbg commented 3 years ago

Dear @rkimoakbioinformatics ,

thank you for your updated message and news !! unfortunately new issues appeared, I will try to recap as following:

1) Please excuse me for any naïve question, but before mentioning jsonlite in the above command chunk oc report example_input.sqlite -t rdata

This is a python command ? I have also to run it ? because so far I have used only the openCRAVAT web server without running any local command line and/or other installation modules

2) For reproducibility, I downloaded for patient01-the same we have discussed in the aforementioned comments, and I used your exact commands but:

d[23, 18]
[1] 43032469

fromJSON(d[23, 18])
Error: Argument 'txt' must be a JSON string, URL or file.

So, by the above command it behaves like a classical data frame, and with d[23,18] it returns the 18th column-that is the hg19 chromosomal position for the 23rd variant

However, when I used like previously

fromJSON(d[13, "cosmic__variant_count_tissue"])
     [,1]             [,2]   
[1,] "tissue"         "count"
[2,] "stomach"        "3"    
[3,] "salivary gland" "1"  

It seems that it works with the previous command line;

3) More concerning, is from curiosity I tried to create additional RData files for the rest of the patients that I have run already-however, when tried to create the RData file, the following error appeared: Issue openCRAVAT webServer

Best regards,

Efstathios

rkimoakbioinformatics commented 3 years ago

@Jasonmbg Thanks for letting us know. Indeed I was assuming that you were using command-line. You don't have to run oc report ... if you use the server. The server should work now. RData report generation option should show on your jobs. Please re-generate RData report files for your jobs. After RData report files have been downloaded and loaded in R, I guess that you already know these commands, but something like the below can be done:

>d<-get(load("~/Downloads/example_input.variant.RData"))
>rowno = 23
>cosmic = fromJSON(d[rowno, which(colnames(d) == "cosmic__variant_count_tissue")])
>cosmic
  tissue count
1  liver     3
2 breast     2
>cosmic["tissue"]
  tissue
1  liver
2 breast
>cosmic["count"]
  count
1     3
2     2
Jasonmbg commented 3 years ago

Dear @rkimoakbioinformatics thank you also for your immediate feedback and updates :)

Yes downloading works properly, and I will repeat the downloading and creation of RData report files for all the analyzed patients-Hopefully anything runs smoothly, and I will provide updates based on a merged score I'm currently creating based on the overall openCRAVAT information;

Thank you one more time for your valuable help, support and time for my project !!

Best,

Efstathios