Closed Jasonmbg closed 3 years ago
@Jasonmbg Thanks for this report. My comments are below.
The reason was the way OpenCRAVAT chose a "representative" transcript for a variant. In column M ("All Mappings"), which shows variant consequences in all the transcripts the variant mapped to, the transcript where the variant causes missense exists. OpenCRAVAT chooses one transcript from column M to populate columns I to L. Default is to use MANE. For C8orf34, the MANE transcript is ENST00000518698, in which the variant mapped to an intron region. If you do not want to use MANE transcripts, then there is a workaround, albeit a bit cumbersome for now, shown below.
touch nomane
oc run [input file path and other options] --primary-transcript nomane
This will bypass using MANE and the next block of logic for choosing representative transcripts will be applied, which is the most severe consequence and then the longest transcript. This will promote the missense consequence to columns I to L.
Thanks for letting us know. VEST module 4.3.3 has a fix for it (caused by Ensembl transcript version mismatch), but was not installed on the public server. I have just installed the newest version. I tested it with some variants which did not have VEST score in your run and they now get VEST score.
OpenCRAVAT normalizes input variants (for example, changing chr1:10000:ATT>A to chr1:10001:TTdel) and then performs liftover. The hg19 coordinate columns show the hg19-coordinate positions after the normalization. Thus, hg19 position column may not match original input positions exactly (due to the normalization). This was pointed out recently, and we are considering showing original positions (before normalization and before liftover) so that the hg19 position column can be more convenient for finding variants. We'll write here when such change is made. It'll be soon.
Dear @rkimoakbioinformatics,
thank you very much for your quick and comprehensive answer !! Just three quick comments in order to fully understand your notion:
1) Concerning question 1 and the different transcripts-I fully agree that this is quite complex and it is not a direct answer, as for example ANNOVAR for annotation outputs all the possible variant isoform transcripts, whereas VEP and other tools utilize canonical or the aformentioned "MANE" transcripts-the approach you suggest, is based on annotating the vcf files through python and command line correct ?
2) Thank you also for your confirmation and for taking care of this-I will re-run the initial vcf along with the rest of the files-I would suppose this update is also immediately accessible through the web-server ?
3) Regarding the final 3rd question, just a crucial clarification: as my major goal, is to fetch any important annotation returned from openCRAVAT based on variant effect prediction scores and cancer annotation databases, any changes that you pinpointed might affect the genomic position, and perhaps the sequence classification (such as the intron-exon classification), but not any other information, correct ? And the variants returned, match the exact order of the input variants, right ? My notion for this, is regardless of the putative slight changes after hg38 liftover, to match and fetch any information to the original input variants, and perform some downstream filtering; thus, as the actual alteration and the chromosome (i.e. C>A) is not affected, thus any direct matching would be robust, correct ?
@Jasonmbg No problem. My comments are below.
As far as I know, VEP either show all mapped transcripts or selected ones, based on options (link) and ANNOVAR shows one randomly chosen transcript or all transcripts based on options. OpenCRAVAT shows all transcripts in "All Transcripts" column and one chosen transcript in "Transcript" to "Protein Change" columns. This behavior of OpenCRAVAT is the same whether it is used through Python or command-line. Let me know if I did not answer your question.
Yes, the update is in run.opencravat.org also.
any changes that you pinpointed might affect the genomic position, and perhaps the sequence classification (such as the intron-exon classification), but not any other information, correct ?
Yes (Just to make sure, OpenCRAVAT's normalization does not create any "wrong" change in variants).
And the variants returned, match the exact order of the input variants, right ?
Yes.
thus, as the actual alteration and the chromosome (i.e. C>A) is not affected, thus any direct matching would be robust, correct ?
Actually, if your input is in VCF format, your result file should have "extra vcf info" column group with "vcf position", "vcf ref", and "vcf alt" columns. They will have original input position, ref, and alt bases before normalization and liftover. You can use these columns to find original input variants. Let me know if it was not your question.
Dear @rkimoakbioinformatics ,
thank you one more time for your consideration on this matter !! Based on your answers some final comments:
2, 3. Thank you for your verification and please excuse me for any misunderstanding, I did not meant that openCRAVAT might return any "putatively" wrong variants, just only for the aformentioned differences concerning the liftover process
test.patient01.9columns.vcf.vcf.zip
thus, in your opinion as I would like to utilize all the available returned information (preferably through the excel format option from openCRAVAT), to create a "unified" score for each variant in each patient---as the order of the variants is the same as the input, could I just simply take the columns/annotators of interest in the variant section results, and just match them directly with my original input variants ? This would suffice correct ? As chromosome, REF and ALT remain the same, and only the position and some downstream consequences might be affected, as already discussed ?
Thank you in advance,
Efstathios
@Jasonmbg There is an RData-format reporter for OpenCRAVAT. I have installed it on the run.opencravat.org server. You can generate and download RData files for your jobs at the server. All the annotation columns from OpenCRAVAT analysis are available as a dataframe with RData-format files. Please let me know if this format would work for your purpose.
Dear @rkimoakbioinformatics ,
thanks a million for your consideration and effort spent for this very helpful availability-at a first glance I checked an example and everything works, I will provide additional feedback or create an additional post for any extra questions related -I also saw that also there are more created VEST4 scores returned after the update you have mentioned :)
Thank you very much for your overall help and suggestions !!
@Jasonmbg Great. Just one thing with the RData file: you can get table-in-table data as follows.
>d<-get(load("input.vcf.variant.RData"))
>library(jsonlite)
>fromJSON(d[13, "cosmic__variant_count_tissue"])
[,1] [,2]
[1,] "tissue" "count"
[2,] "stomach" "3"
[3,] "salivary gland" "1"
Dear @rkimoakbioinformatics,
thank you for pointing this useful command-I suppose that this is also applicable for "similar" constructed columns, such as Cancer Types for Cancer Hotspots ? and possibly I suppose this is more helpful to create two subcolumns for better readability of these results;
Finally, two last "naive" comment-I noticed from some of the results:
1) in some variants, while the Ref base is G and the Alt is A, the cDNA change is mentioned as c.350C>T: this is something related with positive/negative strand direction ?
2) In the relative log file of most of the uploaded vcf files, I noticed the following message part like the following: 2021/04/21 10:11:27 cravat.converter Traceback (most recent call last): File "/usr/local/lib/python3.6/site-packages/cravat/cravat_convert.py", line 424, in run converter.addl_operation_for_unique_variant(wdict, no_unique_var) File "/usr/local/lib/python3.6/site-packages/cravat/modules/converters/vcf-converter/vcf-converter.py", line 297, in addl_operation_for_unique_variant info_desc = self._reader.infos[info_name] KeyError: 'ACGTNacgtnPLUS' 2021/04/21 10:11:28 cravat.converter error lines: 0
This KeyError illustrates a specific issue of importance ?
Best,
Efstathios
@Jasonmbg Yes, it'll be the same for other columns with table-like data. Can you explain a bit what you mean by two sub-columns?
Yes, if a transcript is on the minus strand and the input is on the plus strand, the bases will be changed.
The error was caused because INFO field had data for "ACGTNacgtnPLUS", but this INFO field key was not defined in the header of the input file. As a result, the whole INFO field was not processed properly. You'll see that "extra vcf info" fields are all empty. If you add something like below in the header of your input file, the result will probably have the INFO field extracted:
##INFO=<ID=ATCGNatcgnPLUS,Number=10,Type=Integer,Description="ATGCNatcgnPLUS">
##INFO=<ID=ATCGNatcgnMINUS,Number=10,Type=Integer,Description="ATGCNatcgnMINUS">
Dear @rkimoakbioinformatics,
please apologies for the late reply, sometimes different timezones are not flexible :)
For my question regarding the two sub-columns: in order to be more accurate, I would like from the specific columns like "cosmic__variant_count_tissue", and cancer_hotspots_samples to divide them into two or three sub-columns; for example, like in the column cosmic count tissue to have one column like cosmic_tissue and one additional cosmic_count, etc-but I would suppose this is a bit complicated, as one line might have different tissues with different variant counts and/or empty scores;
For the additional questions:
Best Regards,
Efstathios
Yes, OpenCRAVAT's output will still be in report files even if the INFO field was not processed successfully.
Regarding sub-columns in R, for now one way is to use jsonlite library, as shown in https://github.com/KarchinLab/open-cravat/issues/65#issuecomment-823598323.
Dear @rkimoakbioinformatics ,
sorry to return again regarding this specific post, but I noticed something weird:
when for the same patient (on your above example) I used the same commands:
fromJSON(d[13, "cosmic__variant_count_tissue"])
Error: lexical error: invalid string in json text.
tissue count 0 stomach
(right here) ------^
I noticed the above error;
A similar error happens also in another line that has various entries in the respective COSMIC column:
fromJSON(d[67, "cosmic__variant_count_tissue"])
Error: lexical error: invalid string in json text.
tissue count 0
(right here) ------^
You thing that is a package related version issue ? From my sessionInfo:
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] jsonlite_1.7.2
loaded via a namespace (and not attached):
[1] colorspace_2.0-0 ellipsis_0.3.1
[3] rprojroot_2.0.2 XVector_0.30.0
[5] GenomicRanges_1.42.0 fs_1.5.0
[7] remotes_2.2.0 ggrepel_0.9.1
[9] bit64_4.0.5 interactiveDisplayBase_1.28.0
[11] AnnotationDbi_1.52.0 fansi_0.4.2
[13] splines_4.0.3 cachem_1.0.4
[15] knitr_1.31 maftools_2.6.05
[17] pkgload_1.2.0 bcellViper_1.26.0
[19] annotate_1.68.0 dbplyr_2.1.0
[21] shiny_1.6.0 data.tree_1.0.0
[23] BiocManager_1.30.10 readr_1.4.0
[25] compiler_4.0.3 httr_1.4.2
[27] assertthat_0.2.1 Matrix_1.3-2
[29] fastmap_1.1.0 limma_3.46.0
[31] cli_2.3.1 later_1.1.0.1
[33] htmltools_0.5.1.1 prettyunits_1.1.1
[35] tools_4.0.3 gtable_0.3.0
[37] glue_1.4.2 GenomeInfoDbData_1.2.4
[39] dplyr_1.0.5 rappdirs_0.3.3
[41] tinytex_0.30 Rcpp_1.0.6
[43] limSolve_1.5.6 Biobase_2.50.0
[45] cellranger_1.1.0 vctrs_0.3.7
[47] RaggedExperiment_1.14.1 preprocessCore_1.52.1
[49] nlme_3.1-152 xfun_0.22
[51] stringr_1.4.0 ps_1.6.0
[53] testthat_3.0.2 mime_0.10
[55] lpSolve_5.6.15 lifecycle_1.0.0
[57] devtools_2.3.2 XML_4.0-0
[59] AnnotationHub_2.22.0 edgeR_3.32.1
[61] scales_1.1.1 MASS_7.3-53.1
[63] zlibbioc_1.36.0 vroom_1.4.0
[65] promises_1.2.0.1 hms_1.0.0
[67] MatrixGenerics_1.2.1 parallel_4.0.3
[69] SummarizedExperiment_1.20.0 RColorBrewer_1.1-2
[71] yaml_2.2.1 curl_4.3
[73] gridExtra_2.3 memoise_2.0.0
[75] ggplot2_3.3.3 TCGAmutations_0.3.0
[77] EPIC_1.1.5 dorothea_1.2.1
[79] stringi_1.5.3 RSQLite_2.2.3
[81] BiocVersion_3.12.0 immunedeconv_2.0.3
[83] genefilter_1.72.1 S4Vectors_0.28.1
[85] desc_1.3.0 BiocGenerics_0.36.0
[87] pkgbuild_1.2.0 BiocParallel_1.24.1
[89] testit_0.12 GenomeInfoDb_1.26.4
[91] rlang_0.4.10 pkgconfig_2.0.3
[93] matrixStats_0.58.0 bitops_1.0-6
[95] evaluate_0.14 lattice_0.20-41
[97] purrr_0.3.4 bit_4.0.4
[99] processx_3.4.5 tidyselect_1.1.0
[101] plyr_1.8.6 magrittr_2.0.1
[103] R6_2.5.0 IRanges_2.24.1
[105] generics_0.1.0 DelayedArray_0.16.2
[107] DBI_1.1.1 pillar_1.5.1
[109] withr_2.4.1 mgcv_1.8-34
[111] survival_3.2-7 RCurl_1.98-1.3
[113] tibble_3.1.0 crayon_1.4.1
[115] utf8_1.1.4 BiocFileCache_1.14.0
[117] rmarkdown_2.7 usethis_2.0.1
[119] locfit_1.5-9.4 grid_4.0.3
[121] readxl_1.3.1 sva_3.38.0
[123] data.table_1.14.0 blob_1.2.1
[125] callr_3.5.1 digest_0.6.27
[127] progeny_1.12.0 xtable_1.8-4
[129] tidyr_1.1.3 httpuv_1.5.5
[131] munsell_0.5.0 stats4_4.0.3
[133] sessioninfo_1.1.1 quadprog_1.5-8
The only thing I managed with another way and initial manipulation through tidyverse (not entirely correct):
d[13,"cosmic__variant_count_tissue"] %>% read_tsv() %>% separate(col = "tissue count",into = c("num", "tissue", "dunno"))
# A tibble: 2 x 3
num tissue dunno
<chr> <chr> <chr>
1 0 stomach 3
2 1 salivary gland
Warning message:
Expected 3 pieces. Additional pieces discarded in 1 rows [2].
Thank you in advance,
Efstathios
@Jasonmbg rdatareporter
has been updated to v1.2.0. It can be used as follows:
>oc report example_input.sqlite -t rdata
>R
>data <- get(load("example_input.variant.RData"))
>library(jsonlite)
>data[23, 18]
[1] "[{\"tissue\": \"liver\", \"count\": 3}, {\"tissue\": \"breast\", \"count\": 1}]"
>fromJSON(data[23, 18])
tissue count
1 liver 3
2 breast 1
>fromJSON(d[23, 18])["tissue"]
tissue
1 liver
2 breast
Let me know if this would work for your purpose.
Dear @rkimoakbioinformatics ,
thank you for your updated message and news !! unfortunately new issues appeared, I will try to recap as following:
1) Please excuse me for any naïve question, but before mentioning jsonlite in the above command chunk
oc report example_input.sqlite -t rdata
This is a python command ? I have also to run it ? because so far I have used only the openCRAVAT web server without running any local command line and/or other installation modules
2) For reproducibility, I downloaded for patient01-the same we have discussed in the aforementioned comments, and I used your exact commands but:
d[23, 18]
[1] 43032469
fromJSON(d[23, 18])
Error: Argument 'txt' must be a JSON string, URL or file.
So, by the above command it behaves like a classical data frame, and with d[23,18] it returns the 18th column-that is the hg19 chromosomal position for the 23rd variant
However, when I used like previously
fromJSON(d[13, "cosmic__variant_count_tissue"])
[,1] [,2]
[1,] "tissue" "count"
[2,] "stomach" "3"
[3,] "salivary gland" "1"
It seems that it works with the previous command line;
3) More concerning, is from curiosity I tried to create additional RData files for the rest of the patients that I have run already-however, when tried to create the RData file, the following error appeared:
Best regards,
Efstathios
@Jasonmbg Thanks for letting us know. Indeed I was assuming that you were using command-line. You don't have to run oc report ...
if you use the server. The server should work now. RData report generation option should show on your jobs. Please re-generate RData report files for your jobs. After RData report files have been downloaded and loaded in R, I guess that you already know these commands, but something like the below can be done:
>d<-get(load("~/Downloads/example_input.variant.RData"))
>rowno = 23
>cosmic = fromJSON(d[rowno, which(colnames(d) == "cosmic__variant_count_tissue")])
>cosmic
tissue count
1 liver 3
2 breast 2
>cosmic["tissue"]
tissue
1 liver
2 breast
>cosmic["count"]
count
1 3
2 2
Dear @rkimoakbioinformatics thank you also for your immediate feedback and updates :)
Yes downloading works properly, and I will repeat the downloading and creation of RData report files for all the analyzed patients-Hopefully anything runs smoothly, and I will provide updates based on a merged score I'm currently creating based on the overall openCRAVAT information;
Thank you one more time for your valuable help, support and time for my project !!
Best,
Efstathios
Dear openCRAVAT team,
initially congratulations for your very nice effort and toolkit for the integrative prioritization and annotation of diverse variants !! initially, regarding my post I would like to pinpoint two putative "issues" using openCRAVAT, which require further clarification; briefly, in a WES project concerning somatic point mutations for colorectal cancer, we have performed a pipeline to acquire vcf files annotated with ANNOVAR with gencode and hg19 reference genome, including as "functional" variants namely exonic non-synonymous, stop-gain, stop-loss and splice-site variants. After uploading a test vcf from one patient, and using specific annotators in the webserver version of openCRAVAT, based on the results I noticed the following discrepancies:
1) While in the original vcf file all the variants where classified as exonic and/or splicing, I noticed that in the results of openCRAVAT, around 3 variants have an empty value in the Coding column; for example, in UID 44 and gene C8orf34, is classified with openCRAVAT as an intron whereas in the original vcf as exonic: this might be explained due to the normalization, and the liftover process from hg19 to hg38 ? and due to the putative change of coordinates, exon-intron boundaries might slight differ in specific variants and based on sequence ontology might be changes ?
2) My second important question concerns the coverage and uniformity of variant effect predictor scores, such as VEST4-before using openCRAVAT, I have used the latest version of ANNOVAR with dbnsfp41a-I noticed, that while with ANNOVAR only around 3 of the 116 variants were missing VEST4 numeric score values, in the relative results of openCRAVAT, around 78 variants out of 116 had VEST4 scores-additionally, there were slight variations in the scores for the same variants, but I suppose this might be of different versions or types of exact scores used ? Do you have in mind any putative issues that might explain this difference ? For example, older versions or normalized versus "raw" scores ?
3) Finally, regarding the liftover process and the output results when clicking to include hg19 coordinates-if I have understood well, chrom, REF and ALT bases remain the same, and essentially the position might vary after the liftover process, and that is why columns Q and R are reported ? and of course this might affect the variant classification, sequence ontology and relative protein changes ? but anyhow I can safely match any information from my original vcf to the results, in order to match the input vcf variants with the returned ones ?
For further utilization, I have attached in the following zip folder the 3 respective files:
Whereas the file "test.patient01.9columns.vcf" is related to the original vcf input file, the "ANNOVAR.ACCC.merged.score.hg19_multianno" is the output of scoring process with ANNOVAR and the file with name "test.patient01.9columns.vcf" refers to the output of openCRAVAT-
VCF_cut.zip
With Kind Regards,
Efstathios