Jasonmbg closed this issue 3 years ago
@Jasonmbg Thanks. I'll take a look.
@Jasonmbg I checked some of your variants for which ANNOVAR and OpenCRAVAT produced different results. It seems that there were two reasons for the difference:
First, ANNOVAR and OpenCRAVAT choose the "representative" transcript for each gene differently. OpenCRAVAT uses MANE transcripts as the default set of representative transcripts (https://www.ncbi.nlm.nih.gov/refseq/MANE/). MANE does not cover all genes, so when it defines no representative transcript for a gene, OpenCRAVAT falls back on hierarchical criteria: first the transcript with the most severe sequence ontology, then, if the sequence ontology is the same, the longest transcript.
Second, many variants called "intron variant" by OpenCRAVAT and "frameshift" by ANNOVAR were frameshift variants in transcripts that GENCODE tags as NMD (nonsense-mediated decay) transcripts (see "nonsense_mediated_decay" at https://www.gencodegenes.org/pages/biotypes.html). NMD transcripts are supposed to be rapidly degraded in cells, so in my opinion they may not be the best choice for representative transcripts.
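To make the selection order above concrete, here is a minimal sketch. It is not OpenCRAVAT's actual code: the `Mapping` class, the `pick_representative` function, and the short severity list are all hypothetical names and values chosen for illustration, and the real severity ranking used by OpenCRAVAT may differ.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative ranking, most severe first. This is an assumption for the
# example, not the table OpenCRAVAT uses internally.
SEVERITY = [
    "frameshift_variant",
    "stop_gained",
    "missense_variant",
    "synonymous_variant",
    "intron_variant",
]
RANK = {so: i for i, so in enumerate(SEVERITY)}


@dataclass
class Mapping:
    """One transcript-level consequence of a variant (hypothetical structure)."""
    transcript: str
    so: str            # sequence ontology term
    length: int        # transcript length
    is_mane: bool = False


def pick_representative(mappings: List[Mapping]) -> Optional[Mapping]:
    """Prefer a MANE transcript; otherwise most severe SO, then longest."""
    if not mappings:
        return None
    mane = [m for m in mappings if m.is_mane]
    if mane:
        return mane[0]
    # Lower rank means more severe; unknown terms sort last.
    # Negated length makes the longest transcript win ties.
    return min(mappings, key=lambda m: (RANK.get(m.so, len(SEVERITY)), -m.length))
```

With these assumptions, a gene whose MANE transcript sees the variant as intronic is reported as "intron_variant" even if a non-MANE transcript of the same gene carries a frameshift, which is the pattern seen in the compared results.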
Considering these two points, OpenCRAVAT's decisions do not look "wrong". However, if you want to ignore MANE and get the most severe sequence ontology across all the transcripts that a variant maps to, this is currently not automatic, and you will have to parse the "All mappings" column. We could consider adding an option that tells OpenCRAVAT not to use MANE. If you need this feature, please let us know here.
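As a rough sketch of what parsing the "All mappings" column could look like: the example below assumes each cell is a JSON object mapping a gene name to a list of entries with `transcript` and `so` keys. That layout is an assumption made for illustration, so please check it against your own output before using anything like this. The severity list and the `most_severe_so` function are hypothetical as well.

```python
import json
from typing import Optional

# Illustrative ranking, most severe first (an assumption, not OpenCRAVAT's table).
SEVERITY_ORDER = [
    "frameshift_variant",
    "stop_gained",
    "missense_variant",
    "synonymous_variant",
    "intron_variant",
]
SO_RANK = {so: i for i, so in enumerate(SEVERITY_ORDER)}


def most_severe_so(all_mappings_cell: str) -> Optional[str]:
    """Return the most severe SO term over every transcript in one cell.

    Assumed cell layout (hypothetical):
    {"GENE": [{"transcript": "...", "so": "..."}, ...], ...}
    """
    by_gene = json.loads(all_mappings_cell)
    terms = [entry["so"] for entries in by_gene.values() for entry in entries]
    if not terms:
        return None
    # Lower rank means more severe; unknown terms sort last.
    return min(terms, key=lambda so: SO_RANK.get(so, len(SEVERITY_ORDER)))
```

Applied row by row to an exported result table, this would give a "most severe across all transcripts" call that can be compared with the ANNOVAR classification, independent of which transcript was chosen as representative.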
Dear @rkimoakbioinformatics,
thank you very much for your feedback and comprehensive answer, and for this exciting discussion. I will discuss it with my collaborators as soon as possible, but from your explanation both points seem solid: using MANE for an accurate transcript representation, and the reasoning about the intron variants. I will return with further comments, but at first glance having a representative transcript solves many issues and helps with interpretability.
Thank you in advance,
Efstathios
Dear openCRAVAT team,
Following our previous discussion of the OpenCRAVAT pathogenicity and cancer annotators, based on 33 CRC VCF files uploaded to the web server (https://github.com/KarchinLab/open-cravat/issues/65), I would like to point out some discrepancies in the classification of indel variants, especially relative to ANNOVAR:
In detail, when uploading an hg19-based VCF file containing only indel variants for a test patient, I noticed many differences in the functional classification of variants between ANNOVAR (gencode hg19) and the openCRAVAT hg38 results obtained after the liftover process. The deviation mostly concerns frameshift variants that openCRAVAT classifies as intron variants.
In your opinion and expertise, could this deviation be explained by the fact that indel discovery is generally prone to errors, by the hg38 liftover process, or by the different "annotation procedures"? And should this discrepancy worry me?
For simplicity, I have attached an Excel file that compares the sequence ontology variant classification from openCRAVAT with the exonic classification already produced by our DKFZ pipeline based on gencode/ANNOVAR (https://github.com/DKFZ-ODCF/SNVCallingWorkflow), together with the initial annotated VCF file and the openCRAVAT results:
Mutations.FunctionalClassification.SNPs.Indels.xlsx
indel_OE0232_CRC_ACCC_09_somatic_functional_indels_conf_8_to_10.zip
ACCC_09.selected.Indels.test.vcf.xlsx
Thank you in advance,
Efstathios