Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

From which COSMIC download file do the "Existing_variation" values come? #1691

Open MGCarta opened 3 months ago

MGCarta commented 3 months ago

Describe the issue

Hi, this is not a specific problem with the VEP tool but more a question about how it works. I have a variant in PTEN with HGVSc NM_000314.8:c.634+5G>C which, when annotated with the VEP command-line tool, gets the following "Existing_variation" value:

rs138336847&COSV64291709&COSV64294394&COSV64310386&CS010097&CS991491
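As a reading aid, the value above can be split on `&` (the separator VEP uses for multiple values inside one VCF annotation field) and grouped by identifier prefix. This is a minimal sketch: the prefix-to-source table is an assumption based on common ID conventions (`rs` for dbSNP, `COSV` for COSMIC, `CS`/`CM` for HGMD-PUBLIC), and `group_existing_variation` is a hypothetical helper, not part of VEP.

```python
from collections import defaultdict

# Assumed prefix -> source mapping (ID naming conventions, not a VEP lookup).
PREFIX_TO_SOURCE = (
    ("rs", "dbSNP"),
    ("COSV", "COSMIC"),
    ("CS", "HGMD-PUBLIC"),
    ("CM", "HGMD-PUBLIC"),
)


def group_existing_variation(value, sep="&"):
    """Split an Existing_variation value and group IDs by guessed source."""
    groups = defaultdict(list)
    for identifier in filter(None, value.split(sep)):
        source = next(
            (name for prefix, name in PREFIX_TO_SOURCE
             if identifier.startswith(prefix)),
            "other",  # anything with an unrecognised prefix
        )
        groups[source].append(identifier)
    return dict(groups)


value = "rs138336847&COSV64291709&COSV64294394&COSV64310386&CS010097&CS991491"
grouped = group_existing_variation(value)
for source, ids in grouped.items():
    print(source, ids)
```

Grouped this way, the value contains one dbSNP identifier, three COSMIC identifiers and two HGMD-style identifiers.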

If I search for the variant on the website, it shows up nicely; however, I do not understand which COSMIC download file VEP uses to retrieve these values. I can see from the header of my VEP-annotated VCF file that the COSMIC version used is v98. I downloaded the Cosmic_NonCodingVariants_Tsv_v98_GRCh38.tar file from the COSMIC website (because I assume that this is a non-coding variant and should be listed there), but the variant is not reported there under any of the identifiers in the "Existing_variation" value from VEP.
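For anyone repeating this check, one way to search a downloaded archive is a plain substring scan that does not depend on the file's column layout. The sketch below assumes the archive is a tar file whose members are text files, possibly gzip-compressed; `find_ids_in_tar` is a hypothetical helper and the layout of the COSMIC download is not taken from this thread.

```python
import gzip
import tarfile


def find_ids_in_tar(tar_path, identifiers):
    """Return {identifier: [(member_name, line_number), ...]} for every hit.

    Matching is a raw substring test, so "COSV1" would also match a line
    containing "COSV12"; treat hits as candidates to inspect by eye.
    """
    needles = [(ident, ident.encode()) for ident in identifiers]
    hits = {ident: [] for ident in identifiers}
    with tarfile.open(tar_path) as archive:
        for member in archive:
            if not member.isfile():
                continue
            raw = archive.extractfile(member)
            # Transparently decompress members such as "*.tsv.gz".
            if member.name.endswith(".gz"):
                stream = gzip.GzipFile(fileobj=raw)
            else:
                stream = raw
            for line_number, line in enumerate(stream, start=1):
                for ident, needle in needles:
                    if needle in line:
                        hits[ident].append((member.name, line_number))
    return hits


# Example call (the path is wherever the archive was saved):
# find_ids_in_tar("Cosmic_NonCodingVariants_Tsv_v98_GRCh38.tar",
#                 ["COSV64291709", "COSV64294394", "COSV64310386"])
```

An empty list for every identifier would reproduce the observation above that none of the IDs appear in that file.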

Additional information


System

Full VEP command line

vep --af --af_1kg --af_gnomade --af_gnomadg --assembly GRCh38 --cache --canonical --database 0 --dir [PATH]/.vep --domains --fasta [PATH]/Homo_sapiens.GRCh38.dna.toplevel.fa.gz --force_overwrite --fork 4 --hgvs --hgvsg --hgvsg_use_accession --input_file [PATH]/benchmark_table_union.txt --mane --no_intergenic --numbers --offline --output_file [PATH]/benchmark_table_union_annotated.vcf --plugin [PATH]/spliceai_scores.raw.indel.hg38.vcf.gz,cutoff=0.5 --pubmed --refseq --symbol --vcf

Full error message

No error message

Data files (if applicable)

No data files

nuno-agostinho commented 3 months ago

Hi @MGCarta,

Existing_variation is populated by the --check_existing flag to identify known co-located variants. VEP by default uses a normalisation-based allele matching algorithm to identify known variants that match input variants.

However, for some data sources (COSMIC, HGMD), Ensembl is not licensed to redistribute allele-specific data, so VEP will report the existence of co-located variants with unknown alleles without carrying out allele matching. In order to disable this behaviour and exclude these variants, you can use the --exclude_null_alleles flag.
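The difference between allele-matched and null-allele sources can be illustrated with a toy model. This is only a sketch of the behaviour described above, not VEP's implementation: real VEP normalises alleles before matching, whereas this example compares the alternate allele directly. All identifiers, positions and the `colocated_ids` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class KnownVariant:
    identifier: str
    pos: int
    # None models a source (e.g. COSMIC, HGMD) whose alleles are not distributed.
    alleles: Optional[Tuple[str, ...]]


def colocated_ids(pos: int, alt: str, known: Sequence[KnownVariant],
                  exclude_null_alleles: bool = False) -> List[str]:
    """Toy version of co-located variant reporting for one input variant."""
    ids = []
    for variant in known:
        if variant.pos != pos:
            continue  # not co-located
        if variant.alleles is None:
            # Unknown alleles: reported on position alone unless excluded.
            if not exclude_null_alleles:
                ids.append(variant.identifier)
        elif alt in variant.alleles:
            # Known alleles: reported only when the input allele matches.
            ids.append(variant.identifier)
    return ids


# Hypothetical known variants at a made-up position.
KNOWN = [
    KnownVariant("rs_demo_1", 100, ("C", "A")),
    KnownVariant("rs_demo_2", 100, ("C", "G")),
    KnownVariant("COSV_demo", 100, None),
    KnownVariant("CS_demo", 101, None),
]

default_ids = colocated_ids(100, "A", KNOWN)
filtered_ids = colocated_ids(100, "A", KNOWN, exclude_null_alleles=True)
print(default_ids)
print(filtered_ids)
```

In this model an input C>A at position 100 picks up the allele-matched `rs_demo_1` plus the null-allele `COSV_demo`, whatever the real COSMIC allele is. With `exclude_null_alleles=True` only `rs_demo_1` remains, because every null-allele entry is dropped rather than filtered by allele.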

Please refer to our public documentation: Existing or colocated variants.

I just want to add that the data we use is provided directly by the COSMIC team, so it may not contain exactly the same information as the files you pointed to.

I will try to get in contact with the COSMIC team to understand why those identifiers are not available in COSMIC release v98. Hope this was helpful for now.

Kind regards, Nuno

MGCarta commented 3 months ago

Hi @nuno-agostinho, and thank you very much for your explanation.

  1. If I understand correctly, when the variant given as input to VEP is known, Existing_variation can contain co-located variants that VEP selected by both allele and genomic coordinate.
  2. For some databases, such as COSMIC, co-located variants are selected by coordinate only, not by allele. Is this right?
  3. Therefore, if the variant input to VEP is C>A, is it possible that Existing_variation contains COSMIC entries at the same genomic position that could be C>A as well as C>G?
  4. If I want to exclude COSMIC entries with a different allele, I have to combine the --check_existing and --exclude_null_alleles parameters. But does that mean I won't get any COSMIC entries at all?

Best, Giulia