Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
453 stars 151 forks source link

VEP v110 minimises alleles #1565

Closed wuttke closed 1 month ago

wuttke commented 11 months ago

We do not have the --minimal flag on the command line. VEP v110 minimises the alleles in our output (we use --tab). VEP v109 does not exhibit this (unpleasant) behaviour.

Our command line: /data/programs/bin/ngs/VEP/109/vep \ --cache --dir /data/programs/bin/ngs/VEP/cache \ --input_file $1 \ --output_file $2 \ --tab \ --force_overwrite \ --canonical \ --mane \ --verbose \ --symbol \ --af_gnomad \ --offline \ --assembly GRCh38 ...

Our input: 1 930224 930224 A/ATTCTT 1:930224:A:ATTCTT

Excerpt from v109 output: 1_930224_A/ATTCTT 1:930224 ATTCTT ENSG00000187634 ENST00000616016 Transcript frameshift_variant

Excerpt from v110 output: 1930225-/TTCTT 1:930224-930225 TTCTT ENSG00000187634

nuno-agostinho commented 11 months ago

Hi @wuttke,

The excerpt outputs are reversed (109 is the first and 110 is the last), but I get the same results as you do. I am going to check what is going on.

Thanks for opening this issue.

Best, Nuno

wuttke commented 11 months ago

True, I've edited the post to clarify. Thanks for your prompt response!

nuno-agostinho commented 11 months ago

Hi @wuttke,

Our team decided to update the default representation of variants to be minimised for indels by default, as the results after minimisation are the most accurate. The original allele from the input can still be accessed with --uploaded_allele.

We are sorry that this change is not documented. We will update our docs to reflect this.

Thanks, Nuno

dvg-p4 commented 5 months ago

Is there any way to override this default behavior and return to the previous behavior of not minimizing indels? This change completely breaks my workflow.

dvg-p4 commented 5 months ago

--uploaded_allele is not sufficient, since it does not account for the changed coordinates.

nuno-agostinho commented 5 months ago

Hi @dvg-p4,

--uploaded_allele is not sufficient, since it does not account for the changed coordinates.

Can you show an example of what you mean by this? How does this affect your workflow?

Best, Nuno

dvg-p4 commented 5 months ago

Hi Nuno,

I'm trying to run VEP on a sets of ~20-30 million variants, with --tab output. Our project consistently uses the VCF standard notation for indels, with a one-allele REF for insertions and a one-allele ALT for deletions, e.g. 1:69433:A:AGAT.

It turns out that for this particular project we actually want to run VEP 109 for other reasons, so this isn't actually a blocker for meeting my current deadline--I can flesh out an example run to illustrate the problem next week.

Best, Dan

dvg-p4 commented 4 months ago

So, my input to vep has indels in "vcf-style", like such (short_vep_input.tsv):

1       1220770 1220772 GAC/G   +
1       1220772 1220772 C/T     +
1       1220794 1220794 G/GCGGGCA       +
1       1223144 1223146 ACT/A   +
1       1223149 1223149 T/A     +
1       1223153 1223153 C/T     +
1       1223154 1223154 G/GAC   +
1       1223154 1223156 GAC/G   +
1       1223182 1223184 AAC/A   +

This is the "canonical" form for variants in our database.

If I run vep 110+ with --tab:

~/ensembl-vep/vep \
--input_file ~/vep_test/short_vep_input.tsv \
--format ensembl \
--no_stats \
--verbose \
--cache \
--offline \
--dir ~/.vep \
--assembly GRCh38 \
--show_ref_allele \
--uploaded_allele \
--output_file ~/vep_test/test_output.tsv \
--tab

I get output like such (test_output.tsv):

## ENSEMBL VARIANT EFFECT PREDICTOR v112.0
[...]
#Uploaded_variation Location    Allele  Gene    Feature Feature_type    Consequence cDNA_position   CDS_position    Protein_position    Amino_acids Codons  Existing_variation  REF_ALLELE  UPLOADED_ALLELE IMPACT  DISTANCE    STRAND  FLAGS
1_1220771_AC/-  1:1220771-1220772   -   ENSG00000078808 ENST00000263741 Transcript  intron_variant  -   -   -   -   -   -   AC  GAC/G   MODIFIER    -   -1  -
1_1220771_AC/-  1:1220771-1220772   -   ENSG00000078808 ENST00000360001 Transcript  intron_variant  -   -   -   -   -   -   AC  GAC/G   MODIFIER    -   -1  -
1_1220771_AC/-  1:1220771-1220772   -   ENSG00000078808 ENST00000403997 Transcript  intron_variant  -   -   -   -   -   -   AC  GAC/G   MODIFIER    -   -1  cds_start_NF,cds_end_NF
1_1220771_AC/-  1:1220771-1220772   -   ENSG00000078808 ENST00000465727 Transcript  intron_variant,NMD_transcript_variant   -   -   -   -   -   -   AC  GAC/G   MODIFIER    -   -1  -
1_1220771_AC/-  1:1220771-1220772   -   ENSG00000078808 ENST00000478938 Transcript  upstream_gene_variant   -   -   -   -   -   -   AC  GAC/G   MODIFIER    478 -1  -
1_1220771_AC/-  1:1220771-1220772   -   ENSG00000078808 ENST00000494748 Transcript  non_coding_transcript_exon_variant  580-581 -   -   -   -   -   AC  GAC/G   MODIFIER    -   -1  -
1_1220772_C/T   1:1220772   T   ENSG00000078808 ENST00000263741 Transcript  intron_variant  -   -   -   -   -   -   C   C/T MODIFIER    -   -1  -
1_1220772_C/T   1:1220772   T   ENSG00000078808 ENST00000360001 Transcript  intron_variant  -   -   -   -   -   -   C   C/T MODIFIER    -   -1  -
1_1220772_C/T   1:1220772   T   ENSG00000078808 ENST00000403997 Transcript  intron_variant  -   -   -   -   -   -   C   C/T MODIFIER    -   -1  cds_start_NF,cds_end_NF
1_1220772_C/T   1:1220772   T   ENSG00000078808 ENST00000465727 Transcript  intron_variant,NMD_transcript_variant   -   -   -   -   -   -   C   C/T MODIFIER    -   -1  -
1_1220772_C/T   1:1220772   T   ENSG00000078808 ENST00000478938 Transcript  upstream_gene_variant   -   -   -   -   -   -   C   C/T MODIFIER    479 -1  -
1_1220772_C/T   1:1220772   T   ENSG00000078808 ENST00000494748 Transcript  non_coding_transcript_exon_variant  580 -   -   -   -   -   C   C/T MODIFIER    -   -1  -
1_1220795_-/CGGGCA  1:1220794-1220795   CGGGCA  ENSG00000078808 ENST00000263741 Transcript  intron_variant  -   -   -   -   -   -   -   G/GCGGGCA   MODIFIER    -   -1  -
1_1220795_-/CGGGCA  1:1220794-1220795   CGGGCA  ENSG00000078808 ENST00000360001 Transcript  intron_variant  -   -   -   -   -   -   -   G/GCGGGCA   MODIFIER    -   -1  -
1_1220795_-/CGGGCA  1:1220794-1220795   CGGGCA  ENSG00000078808 ENST00000403997 Transcript  intron_variant  -   -   -   -   -   -   -   G/GCGGGCA   MODIFIER    -   -1  cds_start_NF,cds_end_NF
1_1220795_-/CGGGCA  1:1220794-1220795   CGGGCA  ENSG00000078808 ENST00000465727 Transcript  intron_variant,NMD_transcript_variant   -   -   -   -   -   -   -   G/GCGGGCA   MODIFIER    -   -1  -
1_1220795_-/CGGGCA  1:1220794-1220795   CGGGCA  ENSG00000078808 ENST00000478938 Transcript  upstream_gene_variant   -   -   -   -   -   -   -   G/GCGGGCA   MODIFIER    501 -1  -
1_1220795_-/CGGGCA  1:1220794-1220795   CGGGCA  ENSG00000078808 ENST00000494748 Transcript  non_coding_transcript_exon_variant  557-558 -   -   -   -   -   -   G/GCGGGCA   MODIFIER    -   -1  -
1_1223145_CT/-  1:1223145-1223146   -   ENSG00000078808 ENST00000263741 Transcript  intron_variant  -   -   -   -   -   -   CT  ACT/A   MODIFIER    -   -1  -
[...]

There are many advantages to this output format--it works pretty seamlessly with awk, cut, column -t, R's data.table::fread(), etc. However, note the minimization. The "UPLOADED_ALLELE" column preserves the original ref/alt that I uploaded; but not the original coordinates. (It would also be nice to have separate chrom/pos/ref/alt columns, so as to not have to regex out the chromosome from the position.)

One alternative is to run vep with --vcf output:

~/ensembl-vep/vep \
--input_file ~/vep_test/short_vep_input.tsv \
--format ensembl \
--no_stats \
--verbose \
--cache \
--offline \
--dir ~/.vep \
--assembly GRCh38 \
--show_ref_allele \
--uploaded_allele \
--output_file ~/vep_test/test_output.vcf \
--vcf

However, as per the VCF standard, this condenses all output to one line per variant (test_output.vcf):

##fileformat=VCFv4.1
##VEP="v112.0" API="v112" time="2024-06-03 17:06:46" cache="/home/dgealow/.vep/homo_sapiens/112_GRCh38" ensembl=112.3add379 ensembl-funcgen=112.be19ffa ensembl-io=112.2851b6f ensembl-variation=112.4113356 1000genomes="phase3" COSMIC="98" ClinVar="202310" HGMD-PUBLIC="20204" assembly="GRCh38.p14" dbSNP="156" gencode="GENCODE 46" genebuild="2014-07" gnomADe="r2.1.1" gnomADg="v3.1.2" polyphen="2.2.3" regbuild="1.0" sift="6.2.1"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|REF_ALLELE|UPLOADED_ALLELE|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID">
##VEP-command-line='vep --assembly GRCh38 --cache --database 0 --format ensembl --input_file [PATH]/short_vep_input.tsv --no_stats --offline --output_file [PATH]/test_output.vcf --show_ref_allele --uploaded_allele --vcf --verbose'
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   1220770 1_1220771_AC/-  GAC G   .   .   CSQ=-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000263741|protein_coding||4/6|||||||||AC|GAC/G||-1||HGNC|HGNC:24188,-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000360001|protein_coding||4/6|||||||||AC|GAC/G||-1||HGNC|HGNC:24188,-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000403997|protein_coding||3/4|||||||||AC|GAC/G||-1|cds_start_NF&cds_end_NF|HGNC|HGNC:24188,-|intron_variant&NMD_transcript_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000465727|nonsense_mediated_decay||4/6|||||||||AC|GAC/G||-1||HGNC|HGNC:24188,-|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000478938|retained_intron|||||||||||AC|GAC/G|478|-1||HGNC|HGNC:24188,-|non_coding_transcript_exon_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000494748|retained_intron|1/3||||580-581||||||AC|GAC/G||-1||HGNC|HGNC:24188
1   1220772 1_1220772_C/T   C   T   .   .   CSQ=T|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000263741|protein_coding||4/6|||||||||C|C/T||-1||HGNC|HGNC:24188,T|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000360001|protein_coding||4/6|||||||||C|C/T||-1||HGNC|HGNC:24188,T|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000403997|protein_coding||3/4|||||||||C|C/T||-1|cds_start_NF&cds_end_NF|HGNC|HGNC:24188,T|intron_variant&NMD_transcript_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000465727|nonsense_mediated_decay||4/6|||||||||C|C/T||-1||HGNC|HGNC:24188,T|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000478938|retained_intron|||||||||||C|C/T|479|-1||HGNC|HGNC:24188,T|non_coding_transcript_exon_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000494748|retained_intron|1/3||||580||||||C|C/T||-1||HGNC|HGNC:24188
1   1220794 1_1220795_-/CGGGCA  G   GCGGGCA .   .   CSQ=CGGGCA|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000263741|protein_coding||4/6||||||||||G/GCGGGCA||-1||HGNC|HGNC:24188,CGGGCA|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000360001|protein_coding||4/6||||||||||G/GCGGGCA||-1||HGNC|HGNC:24188,CGGGCA|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000403997|protein_coding||3/4||||||||||G/GCGGGCA||-1|cds_start_NF&cds_end_NF|HGNC|HGNC:24188,CGGGCA|intron_variant&NMD_transcript_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000465727|nonsense_mediated_decay||4/6||||||||||G/GCGGGCA||-1||HGNC|HGNC:24188,CGGGCA|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000478938|retained_intron||||||||||||G/GCGGGCA|501|-1||HGNC|HGNC:24188,CGGGCA|non_coding_transcript_exon_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000494748|retained_intron|1/3||||557-558|||||||G/GCGGGCA||-1||HGNC|HGNC:24188
1   1223144 1_1223145_CT/-  ACT A   .   .   CSQ=-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000263741|protein_coding||4/6|||||||||CT|ACT/A||-1||HGNC|HGNC:24188,-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000360001|protein_coding||4/6|||||||||CT|ACT/A||-1||HGNC|HGNC:24188,-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000403997|protein_coding||3/4|||||||||CT|ACT/A||-1|cds_start_NF&cds_end_NF|HGNC|HGNC:24188,-|downstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000459994|protein_coding_CDS_not_defined|||||||||||CT|ACT/A|4126|-1||HGNC|HGNC:24188,-|intron_variant&NMD_transcript_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000465727|nonsense_mediated_decay||4/6|||||||||CT|ACT/A||-1||HGNC|HGNC:24188,-|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000478938|retained_intron|||||||||||CT|ACT/A|2852|-1||HGNC|HGNC:24188,-|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000494748|retained_intron|||||||||||CT|ACT/A|1794|-1||HGNC|HGNC:24188
[...]

This has the variant identification scheme that we use, but the comma/pipe/ampersand-delimited amalgam of an INFO field is quite annoying to parse.

What would be most convenient for us would be an option to output in --tsv format, but with vcf-style CHROM/POS/REF/ALT columns. (Perhaps --include_vcf_id_cols?) An "UPLOADED_COORDINATES" column (paralleling "UPLOADED_ALLELE") would also work well.

(And suggestions for workarounds I haven't thought of here would also be much appreciated!)

dvg-p4 commented 4 months ago

Ah, here's a good workaround: just pass chrom:pos:ref:alt-style variant identifiers (or whatever style is meaningful to you) as the final column of the input (https://useast.ensembl.org/info/docs/tools/vep/vep_formats.html#default)

nuno-agostinho commented 1 month ago

Hi @dvg-p4,

As you say, customising your variant identifiers in a VCF file is the best workaround to uniquely identify variants (regardless of using VEP or not).

I will now close this issue, but feel free to open new issues in case you face any other problems.

Best regards, Nuno