Closed wuttke closed 1 month ago
Hi @wuttke,
The excerpt outputs are reversed (109 is the first and 110 is the last), but I get the same results as you do. I am going to check what is going on.
Thanks for opening this issue.
Best, Nuno
True, I've edited the post to clarify. Thanks for your prompt response!
Hi @wuttke,
Our team decided to update the default representation of variants to be minimised for indels by default, as the results after minimisation are the most accurate. The original allele from the input can still be accessed with --uploaded_allele
.
We are sorry that this change is not documented. We will update our docs to reflect this.
Thanks, Nuno
Is there any way to override this default behavior and return to the previous behavior of not minimizing indels? This change completely breaks my workflow.
--uploaded_allele
is not sufficient, since it does not account for the changed coordinates.
Hi @dvg-p4,
--uploaded_allele
is not sufficient, since it does not account for the changed coordinates.
Can you show an example of what you mean by this? How does this affect your workflow?
Best, Nuno
Hi Nuno,
I'm trying to run VEP on a sets of ~20-30 million variants, with --tab
output. Our project consistently uses the VCF standard notation for indels, with a one-allele REF for insertions and a one-allele ALT for deletions, e.g. 1:69433:A:AGAT
.
It turns out that for this particular project we actually want to run VEP 109 for other reasons, so this isn't actually a blocker for meeting my current deadline--I can flesh out an example run to illustrate the problem next week.
Best, Dan
So, my input to vep has indels in "vcf-style", like such (short_vep_input.tsv
):
1 1220770 1220772 GAC/G +
1 1220772 1220772 C/T +
1 1220794 1220794 G/GCGGGCA +
1 1223144 1223146 ACT/A +
1 1223149 1223149 T/A +
1 1223153 1223153 C/T +
1 1223154 1223154 G/GAC +
1 1223154 1223156 GAC/G +
1 1223182 1223184 AAC/A +
This is the "canonical" form for variants in our database.
If I run vep 110+ with --tab
:
~/ensembl-vep/vep \
--input_file ~/vep_test/short_vep_input.tsv \
--format ensembl \
--no_stats \
--verbose \
--cache \
--offline \
--dir ~/.vep \
--assembly GRCh38 \
--show_ref_allele \
--uploaded_allele \
--output_file ~/vep_test/test_output.tsv \
--tab
I get output like such (test_output.tsv
):
## ENSEMBL VARIANT EFFECT PREDICTOR v112.0
[...]
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation REF_ALLELE UPLOADED_ALLELE IMPACT DISTANCE STRAND FLAGS
1_1220771_AC/- 1:1220771-1220772 - ENSG00000078808 ENST00000263741 Transcript intron_variant - - - - - - AC GAC/G MODIFIER - -1 -
1_1220771_AC/- 1:1220771-1220772 - ENSG00000078808 ENST00000360001 Transcript intron_variant - - - - - - AC GAC/G MODIFIER - -1 -
1_1220771_AC/- 1:1220771-1220772 - ENSG00000078808 ENST00000403997 Transcript intron_variant - - - - - - AC GAC/G MODIFIER - -1 cds_start_NF,cds_end_NF
1_1220771_AC/- 1:1220771-1220772 - ENSG00000078808 ENST00000465727 Transcript intron_variant,NMD_transcript_variant - - - - - - AC GAC/G MODIFIER - -1 -
1_1220771_AC/- 1:1220771-1220772 - ENSG00000078808 ENST00000478938 Transcript upstream_gene_variant - - - - - - AC GAC/G MODIFIER 478 -1 -
1_1220771_AC/- 1:1220771-1220772 - ENSG00000078808 ENST00000494748 Transcript non_coding_transcript_exon_variant 580-581 - - - - - AC GAC/G MODIFIER - -1 -
1_1220772_C/T 1:1220772 T ENSG00000078808 ENST00000263741 Transcript intron_variant - - - - - - C C/T MODIFIER - -1 -
1_1220772_C/T 1:1220772 T ENSG00000078808 ENST00000360001 Transcript intron_variant - - - - - - C C/T MODIFIER - -1 -
1_1220772_C/T 1:1220772 T ENSG00000078808 ENST00000403997 Transcript intron_variant - - - - - - C C/T MODIFIER - -1 cds_start_NF,cds_end_NF
1_1220772_C/T 1:1220772 T ENSG00000078808 ENST00000465727 Transcript intron_variant,NMD_transcript_variant - - - - - - C C/T MODIFIER - -1 -
1_1220772_C/T 1:1220772 T ENSG00000078808 ENST00000478938 Transcript upstream_gene_variant - - - - - - C C/T MODIFIER 479 -1 -
1_1220772_C/T 1:1220772 T ENSG00000078808 ENST00000494748 Transcript non_coding_transcript_exon_variant 580 - - - - - C C/T MODIFIER - -1 -
1_1220795_-/CGGGCA 1:1220794-1220795 CGGGCA ENSG00000078808 ENST00000263741 Transcript intron_variant - - - - - - - G/GCGGGCA MODIFIER - -1 -
1_1220795_-/CGGGCA 1:1220794-1220795 CGGGCA ENSG00000078808 ENST00000360001 Transcript intron_variant - - - - - - - G/GCGGGCA MODIFIER - -1 -
1_1220795_-/CGGGCA 1:1220794-1220795 CGGGCA ENSG00000078808 ENST00000403997 Transcript intron_variant - - - - - - - G/GCGGGCA MODIFIER - -1 cds_start_NF,cds_end_NF
1_1220795_-/CGGGCA 1:1220794-1220795 CGGGCA ENSG00000078808 ENST00000465727 Transcript intron_variant,NMD_transcript_variant - - - - - - - G/GCGGGCA MODIFIER - -1 -
1_1220795_-/CGGGCA 1:1220794-1220795 CGGGCA ENSG00000078808 ENST00000478938 Transcript upstream_gene_variant - - - - - - - G/GCGGGCA MODIFIER 501 -1 -
1_1220795_-/CGGGCA 1:1220794-1220795 CGGGCA ENSG00000078808 ENST00000494748 Transcript non_coding_transcript_exon_variant 557-558 - - - - - - G/GCGGGCA MODIFIER - -1 -
1_1223145_CT/- 1:1223145-1223146 - ENSG00000078808 ENST00000263741 Transcript intron_variant - - - - - - CT ACT/A MODIFIER - -1 -
[...]
There are many advantages to this output format--it works pretty seamlessly with awk
, cut
, column -t
, R's data.table::fread()
, etc. However, note the minimization. The "UPLOADED_ALLELE" column preserves the original ref/alt that I uploaded; but not the original coordinates. (It would also be nice to have separate chrom/pos/ref/alt columns, so as to not have to regex out the chromosome from the position.)
One alternative is to run vep with --vcf
output:
~/ensembl-vep/vep \
--input_file ~/vep_test/short_vep_input.tsv \
--format ensembl \
--no_stats \
--verbose \
--cache \
--offline \
--dir ~/.vep \
--assembly GRCh38 \
--show_ref_allele \
--uploaded_allele \
--output_file ~/vep_test/test_output.vcf \
--vcf
However, as per the VCF standard, this condenses all output to one line per variant (test_output.vcf
):
##fileformat=VCFv4.1
##VEP="v112.0" API="v112" time="2024-06-03 17:06:46" cache="/home/dgealow/.vep/homo_sapiens/112_GRCh38" ensembl=112.3add379 ensembl-funcgen=112.be19ffa ensembl-io=112.2851b6f ensembl-variation=112.4113356 1000genomes="phase3" COSMIC="98" ClinVar="202310" HGMD-PUBLIC="20204" assembly="GRCh38.p14" dbSNP="156" gencode="GENCODE 46" genebuild="2014-07" gnomADe="r2.1.1" gnomADg="v3.1.2" polyphen="2.2.3" regbuild="1.0" sift="6.2.1"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|REF_ALLELE|UPLOADED_ALLELE|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID">
##VEP-command-line='vep --assembly GRCh38 --cache --database 0 --format ensembl --input_file [PATH]/short_vep_input.tsv --no_stats --offline --output_file [PATH]/test_output.vcf --show_ref_allele --uploaded_allele --vcf --verbose'
#CHROM POS ID REF ALT QUAL FILTER INFO
1 1220770 1_1220771_AC/- GAC G . . CSQ=-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000263741|protein_coding||4/6|||||||||AC|GAC/G||-1||HGNC|HGNC:24188,-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000360001|protein_coding||4/6|||||||||AC|GAC/G||-1||HGNC|HGNC:24188,-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000403997|protein_coding||3/4|||||||||AC|GAC/G||-1|cds_start_NF&cds_end_NF|HGNC|HGNC:24188,-|intron_variant&NMD_transcript_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000465727|nonsense_mediated_decay||4/6|||||||||AC|GAC/G||-1||HGNC|HGNC:24188,-|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000478938|retained_intron|||||||||||AC|GAC/G|478|-1||HGNC|HGNC:24188,-|non_coding_transcript_exon_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000494748|retained_intron|1/3||||580-581||||||AC|GAC/G||-1||HGNC|HGNC:24188
1 1220772 1_1220772_C/T C T . . CSQ=T|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000263741|protein_coding||4/6|||||||||C|C/T||-1||HGNC|HGNC:24188,T|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000360001|protein_coding||4/6|||||||||C|C/T||-1||HGNC|HGNC:24188,T|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000403997|protein_coding||3/4|||||||||C|C/T||-1|cds_start_NF&cds_end_NF|HGNC|HGNC:24188,T|intron_variant&NMD_transcript_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000465727|nonsense_mediated_decay||4/6|||||||||C|C/T||-1||HGNC|HGNC:24188,T|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000478938|retained_intron|||||||||||C|C/T|479|-1||HGNC|HGNC:24188,T|non_coding_transcript_exon_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000494748|retained_intron|1/3||||580||||||C|C/T||-1||HGNC|HGNC:24188
1 1220794 1_1220795_-/CGGGCA G GCGGGCA . . CSQ=CGGGCA|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000263741|protein_coding||4/6||||||||||G/GCGGGCA||-1||HGNC|HGNC:24188,CGGGCA|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000360001|protein_coding||4/6||||||||||G/GCGGGCA||-1||HGNC|HGNC:24188,CGGGCA|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000403997|protein_coding||3/4||||||||||G/GCGGGCA||-1|cds_start_NF&cds_end_NF|HGNC|HGNC:24188,CGGGCA|intron_variant&NMD_transcript_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000465727|nonsense_mediated_decay||4/6||||||||||G/GCGGGCA||-1||HGNC|HGNC:24188,CGGGCA|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000478938|retained_intron||||||||||||G/GCGGGCA|501|-1||HGNC|HGNC:24188,CGGGCA|non_coding_transcript_exon_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000494748|retained_intron|1/3||||557-558|||||||G/GCGGGCA||-1||HGNC|HGNC:24188
1 1223144 1_1223145_CT/- ACT A . . CSQ=-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000263741|protein_coding||4/6|||||||||CT|ACT/A||-1||HGNC|HGNC:24188,-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000360001|protein_coding||4/6|||||||||CT|ACT/A||-1||HGNC|HGNC:24188,-|intron_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000403997|protein_coding||3/4|||||||||CT|ACT/A||-1|cds_start_NF&cds_end_NF|HGNC|HGNC:24188,-|downstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000459994|protein_coding_CDS_not_defined|||||||||||CT|ACT/A|4126|-1||HGNC|HGNC:24188,-|intron_variant&NMD_transcript_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000465727|nonsense_mediated_decay||4/6|||||||||CT|ACT/A||-1||HGNC|HGNC:24188,-|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000478938|retained_intron|||||||||||CT|ACT/A|2852|-1||HGNC|HGNC:24188,-|upstream_gene_variant|MODIFIER|SDF4|ENSG00000078808|Transcript|ENST00000494748|retained_intron|||||||||||CT|ACT/A|1794|-1||HGNC|HGNC:24188
[...]
This has the variant identification scheme that we use, but the comma/pipe/ampersand-delimited amalgam of an INFO field is quite annoying to parse.
What would be most convenient for us would be an option to output in --tsv
format, but with vcf-style CHROM/POS/REF/ALT
columns. (Perhaps --include_vcf_id_cols
?) An "UPLOADED_COORDINATES" column (paralleling "UPLOADED_ALLELE") would also work well.
(And suggestions for workarounds I haven't thought of here would also be much appreciated!)
Ah, here's a good workaround: just pass chrom:pos:ref:alt
-style variant identifiers (or whatever style is meaningful to you) as the final column of the input (https://useast.ensembl.org/info/docs/tools/vep/vep_formats.html#default)
Hi @dvg-p4,
As you say, customising your variant identifiers in a VCF file is the best workaround to uniquely identify variants (regardless of using VEP or not).
I will now close this issue, but feel free to open new issues in case you face any other problems.
Best regards, Nuno
We do not have the --minimal flag on the command line. VEP v110 minimises the alleles in our output (we use --tab). VEP v109 does not exhibit this (unpleasant) behaviour.
Our command line: /data/programs/bin/ngs/VEP/109/vep \ --cache --dir /data/programs/bin/ngs/VEP/cache \ --input_file $1 \ --output_file $2 \ --tab \ --force_overwrite \ --canonical \ --mane \ --verbose \ --symbol \ --af_gnomad \ --offline \ --assembly GRCh38 ...
Our input: 1 930224 930224 A/ATTCTT 1:930224:A:ATTCTT
Excerpt from v109 output: 1_930224_A/ATTCTT 1:930224 ATTCTT ENSG00000187634 ENST00000616016 Transcript frameshift_variant
Excerpt from v110 output: 1930225-/TTCTT 1:930224-930225 TTCTT ENSG00000187634