Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
437 stars 150 forks source link

VEP custom outputs coordinates of the annotation by adding +1 to start position in Structural Variants #1685

Closed jperales closed 4 weeks ago

jperales commented 1 month ago

I find confusing that VEP custom outputs the coordinates of the annotation derived from a Structural Variant (SV) by adding +1 to the start position despite the coordinates being in a 1-based coordinate system already. It reminds me of a conversion from 0-based to 1-based coordinate system. However the --custom file is a VCF, which is 1-based already. So it would not make sense to make such conversion. Could you please tell me more about the coordinate system of the annotation derived from Structural Variants?

Moreover, I noticed that it does not happen with annotations derived from SNVs, it does only with SVs. If that is a bug, please feel free to add a label to it. Thank you!

I include toy examples below for both SNV and SV cases below with the following summary table:

Input variant (VCF format) Variant type Annotation variant (VCF format) VEP custom output coordinates for the annotation Observation
chr1 113834946 . A G . . . SNV chr1 113834946 rs2476601 A G . . . chr1:113834946-113834946 It keeps the original start position
chr19 418826 . N <DEL> . . END=467860 SV chr19 418826 variant_is_80_245973__DEL N <DEL> . PASS END=467860 chr19:418827-467860 It adds +1 to original start position

Test case: SNV

Input variant (VCF format)

chr1 113834946 . A G . . .

Custom file (VCF format), showing SNVs at the position.

Please notice that the second variant is an exact match of the input variant.

$bcftools view -H gnomad.exomes.v4.0.sites.chr1.vcf.bgz chr1:113834946-113834946 | cut -f 1-5
chr1    113834946       .       A       AGCG
chr1    113834946       rs2476601       A       G
chr1    113834946       rs2476601       A       T

Command line and output --> Annotation coordinates: "name":"chr1:113834946-113834946"

$vep --no_stats -id "chr1 113834946 . A G . . ." -o "STDOUT" --json --assembly GRCh38 --symbol --numbers --offline --custom file=gnomad.exomes.v4.0.sit
es.chr1.vcf.bgz,short_name=gnomADe,format=vcf,type=exact,coords=1,fields=AF
{"strand":1,"end":113834946,"most_severe_consequence":"missense_variant","id":".","start":113834946,"input":"chr1 113834946 . A G . . .","transcript_consequences":[{"variant_allele":"G","amino_acids":"W/R","codons":"Tgg/Cgg","gene_symbol":"PTPN22","cds_start":1858,"strand":-1,"protein_end":620,"exon":"14/21","gene_id":"ENSG00000134242","impact":"MODERATE","protein_start":620,"hgnc_id":"HGNC:9652","cdna_end":1947,"cdna_start":1947,"consequence_terms":["missense_variant"],"transcript_id":"ENST00000359785","cds_end":1858,"gene_symbol_source":"HGNC"},{"amino_acids":"W/R","variant_allele":"G","codons":"Tgg/Cgg","gene_symbol":"PTPN22","cds_start":1858,"strand":-1,"protein_end":620,"gene_id":"ENSG00000134242","exon":"14/20","impact":"MODERATE","cdna_end":1947,"protein_start":620,"hgnc_id":"HGNC:9652","consequence_terms":["missense_variant"],"cdna_start":1947,"transcript_id":"ENST00000420377","cds_end":1858,"gene_symbol_source":"HGNC"},{"transcript_id":"ENST00000460620","gene_symbol_source":"HGNC","hgnc_id":"HGNC:9652","consequence_terms":["intron_variant"],"gene_id":"ENSG00000134242","intron":"6/7","impact":"MODIFIER","variant_allele":"G","gene_symbol":"PTPN22","strand":-1},{"gene_symbol_source":"HGNC","transcript_id":"ENST00000484147","cdna_start":1899,"consequence_terms":["non_coding_transcript_exon_variant"],"hgnc_id":"HGNC:9652","cdna_end":1899,"impact":"MODIFIER","exon":"14/16","gene_id":"ENSG00000134242","strand":-1,"gene_symbol":"PTPN22","variant_allele":"G"},{"gene_id":"ENSG00000134242","exon":"9/15","protein_end":493,"impact":"MODERATE","codons":"Tgg/Cgg","variant_allele":"G","amino_acids":"W/R","strand":-1,"cds_start":1477,"gene_symbol":"PTPN22","cds_end":1477,"gene_symbol_source":"HGNC","transcript_id":"ENST00000525799","cdna_start":1566,"consequence_terms":["missense_variant"],"hgnc_id":"HGNC:9652","cdna_end":1566,"protein_start":493},{"gene_symbol":"PTPN22","cds_start":1693,"strand":-1,"amino_acids":"W/R","variant_allele":"G","codons":"Tgg/Cgg","impact":"MODERATE","protein_end":565,"exon":"12/19","gene_id":"ENSG00000134242","cdna_end":1768,"hgnc_id":"HGNC:9652","protein_start":565,"consequence_terms":["missense_variant"],"cdna_start":1768,"transcript_id":"ENST00000528414","cds_end":1693,"gene_symbol_source":"HGNC"},{"impact":"MODIFIER","exon":"10/17","gene_id":"ENSG00000134242","gene_symbol":"PTPN22","strand":-1,"variant_allele":"G","transcript_id":"ENST00000532224","gene_symbol_source":"HGNC","hgnc_id":"HGNC:9652","cdna_end":1629,"consequence_terms":["3_prime_UTR_variant","NMD_transcript_variant"],"cdna_start":1629},{"impact":"MODERATE","protein_end":596,"gene_id":"ENSG00000134242","exon":"13/20","gene_symbol":"PTPN22","cds_start":1786,"strand":-1,"amino_acids":"W/R","variant_allele":"G","codons":"Tgg/Cgg","transcript_id":"ENST00000538253","cds_end":1786,"gene_symbol_source":"HGNC","cdna_end":1916,"protein_start":596,"hgnc_id":"HGNC:9652","cdna_start":1916,"consequence_terms":["missense_variant"]},{"consequence_terms":["intron_variant","non_coding_transcript_variant"],"variant_allele":"G","strand":1,"gene_id":"ENSG00000231128","transcript_id":"ENST00000664434","intron":"4/5","impact":"MODIFIER"}],"seq_region_name":"chr1","custom_annotations":{"gnomADe":[{"allele":"G","name":"chr1:113834946-113834946","fields":{"FILTER":"PASS","AF":0.911187}}]},"allele_string":"A/G","assembly_name":"GRCh38"}

Please note the coordinate of the variant the VCF for custom annotation (chr1 113834946 rs2476601 A G) is the same as the output into "name":"chr1:113834946-113834946".

Test case: SV

Input variant (VCF format):

chr19 418826 . N <DEL> . . END=467860

Annotation variant

$bcftools view -H gnomad.v4.1.cnv.all.vcf.gz chr19:418826-467860 | grep '418826' | grep 'END=467860'
chr19   418826  variant_is_80_245973__DEL       N       <DEL>   .       PASS    END=467860;SF=6.4614e-06

Command line and output --> Annotation coordinates: "name":"chr19:418827-467860"

 $vep --no_stats -id "chr19 418826 . N <DEL> . . END=467860" -o "STDOUT" --json --assembly GRCh38 --symbol --numbers --offline --custom file=gnomad.v4.1.cnv.all.vcf.gz,short_name=gnomADe,format=vcf,type=exact,coords=1,fields=SF
{"assembly_name":"GRCh38","end":467860,"allele_string":"deletion","input":"chr19 418826 . N <DEL> . . END=467860","strand":1,"seq_region_name":"chr19","most_severe_consequence":"transcript_ablation","id":".","start":418827,"transcript_consequences":[{"exon":"1-12/13","transcript_id":"ENST00000264554","strand":-1,"impact":"HIGH","gene_symbol_source":"HGNC","consequence_terms":["stop_lost","feature_truncation","coding_sequence_variant","5_prime_UTR_variant","3_prime_UTR_variant","intron_variant"],"gene_id":"ENSG00000129946","hgnc_id":"HGNC:29869","intron":"1-12/12","bp_overlap":42207,"percentage_overlap":94.96,"variant_allele":"deletion","gene_symbol":"SHC2"},{"gene_id":"ENSG00000181781","hgnc_id":"HGNC:26841","intron":"2-3/3","bp_overlap":4500,"percentage_overlap":38.72,"variant_allele":"deletion","gene_symbol":"CIMAP1D","exon":"3-4/4","transcript_id":"ENST00000315489","impact":"HIGH","strand":-1,"consequence_terms":["stop_lost","feature_truncation","coding_sequence_variant","3_prime_UTR_variant","intron_variant"],"gene_symbol_source":"HGNC"},{"hgnc_id":"HGNC:26841","gene_id":"ENSG00000181781","gene_symbol":"CIMAP1D","percentage_overlap":39.14,"bp_overlap":4515,"variant_allele":"deletion","intron":"1-2/2","strand":-1,"impact":"HIGH","transcript_id":"ENST00000382696","exon":"2-3/3","gene_symbol_source":"HGNC","consequence_terms":["stop_lost","feature_truncation","coding_sequence_variant","3_prime_UTR_variant","intron_variant"]},{"transcript_id":"ENST00000516730","impact":"HIGH","strand":1,"consequence_terms":["transcript_ablation"],"gene_symbol_source":"HGNC","gene_id":"ENSG00000252539","hgnc_id":"HGNC:43362","bp_overlap":112,"percentage_overlap":100,"variant_allele":"deletion","gene_symbol":"RNA5SP462"},{"gene_symbol":"SHC2","percentage_overlap":32,"bp_overlap":1053,"variant_allele":"deletion","intron":"1/1","hgnc_id":"HGNC:29869","gene_id":"ENSG00000129946","gene_symbol_source":"HGNC","consequence_terms":["feature_truncation","non_coding_transcript_exon_variant","intron_variant"],"strand":-1,"impact":"HIGH","transcript_id":"ENST00000587423","exon":"1/2"},{"transcript_id":"ENST00000588376","exon":"1-2/3","strand":-1,"impact":"HIGH","gene_symbol_source":"HGNC","consequence_terms":["feature_truncation","non_coding_transcript_exon_variant","intron_variant"],"gene_id":"ENSG00000129946","hgnc_id":"HGNC:29869","percentage_overlap":64.13,"variant_allele":"deletion","bp_overlap":4002,"intron":"1-2/2","gene_symbol":"SHC2"},{"gene_symbol_source":"HGNC","consequence_terms":["transcript_ablation"],"strand":-1,"impact":"HIGH","transcript_id":"ENST00000590113","gene_symbol":"SHC2","percentage_overlap":100,"bp_overlap":3489,"variant_allele":"deletion","hgnc_id":"HGNC:29869","gene_id":"ENSG00000129946"},{"consequence_terms":["stop_lost","feature_truncation","coding_sequence_variant","5_prime_UTR_variant","3_prime_UTR_variant","intron_variant","NMD_transcript_variant"],"gene_symbol_source":"HGNC","transcript_id":"ENST00000590170","exon":"1-5/6","flags":["cds_start_NF"],"impact":"HIGH","strand":-1,"bp_overlap":17442,"variant_allele":"deletion","percentage_overlap":91.47,"intron":"1-5/5","gene_symbol":"SHC2","gene_id":"ENSG00000129946","hgnc_id":"HGNC:29869"},{"consequence_terms":["transcript_ablation"],"gene_symbol_source":"HGNC","transcript_id":"ENST00000590222","impact":"HIGH","strand":-1,"flags":["cds_start_NF"],"bp_overlap":29831,"variant_allele":"deletion","percentage_overlap":100,"gene_symbol":"SHC2","gene_id":"ENSG00000129946","hgnc_id":"HGNC:29869"},{"strand":-1,"impact":"HIGH","exon":"2-3/3","transcript_id":"ENST00000591681","gene_symbol_source":"HGNC","consequence_terms":["feature_truncation","non_coding_transcript_exon_variant","intron_variant"],"hgnc_id":"HGNC:26841","gene_id":"ENSG00000181781","gene_symbol":"CIMAP1D","intron":"1-2/2","variant_allele":"deletion","percentage_overlap":43.72,"bp_overlap":3694}],"custom_annotations":{"gnomADe":[{"name":"chr19:418827-467860","fields":{"FILTER":"PASS","SF":6.46139531678067e-06}}]}}

Please note that VEP has added +1 to start position after annotation: chr19 418826 (VCF) Vs. chr19:418827 (annotation coordinates)

System

Data files

They include:

jperales commented 1 month ago

Just to let you know that this finding is also present in VEP 112 . Thanks!

Javier

nuno-agostinho commented 1 month ago

Hi Javier,

Hope you are doing well.

The VCF specification (section 1.6.1) denotes the following:

  1. REF — reference base(s): (...) If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String <ID>) then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism.

So, for a variant with a symbolic allele in VCF format, the position reflects the preceding base of the structural variant. As such, you need to increment 1 to get its start position.

Hope this was clear, but tell me if otherwise.

Cheers, Nuno

jperales commented 4 weeks ago

Hi Nuno, You made it very clear. Apologies I did not know about this VCF specification in the POS for structural variants. Then it makes sense to increment 1 the start position as it is already implemented. I close this issue. Thank you very much!

Best, Javier