Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

HGVS input does not take into account cDNA / genome alignment gaps. #1053

Open davmlaw opened 3 years ago

davmlaw commented 3 years ago

When converting c. HGVS from cDNA to genome coordinates, VEP appears to not take into account alignment gaps, where the cDNA has insertions/deletions vs the reference sequence.

Submission example:

https://asia.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=iYp7iLetUEL5Y9pf-7647494

HGVS: NM_015120.4:c.79G>C resolves to coordinates 2:73385947-73385947 - HGVS of NM_015120.4:c.82G>C

HGVS: NM_015120.4:c.82G>C resolves to coordinates 2:73385950-73385950 - HGVS of NM_015120.4:c.85G>C

NM_015120.4 has a cDNA_match with Gap=M185 I3 M250, meaning there is a 3 bp insertion in the cDNA relative to the genome. I think the genome coordinates -> HGVS direction correctly takes gaps into account, just not the other way round.
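To illustrate the difference, here is a minimal Python sketch of gap-aware mapping. It assumes the Gap string follows the GFF3 convention (M = aligned in both sequences, I = bases present only in the transcript, D = bases present only in the genome). The function names are hypothetical, offsets are 1-based within a single aligned block, and strand and exon structure are ignored. It is not how VEP implements this.

```python
import re

def parse_gap(gap):
    """Parse a GFF3 Gap attribute such as 'M185 I3 M250' into (op, length) pairs."""
    ops = []
    for token in gap.split():
        m = re.fullmatch(r"([MID])(\d+)", token)
        if m is None:
            raise ValueError(f"unsupported Gap token: {token!r}")
        ops.append((m.group(1), int(m.group(2))))
    return ops

def transcript_to_reference_offset(gap, t_offset):
    """Map a 1-based transcript offset within the aligned block to a
    1-based reference offset. Returns None when the base lies inside an
    insertion and therefore has no counterpart on the reference."""
    if t_offset < 1:
        raise ValueError("offsets are 1-based")
    t_pos = 0  # transcript bases consumed so far
    r_pos = 0  # reference bases consumed so far
    for op, length in parse_gap(gap):
        if op == "M":  # consumes both sequences
            if t_offset <= t_pos + length:
                return r_pos + (t_offset - t_pos)
            t_pos += length
            r_pos += length
        elif op == "I":  # consumes transcript only
            if t_offset <= t_pos + length:
                return None
            t_pos += length
        else:  # "D" consumes reference only
            r_pos += length
    raise ValueError("offset lies beyond the aligned block")

gap = "M185 I3 M250"
print(transcript_to_reference_offset(gap, 185))  # last base before the insertion
print(transcript_to_reference_offset(gap, 187))  # inside the insertion
print(transcript_to_reference_offset(gap, 189))  # first base after the insertion
```

With this Gap string, a gap-unaware conversion would place transcript offset 189 at reference offset 189, three bases away from the gap-aware result of 186. That is the same size as the shift seen in the submission example.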

sarahhunt commented 3 years ago

Hi @davmlaw

Thanks for getting in touch. You are correct: RefSeq transcript-genome alignment corrections are applied when a variant is input with genomic coordinates (or with an ID from dbSNP, ClinVar, etc. for which genomic coordinates can be looked up), but not when starting with protein or transcript coordinates.

We would like to add this functionality, but it's not something we can prioritise at the moment, so we will ensure our docs reflect the current situation.

Best wishes, Sarah

worker000000 commented 3 years ago

So does this affect the accuracy of VEP annotation?

davmlaw commented 3 years ago

@worker000000 - the HGVS can resolve to the wrong coordinate, so the annotation will be wrong

davmlaw commented 3 years ago

@sarahhunt does this affect Ensembl transcripts? I couldn't find any alignment information in the Ensembl GFF files. Are Ensembl transcripts taken directly from the genomic sequence, so that they don't have the mismatch problem? If that's the case, then what about different genome builds/patches having different sequence?

sarahhunt commented 3 years ago

@davmlaw - Ensembl transcripts completely match the reference genome they are annotated against, so HGVS transcript level variant descriptions will be mapped to the reference genome and annotated accurately.

Where the underlying reference sequence changed in the move from GRCh37 to GRCh38, we incremented the transcript version. We annotate haplotypes/patches separately to the main assembly and assign different Ensembl identifiers.

The number of RefSeq transcripts which don't match the reference has decreased from GRCh37 to GRCh38, but some do remain. We are collaborating with NCBI to define a set of transcripts we recommend for reporting, which completely match the reference genome. This should help reduce the complexities in resolving HGVS transcript level descriptions in future.

ifokkema commented 2 months ago

I believe this issue also caused NC_000009.12:g.92474742C>A (HGVS input in the VEP web interface) to be annotated with NM_017680.6:c.156G>T and NM_001193335.3:c.156G>T, which are invalid variants since there is a C at those positions. The correct position should be c.153. If VEP has custom code handling the alignments, perhaps using established libraries would save you all a few headaches?
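A simple sanity check could flag outputs like this. The Python sketch below tests whether the reference base named in a c. substitution matches the transcript's coding sequence at that position. The function name and the toy sequence are made up for illustration (it is not the real NM_017680.6 sequence), and only simple coding substitutions are handled.

```python
import re

# Only simple substitutions inside the CDS, e.g. "c.156G>T"
# (no intronic offsets, UTR positions, or indels).
_SUBSTITUTION = re.compile(r"c\.(\d+)([ACGT])>([ACGT])")

def ref_base_matches(cds, hgvs_c):
    """Return True if the reference base in hgvs_c equals the CDS base there."""
    m = _SUBSTITUTION.fullmatch(hgvs_c)
    if m is None:
        raise ValueError(f"unsupported HGVS description: {hgvs_c!r}")
    pos, ref = int(m.group(1)), m.group(2)
    if not 1 <= pos <= len(cds):
        raise ValueError("position outside the coding sequence")
    return cds[pos - 1].upper() == ref

toy_cds = "ATGGCC"  # hypothetical sequence, for illustration only
print(ref_base_matches(toy_cds, "c.4G>T"))  # position 4 of the toy CDS is G
print(ref_base_matches(toy_cds, "c.5G>T"))  # position 5 of the toy CDS is C
```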