Open davmlaw opened 3 years ago
Hi @davmlaw
Thanks for getting in touch. You are correct: RefSeq transcript-genome alignment corrections are applied when a variant is input with genomic coordinates (or with an id from dbSNP, ClinVar, etc. for which genomic coordinates can be looked up), but not when starting from protein or transcript coordinates.
We would like to add this functionality, but it's not something we can prioritise at the moment, so we will ensure our docs reflect the current situation.
Best wishes, Sarah
So does this affect the accuracy of VEP annotation?
@worker000000 - the HGVS can resolve to the wrong genomic coordinate, in which case the annotation will be wrong.
@sarahhunt does this affect Ensembl transcripts? I couldn't find any alignment information in the Ensembl GFF files. Are Ensembl transcripts taken from the genomic sequence, so that they don't have the mismatch problem? If that's the case, what about different genome builds/patches having different sequence?
@davmlaw - Ensembl transcripts completely match the reference genome they are annotated against, so HGVS transcript level variant descriptions will be mapped to the reference genome and annotated accurately.
Where the underlying reference sequence changed in the move from GRCh37 to GRCh38, we incremented the transcript version. We annotate haplotypes/patches separately to the main assembly and assign different Ensembl identifiers.
The number of RefSeq transcripts which don't match the reference has decreased from GRCh37 to GRCh38, but some do remain. We are collaborating with NCBI to define a set of transcripts we recommend for reporting, which completely match the reference genome. This should help reduce the complexities in resolving HGVS transcript level descriptions in future.
I believe this issue also caused NC_000009.12:g.92474742C>A (HGVS input in the VEP web interface) to be annotated with NM_017680.6:c.156G>T and NM_001193335.3:c.156G>T, which are invalid variants since there is a C at those positions. The correct position should be c.153. If VEP has custom code handling the alignments, perhaps using established libraries would save you all a few headaches?
When converting c. HGVS from cDNA to genome coordinates, VEP appears to not take into account alignment gaps, where the cDNA has insertions/deletions vs the reference sequence.
Submission example:
https://asia.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=iYp7iLetUEL5Y9pf-7647494
HGVS: NM_015120.4:c.79G>C resolves to coordinates 2:73385947-73385947 - HGVS of NM_015120.4:c.82G>C
HGVS: NM_015120.4:c.82G>C resolves to coordinates 2:73385950-73385950 - HGVS of NM_015120.4:c.85G>C
NM_015120.4 has cDNA_match Gap=M185 I3 M250, meaning there is a 3bp insertion. I think the genome coordinates -> HGVS direction correctly takes gaps into account, just not the other way around.
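To illustrate the gap handling described above, here is a minimal Python sketch. It is not VEP's code: the function names are hypothetical, offsets are 0-based from the start of the aligned block, and strand, exon structure and the CDS start are ignored. It assumes the GFF3 meaning of the Gap operations, where M is a match, I is sequence present only in the transcript, and D is sequence present only in the genome.

```python
import re


def parse_gap(gap):
    """Parse a GFF3 Gap attribute such as 'M185 I3 M250' into (op, length) pairs."""
    ops = []
    for token in gap.split():
        m = re.fullmatch(r"([MID])(\d+)", token)
        if m is None:
            raise ValueError(f"unsupported Gap token: {token!r}")
        ops.append((m.group(1), int(m.group(2))))
    return ops


def transcript_to_genome_offset(gap, t_offset):
    """Map a transcript offset to a genome offset.

    Returns None when the base lies in a transcript-only insertion (I),
    since it has no genomic counterpart.
    """
    t_pos = g_pos = 0
    for op, length in parse_gap(gap):
        if op == "M":
            if t_offset < t_pos + length:
                return g_pos + (t_offset - t_pos)
            t_pos += length
            g_pos += length
        elif op == "I":  # bases only in the transcript
            if t_offset < t_pos + length:
                return None
            t_pos += length
        else:  # "D": bases only in the genome
            g_pos += length
    raise IndexError("transcript offset is beyond the aligned block")


def genome_to_transcript_offset(gap, g_offset):
    """Map a genome offset to a transcript offset (None inside a D block)."""
    t_pos = g_pos = 0
    for op, length in parse_gap(gap):
        if op == "M":
            if g_offset < g_pos + length:
                return t_pos + (g_offset - g_pos)
            t_pos += length
            g_pos += length
        elif op == "I":
            t_pos += length
        else:  # "D"
            if g_offset < g_pos + length:
                return None
            g_pos += length
    raise IndexError("genome offset is beyond the aligned block")


GAP = "M185 I3 M250"  # Gap attribute quoted above for NM_015120.4
print(transcript_to_genome_offset(GAP, 184))  # last base before the insertion
print(transcript_to_genome_offset(GAP, 185))  # inside the insertion
print(transcript_to_genome_offset(GAP, 200))  # downstream of the insertion
print(genome_to_transcript_offset(GAP, 200))  # reverse direction
```

With this Gap string, a mapping that ignores the gap would place transcript offset 200 at genome offset 200, whereas the gap-aware mapping gives 197. Converting genome offset 200 back with the gap-aware function gives transcript offset 203, the same 3-base shift seen in the c.79 -> c.82 round trip.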