Tuks-ICMM / Pharmacogenetic-Analysis-Pipeline

A Snakemake-powered pipeline developed to perform variant-effect prediction and frequency analysis given multiple Variant Call Format datasets. This has been developed in partial fulfilment of an MSc in Bioinformatics at the University of Pretoria by Graeme Ford.
https://tuks-icmm.github.io/Pharmacogenetic-Analysis-Pipeline/
Creative Commons Attribution 4.0 International

[FEATURE] | HGVS addition to pipeline #8

Open Fatimabp opened 3 years ago

Fatimabp commented 3 years ago

Is your feature request related to a problem? Please describe. A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Describe the solution you'd like A clear and concise description of what you want to happen.

Describe alternatives you've considered, if any A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context or screenshots about the feature request here.

G-kodes commented 3 years ago

There is a dedicated Python package for generating accurate HGVS notation. Ideally, I would like to implement it and use the generated HGVS notations to query the Ensembl database (this provides the universal uniqueness that HGVS notation guarantees when querying variants). This would also facilitate the ALFA project integration mentioned in #9.

https://hgvs.readthedocs.io/en/stable/index.html

G-kodes commented 3 years ago

It has been brought to my attention that the HGVS Python package is not yet compatible with Windows. The maintainers of one of its dependencies have no intention of making that package Windows-compatible. However, it is an optional dependency, so a new roadmap item has been registered to mark it accordingly so that it does not break on Windows. Until then, we will have to write and debug this code on a Linux machine.

https://github.com/biocommons/hgvs/issues/522

G-kodes commented 3 years ago

I have performed a proof-of-concept test on a Linux machine using variant rs2259219 as a test case, with the following information:

Start Coordinate: 40843345
Stop Coordinate: 40843345
Reference Allele: C
Alternate Allele: G
Transcript ID: NC_000019.10
Transcript Type: g (Genomic)

I managed to construct NC_000019.10:g.40843345C>G, which matches the notation provided by Ensembl. The next issue is making sure that, during querying, we have access to all of this information so that we can construct HGVS notation for each variant and set it as our new ID.
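As a rough illustration of the construction step, here is a minimal sketch in plain Python that assembles a genomic (g.) substitution string from the fields listed above. It is not the API of the hgvs package used in the proof-of-concept; the function name `build_genomic_snv_hgvs` and its checks are hypothetical, and it only handles single-base substitutions.

```python
def build_genomic_snv_hgvs(accession: str, position: int, ref: str, alt: str) -> str:
    """Format a single-nucleotide substitution as a genomic (g.) HGVS string.

    Hypothetical helper for illustration only: it does plain string
    formatting and does not validate the alleles against a reference genome.
    """
    ref, alt = ref.upper(), alt.upper()
    valid_bases = set("ACGT")
    if len(ref) != 1 or len(alt) != 1 or ref not in valid_bases or alt not in valid_bases:
        raise ValueError("this sketch only handles single-base substitutions")
    if ref == alt:
        # A substitution needs two different alleles.
        raise ValueError("reference and alternate alleles must differ")
    if position < 1:
        raise ValueError("HGVS positions are 1-based")
    return f"{accession}:g.{position}{ref}>{alt}"


# Values taken from the rs2259219 test above.
print(build_genomic_snv_hgvs("NC_000019.10", 40843345, "C", "G"))
# prints NC_000019.10:g.40843345C>G
```

Insertions, deletions and coding (c.) or protein (p.) notation need considerably more logic than this, which is the reason for relying on a dedicated package.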

Fatimabp commented 3 years ago

Hi guys, I've been reading up on HGVS nomenclature and have a few concerns. Although HGVS is the most accurate way of representing variants, there do seem to be some issues because of the way things are named.

  1. We have to ensure we are using the correct version numbers. Some papers express HGVS with gene names, which makes it difficult to identify the protein isoform or version they are referring to.
  2. Repeat shifting: VCF deletions in repeats are shifted left, but in HGVS they are shifted right (towards the 3' end). If a variant is referred to by two different locations, it might not be identified as the same variant.
  3. The rules for HGVS nomenclature get insanely complicated to interpret and write, especially for introns and non-coding regions.

Let me know what you guys think.
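The repeat-shifting concern in point 2 can be shown with a small, self-contained sketch. The helpers `shift_deletion` and `apply_deletion` are hypothetical names, the sequence is made up, and positions are 0-based string indices rather than real VCF or HGVS coordinates. The point is only that the same deletion inside a repeat can be written at several positions.

```python
def shift_deletion(seq: str, start: int, length: int, direction: str) -> int:
    """Return the start index of an equivalent deletion after shifting it
    as far left or right as possible within a repeat.

    Hypothetical helper for illustration: 0-based indices on a plain string.
    """
    if direction not in ("left", "right"):
        raise ValueError("direction must be 'left' or 'right'")
    if length < 1 or start < 0 or start + length > len(seq):
        raise ValueError("deletion must lie inside the sequence")
    if direction == "left":
        # Moving one base left is equivalent when the base entering the
        # deleted window matches the base leaving it.
        while start > 0 and seq[start - 1] == seq[start + length - 1]:
            start -= 1
    else:
        while start + length < len(seq) and seq[start] == seq[start + length]:
            start += 1
    return start


def apply_deletion(seq: str, start: int, length: int) -> str:
    """Return the sequence with `length` bases removed from `start`."""
    return seq[:start] + seq[start + length:]


seq = "GCACACAT"  # made-up sequence containing a CA repeat
left = shift_deletion(seq, 3, 2, "left")    # left-aligned, as in VCF
right = shift_deletion(seq, 3, 2, "right")  # right-aligned, as in HGVS
print(left, right)  # prints 1 5
# Both placements produce the same resulting sequence.
print(apply_deletion(seq, left, 2) == apply_deletion(seq, right, 2))  # prints True
```

If we match variants by position alone, the two placements would look like different variants, so whichever tool generates our HGVS notation needs to normalise them before we compare IDs.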