genome-nexus / annotation-tools

Tools developed for AACR GENIE to allow annotation of vcf and maf files from a number of centers and merging the results
MIT License

Genes referred to by both their approved and previous name #29

Closed thomasyu888 closed 3 years ago

thomasyu888 commented 3 years ago

@inodb There are two genes in the mutation data (v9.6) that are referred to by both their approved name and their previous name:

First pair: GPR124 (460 mutations) and ADGRA2 (1 mutation). Evidence that they are the same gene: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:17849

Second pair: PARK2 (1202 mutations) and PRKN (3 mutations). Evidence: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:8607

ao508 commented 3 years ago

@thomasyu888 I'm not really sure if there's anything that we can do on our end for this. If you look at the HGNC pages you shared above, the genes actually map to different NCBI Entrez Gene IDs.

Gene symbols on their own are not unique unfortunately. A former symbol for gene 1 can be the current symbol for gene 2 or an alias for yet another gene. This is why we primarily rely on Entrez Gene IDs as the gene identifier.

This is an upstream data issue. If centers provide their data as MAF(s), they need to make sure that the correct Entrez Gene ID is in the Entrez_Gene_Id column for those variants.

If the data is provided as a VCF, then maybe they can provide their own symbol --> Entrez Gene ID mapping to properly handle these cases with ambiguous gene symbols.
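To make the point about non-unique symbols concrete, here is a minimal sketch of how a symbol index could flag ambiguous symbols before they are used as identifiers. The gene records, symbols, and IDs below are placeholders invented for illustration, not real HGNC or NCBI data, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (entrez_gene_id, approved_symbol, previous_symbols).
# All IDs and symbols are placeholders, not real HGNC or NCBI data.
GENES = [
    (1001, "GENE_A", ["GENE_B"]),  # GENE_B is a former symbol of gene 1001
    (1002, "GENE_B", []),          # and also the current symbol of gene 1002
    (1003, "GENE_C", ["OLD_C"]),
]


def build_symbol_index(genes):
    """Map every approved or previous symbol to the set of gene IDs using it."""
    index = defaultdict(set)
    for entrez_id, approved, previous in genes:
        index[approved].add(entrez_id)
        for symbol in previous:
            index[symbol].add(entrez_id)
    return index


def ambiguous_symbols(index):
    """Return the symbols that resolve to more than one gene ID."""
    return {symbol: sorted(ids) for symbol, ids in index.items() if len(ids) > 1}


index = build_symbol_index(GENES)
print(ambiguous_symbols(index))
```

A symbol that shows up in this result cannot be resolved from the symbol alone, which is why an explicit Entrez Gene ID (or a center-provided mapping) is needed for those rows.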

inodb commented 3 years ago

It might be good to dig a little deeper on our end as well, apart from sorting this out upstream. I imagine that if people only give genomic locations, we should probably not annotate with both approved and previous symbols. I'm not sure that's what is happening here, but it is possible. My guess is that VEP occasionally returns one or the other, and Genome Nexus then just returns whatever VEP returns. We might want to at least filter previous symbols out of the annotation_summary response. That said, I'm not entirely sure how easy this will be to fix if that's the case, but I will take a look.

inodb commented 3 years ago

@thomasyu888 I can't seem to find ADGRA2 and PRKN here: https://www.synapse.org/#!Synapse:syn5571527.255.

Would you mind sharing the VCF or MAF pos/ref/alt info for the records that give the previous symbol ADGRA2 and PRKN?

thomasyu888 commented 3 years ago

@inodb Those were manually removed from the 9.6-consortium release. You can see those mutations in the 9.5-consortium release here.

Here is the input.txt

inodb commented 3 years ago

@thomasyu888

For

ADGRA2  8       37699139        37699138        GENIE-SAGE-1-1  DEL     CCGCCCCGGGCCCTGCCCGCCGCC        -

The MAF format is incorrect: this is a deletion of 24 bases, so it should be:

GPR124  8       37699138        37699161        GENIE-SAGE-1-1  DEL     CCGCCCCGGGCCCTGCCCGCCGCC        -

That gives the correct hugo symbol annotation (GPR124) and results in an inframe deletion: https://www.genomenexus.org/variant/8:g.37699138_37699161del

Same for the other mutation event:

PRKN    6       162683593       162683592       GENIE-SAGE-1-1  DEL     CAGTGTGCAGAATGACAGCCAGCCCCACAGAGTCTCCTGG        -

This is also incorrect MAF format; it should be a deletion of 40 bases:

PARK2    6       162683593       162683632       GENIE-SAGE-1-1  DEL     CAGTGTGCAGAATGACAGCCAGCCCCACAGAGTCTCCTGG        -

That gives the correct PARK2 hugo symbol https://www.genomenexus.org/variant/6:g.162683593_162683632del
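For reference, the variant strings in the two Genome Nexus links follow the pattern chromosome:g.start_enddel. A small sketch of building that string from MAF-style coordinates is shown here; the function name is hypothetical, and the single-base form (no underscore) is an assumption based on general HGVS conventions rather than anything in this thread.

```python
def deletion_hgvsg(chromosome: str, start: int, end: int) -> str:
    """Build a genomic deletion string such as '8:g.37699138_37699161del'.

    Hypothetical helper: start and end are the first and last deleted bases.
    """
    if end < start:
        raise ValueError(f"end ({end}) is before start ({start})")
    if start == end:
        # Assumed single-base form: only one position is written.
        return f"{chromosome}:g.{start}del"
    return f"{chromosome}:g.{start}_{end}del"


print(deletion_hgvsg("8", 37699138, 37699161))    # 8:g.37699138_37699161del
print(deletion_hgvsg("6", 162683593, 162683632))  # 6:g.162683593_162683632del
```

Note that the original rows, where the end position is smaller than the start position, would raise an error here instead of producing a variant string.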

I'm not sure if this error is introduced further upstream or if it's incorrect in the data provided

We might want to add some check for this on our end:

https://github.com/genome-nexus/genome-nexus-annotation-pipeline/issues/174

But it's probably also good to check with the center that submitted the data what the intended alteration is.
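A check along these lines could be as simple as comparing the end position with the start position plus the reference allele length. The sketch below assumes, as in the corrected rows above, that for a deletion the start is the first deleted base and the end is the last deleted base. The function name is hypothetical, and this is not the check tracked in the linked issue.

```python
def check_deletion(start: int, end: int, reference_allele: str):
    """Return None if a DEL row is consistent, otherwise a description of the problem.

    Assumes start is the first deleted base and end is the last deleted base,
    so end must equal start + len(reference_allele) - 1.
    """
    expected_end = start + len(reference_allele) - 1
    if end != expected_end:
        return (
            f"end position {end} does not match start {start} plus "
            f"{len(reference_allele)} deleted bases (expected {expected_end})"
        )
    return None


REF_24 = "CCGCCCCGGGCCCTGCCCGCCGCC"
REF_40 = "CAGTGTGCAGAATGACAGCCAGCCCCACAGAGTCTCCTGG"

# Rows as submitted (end is before start) and as corrected in this thread.
print(check_deletion(37699139, 37699138, REF_24))
print(check_deletion(37699138, 37699161, REF_24))
print(check_deletion(162683593, 162683592, REF_40))
print(check_deletion(162683593, 162683632, REF_40))
```

Both submitted rows are reported as inconsistent, while both corrected rows pass. The check cannot tell which coordinate the center intended, so flagged rows would still need to be confirmed with the submitter.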

inodb commented 3 years ago

@leexgh just confirmed that all of these give a Mutation_Status of FAILED in the output, so I think we can close this.