Closed thomasyu888 closed 3 years ago
@thomasyu888 I'm not really sure if there's anything that we can do on our end for this. If you look at the HGNC pages you shared above, the genes actually map to different NCBI Entrez Gene IDs.
Unfortunately, gene symbols on their own are not unique. A former symbol for gene 1 can be the current symbol for gene 2, or an alias for yet another gene. This is why we primarily rely on Entrez Gene IDs as the gene identifier.
This is an upstream data issue. If centers are providing their data as MAF(s), then they would need to make sure that the correct Entrez Gene ID is in the Entrez_Gene_Id column for those variants.
If the data is provided as a VCF, then maybe they can provide their own mapping from symbol to Entrez Gene ID to properly handle these cases with ambiguous gene symbols.
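A minimal sketch of what such a center-supplied mapping could look like, assuming it arrives as (symbol, Entrez Gene ID) rows. The function names, symbols, and IDs below are placeholders for illustration, not anything that exists in Genome Nexus or the GENIE pipeline. The idea is to resolve a symbol only when it points at exactly one gene and to flag everything else for manual review:

```python
from collections import defaultdict


def build_symbol_index(rows):
    """Group Entrez Gene IDs by symbol from (symbol, entrez_gene_id) rows."""
    index = defaultdict(set)
    for symbol, entrez_gene_id in rows:
        index[symbol].add(entrez_gene_id)
    return index


def resolve_entrez_gene_id(symbol, index):
    """Return the single Entrez Gene ID for a symbol, or None if unknown or ambiguous."""
    ids = index.get(symbol, set())
    return next(iter(ids)) if len(ids) == 1 else None


# Placeholder symbols and IDs for illustration only; real values would come
# from the center's own curated mapping.
rows = [
    ("SYMBOL_A", 1001),  # current symbol of gene 1001
    ("SYMBOL_B", 1002),  # current symbol of gene 1002
    ("SYMBOL_B", 1001),  # also a former symbol of gene 1001, so ambiguous
]
index = build_symbol_index(rows)
unambiguous = resolve_entrez_gene_id("SYMBOL_A", index)
ambiguous = resolve_entrez_gene_id("SYMBOL_B", index)
unknown = resolve_entrez_gene_id("SYMBOL_C", index)
```

Returning None for ambiguous symbols, rather than picking one gene, keeps the decision with whoever curates the mapping.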
It might be good to dig a little deeper on our end as well, apart from sorting this out upstream. I imagine that if people only give genomic locations, we should probably not annotate with both approved and previous symbols. I'm not sure that's what is happening here, but it is possible. My guess is that VEP occasionally returns one or the other, and Genome Nexus then just returns whatever VEP returns. We might want to filter out previous symbols from the annotation_summary response at least. That said, I'm not entirely sure how easy this will be to fix if that's the case, but I will take a look.
@thomasyu888 I can't seem to find ADGRA2 and PRKN here: https://www.synapse.org/#!Synapse:syn5571527.255.
Would you mind sharing the VCF or MAF pos/ref/alt info for the records that give the previous symbols ADGRA2 and PRKN?
@thomasyu888
For
ADGRA2 8 37699139 37699138 GENIE-SAGE-1-1 DEL CCGCCCCGGGCCCTGCCCGCCGCC -
The MAF format is incorrect (the end position precedes the start position). This is a deletion of 24 bases, so the record should be:
GPR124 8 37699138 37699161 GENIE-SAGE-1-1 DEL CCGCCCCGGGCCCTGCCCGCCGCC -
That gives the correct Hugo symbol annotation (GPR124) and results in an in-frame deletion: https://www.genomenexus.org/variant/8:g.37699138_37699161del
Same for the other mutation event:
PRKN 6 162683593 162683592 GENIE-SAGE-1-1 DEL CAGTGTGCAGAATGACAGCCAGCCCCACAGAGTCTCCTGG -
This is also incorrect MAF format. It should be a deletion of 40 bases:
PARK2 6 162683593 162683632 GENIE-SAGE-1-1 DEL CAGTGTGCAGAATGACAGCCAGCCCCACAGAGTCTCCTGG -
That gives the correct PARK2 Hugo symbol: https://www.genomenexus.org/variant/6:g.162683593_162683632del
I'm not sure if this error is introduced further upstream or if it's incorrect in the data provided.
We might want to add some check for this on our end:
https://github.com/genome-nexus/genome-nexus-annotation-pipeline/issues/174
But it's probably good to also check with the center that submitted the data what the intended alteration is.
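A rough sketch of the kind of check meant here, assuming a deletion record exposes the standard MAF Start_Position, End_Position, Reference_Allele and alternate allele values. The function name is hypothetical and this is not the check tracked in the linked issue. For a deletion, End_Position should equal Start_Position plus the length of the reference allele minus one:

```python
def check_deletion(start, end, ref, alt):
    """Return a list of problems found in a MAF-style deletion record."""
    problems = []
    if alt != "-":
        problems.append("a deletion should have '-' as the alternate allele")
    if end < start:
        problems.append("End_Position is before Start_Position")
    expected_end = start + len(ref) - 1
    if end != expected_end:
        problems.append(
            f"End_Position should be {expected_end} for a {len(ref)}-base deletion"
        )
    return problems


# The second record from this thread, as submitted and as corrected.
ref = "CAGTGTGCAGAATGACAGCCAGCCCCACAGAGTCTCCTGG"
submitted = check_deletion(162683593, 162683592, ref, "-")
corrected = check_deletion(162683593, 162683632, ref, "-")
```

This only tests whether start, end and reference allele agree with each other. It cannot tell which coordinate the center intended, so flagged records would still need to go back to the submitter.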
@leexgh just confirmed that all of these give a Mutation_Status of FAILED in the output, so I think we can close this.
@inodb There are two genes in the mutation data (v9.6) that are referred to by both their approved name and their previous name:
First pair: GPR124 (460 mutations) and ADGRA2 (1 mutation). Evidence that they are both the same gene: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:17849
Second pair: PARK2 (1202 mutations) and PRKN (3 mutations). Evidence: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:8607
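For reference, a small sketch of how pairs like these could be found automatically, assuming per-symbol mutation counts and a symbol-to-HGNC-ID lookup are available. The function name and the lookup dict are illustrative; the counts and HGNC IDs are the ones quoted above:

```python
from collections import defaultdict


def symbols_sharing_hgnc_id(mutation_counts, hgnc_id_by_symbol):
    """Return HGNC IDs that appear under more than one symbol, with per-symbol counts."""
    groups = defaultdict(dict)
    for symbol, count in mutation_counts.items():
        hgnc_id = hgnc_id_by_symbol.get(symbol)
        if hgnc_id is not None:  # symbols missing from the lookup are skipped
            groups[hgnc_id][symbol] = count
    return {hgnc_id: syms for hgnc_id, syms in groups.items() if len(syms) > 1}


# Counts and HGNC IDs as reported in this comment.
mutation_counts = {"GPR124": 460, "ADGRA2": 1, "PARK2": 1202, "PRKN": 3}
hgnc_id_by_symbol = {
    "GPR124": "HGNC:17849",
    "ADGRA2": "HGNC:17849",
    "PARK2": "HGNC:8607",
    "PRKN": "HGNC:8607",
}
duplicates = symbols_sharing_hgnc_id(mutation_counts, hgnc_id_by_symbol)
```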