cancervariants / gene-harmony-analysis


Automate identification of expanded alias name #15

Open mcannon068nw opened 1 month ago

mcannon068nw commented 1 month ago

One problem for Aim 1 in Anastasia's dissertation is that once a gene-alias collision pair has been identified, we still have to determine what the colliding symbol actually represents in the context of its parent symbol. For example, CAP is listed as an alias for BRD4, but the same symbol is also listed as an alias for many other genes. In the context of BRD4, CAP refers to 'chromosome associated protein'; what CAP stands for differs from one parent gene symbol to the next. While this can be curated manually for a small set of genes, there are over 100,000 gene-alias pairs to consider, so a programmatic approach will be needed.
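
To make the notion of a collision concrete, here is a minimal sketch of how colliding aliases could be grouped by parent symbol. The function name `find_alias_collisions`, the input shape (a dict from parent symbol to its alias list), and the placeholder genes `GENE_A` and `GENE_B` are assumptions for illustration only, not the project's actual data model.

```python
from collections import defaultdict


def find_alias_collisions(gene_aliases):
    """Return aliases that are listed under two or more parent symbols.

    `gene_aliases` maps a parent gene symbol to a list of its aliases
    (hypothetical input shape). Aliases are compared case-insensitively.
    """
    parents_by_alias = defaultdict(set)
    for parent, aliases in gene_aliases.items():
        for alias in aliases:
            parents_by_alias[alias.upper()].add(parent)
    # Keep only aliases shared by more than one parent symbol.
    return {
        alias: sorted(parents)
        for alias, parents in parents_by_alias.items()
        if len(parents) > 1
    }


# Toy records: only the BRD4/CAP pair comes from this issue;
# GENE_A and GENE_B are made-up placeholders.
toy_records = {
    "BRD4": ["CAP"],
    "GENE_A": ["cap"],
    "GENE_B": ["ALIAS_B"],
}
collisions = find_alias_collisions(toy_records)
```

Each key in `collisions` is a symbol whose meaning still has to be resolved separately for every parent listed under it.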

A separate but related problem will be to programmatically identify the type of collision (if that is the right term for it).

anastasiabratulin commented 1 month ago

Thank you for writing this out so eloquently! Yes, the problem is classifying the relationship between concept and alias. This relationship can be extracted from the symbol expansion, which is why we need a way to pull out the expansions programmatically.
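
One possible starting point, sketched under assumptions rather than as a settled method: if some free-text description is available for the parent gene, an initialism heuristic can look for a run of consecutive words whose first letters spell the alias. The function name `find_expansion` and the idea that such description text exists per gene are assumptions; aliases that are not plain initialisms would need a different strategy.

```python
import re


def find_expansion(alias, text):
    """Return the first run of consecutive words in `text` whose
    initials spell `alias` (case-insensitive), or None if there is none.

    Hypothetical helper: `text` is assumed to be a free-text
    description associated with the parent gene.
    """
    # Split on anything that is not a letter or digit, so hyphenated
    # phrases such as "chromosome-associated" yield separate words.
    words = re.findall(r"[A-Za-z0-9]+", text)
    target = alias.lower()
    n = len(target)
    for start in range(len(words) - n + 1):
        window = words[start:start + n]
        initials = "".join(word[0].lower() for word in window)
        if initials == target:
            return " ".join(window)
    return None


# The BRD4 example from this issue.
expansion = find_expansion("CAP", "chromosome-associated protein")
# A description with no matching run of initials returns None.
no_match = find_expansion("CAP", "bromodomain containing 4")
```

This only recovers the expansion text; classifying the relationship between concept and alias would be a separate step on top of it.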