Open mcannon068nw opened 1 month ago
Thank you for writing this out so eloquently! Yes, the problem is classifying the relationship between concept and alias. This relationship can be extracted from the symbol expansion, which is why we need a way to pull out the expansions programmatically.
One problem for Aim 1 in Anastasia's dissertation is that once a gene-alias collision pair has been identified, we have to determine what the colliding symbol actually represents in the context of its parent symbol. For example, CAP is listed as an alias for BRD4, but the same symbol is also listed as an alias for many other genes. In the context of BRD4, CAP refers to 'chromosome associated protein'; what CAP stands for differs from one parent gene symbol to another. While this can be manually curated for a small set of genes, there are over 100,000 gene-alias pairs to consider, so a programmatic approach will be needed.
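To make the idea concrete, here is a rough sketch of how collisions could be grouped once the expansions are available. It assumes each record is a `(parent_symbol, alias, expansion)` tuple; that layout, the function name `find_alias_collisions`, and the `GENE_A` / `GENE_B` rows are hypothetical placeholders, not real annotations. Only the BRD4/CAP row comes from the example above.

```python
from collections import defaultdict

# Hypothetical input: (parent_symbol, alias, expansion) tuples.
# Only the BRD4 row is taken from the discussion; the others are placeholders.
records = [
    ("BRD4", "CAP", "chromosome associated protein"),
    ("GENE_A", "CAP", "placeholder expansion A"),
    ("GENE_B", "XYZ", "placeholder expansion B"),
]


def find_alias_collisions(records):
    """Return {alias: {parent_symbol: expansion}} for aliases shared by 2+ parents."""
    by_alias = defaultdict(dict)
    for parent, alias, expansion in records:
        # Normalize case so "cap" and "CAP" are treated as the same symbol.
        by_alias[alias.upper()][parent] = expansion
    # Keep only aliases that appear under more than one parent symbol.
    return {a: parents for a, parents in by_alias.items() if len(parents) > 1}


collisions = find_alias_collisions(records)
for alias, parents in collisions.items():
    for parent, expansion in parents.items():
        print(f"{alias} (in {parent}): {expansion}")
```

This only covers the grouping step. Pulling the expansions out of the source data is the part that still needs a programmatic solution.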
Additionally, a separate but related problem will be to programmatically identify the type of collision(?).