clulab / bioresources

Data resources from the biomedical domain
Apache License 2.0

Human protein synonyms from HGNC deprioritized #55

Closed bgyori closed 3 years ago

bgyori commented 3 years ago

I diagnosed a grounding issue that de-prioritizes the groundings for some human genes/proteins when a synonym is available only from HGNC and not from UniProt. One example is ALK6, which HGNC lists as a synonym for BMPR1B but UniProt does not, while UniProt does list it as a synonym for some non-human proteins, e.g. https://www.uniprot.org/uniprot/P36898. Since matches to UniProt are prioritized over matches to HGNC, only the non-human matches for ALK6 are surfaced, and the match to the human protein is lost.
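To make the failure mode concrete, here is a minimal sketch of a priority-ordered lookup. It is not the actual grounding code: the dictionaries, the `ground` function, and the rule that the first resource with any hit wins are all assumptions made for illustration.

```python
# Hypothetical, simplified model of priority-ordered grounding.
# Each resource maps a lowercased synonym to (source, identifier, species) tuples.
# The species label "non-human" follows the issue text; it is not looked up.
UNIPROT = {"alk6": [("uniprot", "P36898", "non-human")]}
HGNC = {"alk6": [("hgnc", "BMPR1B", "human")]}


def ground(text, resources):
    """Return the matches from the first resource, in priority order, that has any."""
    key = text.lower()
    for resource in resources:
        matches = resource.get(key)
        if matches:
            # Assumed behavior: lower-priority resources are never consulted.
            return matches
    return []


# UniProt is listed first, so the HGNC match to human BMPR1B is never returned.
result = ground("ALK6", [UNIPROT, HGNC])
```

Under these assumptions, `result` holds only the non-human UniProt entry, which mirrors the lost human match described above.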

To solve this, I am working on merging the HGNC-derived human gene/protein synonyms into uniprot-proteins.tsv.gz so that they are pooled into any match derived from UniProt. This means that the HGNC-specific resource files and code can be removed here and in Reach. I will follow up with corresponding PRs.
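A rough sketch of what such a merge could look like, working on in-memory rows rather than the real file. The three-column row layout (synonym, accession, species), the `merge_synonyms` helper, and the placeholder accession `HUMAN_BMPR1B_ACC` are hypothetical; the actual columns of uniprot-proteins.tsv.gz are not described in this thread.

```python
def merge_synonyms(uniprot_rows, hgnc_rows):
    """Append HGNC-derived rows that UniProt lacks, keeping the original order."""
    # Compare synonyms case-insensitively so "ALK6" and "alk6" count as the same entry.
    seen = {(synonym.lower(), accession) for synonym, accession, _ in uniprot_rows}
    merged = list(uniprot_rows)
    for synonym, accession, species in hgnc_rows:
        key = (synonym.lower(), accession)
        if key not in seen:
            seen.add(key)
            merged.append((synonym, accession, species))
    return merged


# HUMAN_BMPR1B_ACC is a placeholder, not a real UniProt accession.
uniprot_rows = [
    ("ALK6", "P36898", "non-human"),
    ("BMPR1B", "HUMAN_BMPR1B_ACC", "human"),
]
hgnc_rows = [
    ("ALK6", "HUMAN_BMPR1B_ACC", "human"),
    ("BMPR1B", "HUMAN_BMPR1B_ACC", "human"),
]

merged = merge_synonyms(uniprot_rows, hgnc_rows)
alk6_rows = [row for row in merged if row[0].lower() == "alk6"]
```

With the rows pooled like this, a lookup for ALK6 in the single merged resource would see both the non-human and the human entry, so no separate HGNC resource would be needed.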

MihaiSurdeanu commented 3 years ago

Nice catch. Thank you, @bgyori!

I agree this is the simplest solution.