callahantiff / PheKnowLator

PheKnowLator: Heterogeneous Biomedical Knowledge Graphs and Benchmarks Constructed Under Alternative Semantic Models
https://github.com/callahantiff/PheKnowLator/wiki
Apache License 2.0

Handling MESH-CHEBI Mappings #77

Closed callahantiff closed 3 years ago

callahantiff commented 3 years ago

TASK

Task Type: CODEBASE

Decide how to handle MeSH to ChEBI mappings. Currently there is a GitHub Gist (ncbo_rest_api.py) that wraps calls to the BioPortal API in a script that can be run as part of the KG CI/CD build.

Problems: The ncbo_rest_api.py script runs fine, but it's brittle given its reliance on the BioPortal API, which is notoriously unstable. A potential solution (for now or in the future) could be to implement the LOOM algorithm, which is what creates the mappings underlying the API.

TODO

callahantiff commented 3 years ago

This work impacts issue #72 because of its reference in the associated Jupyter Notebook.

callahantiff commented 3 years ago

@bill-baumgartner - this is complete (will be integrated with PR #81). I followed the details for the LOOM algorithm described on the BioPortal Wiki. It's very simple, just a few methods. There is nothing fancy: it is essentially accomplished by preprocessing the input MeSH and ChEBI data and performing an inner join to find overlapping concepts.
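For anyone skimming, here is a rough sketch of the "preprocess, then inner join" idea. It is not the code from the PR. The normalization rule (lowercase, drop every non-alphanumeric character), the function names, and the toy rows are all assumptions made up for illustration; the real preprocessing may differ.

```python
import re
from collections import defaultdict


def normalize(label: str) -> str:
    """Lowercase a label and drop every non-alphanumeric character.

    Assumption: this rule is illustrative only, not the documented LOOM rule.
    """
    return re.sub(r"[^a-z0-9]", "", label.lower())


def index_by_label(records):
    """Map each normalized label to the set of identifiers that carry it."""
    index = defaultdict(set)
    for identifier, label in records:
        key = normalize(label)
        if key:  # skip labels that normalize to the empty string
            index[key].add(identifier)
    return index


def lexical_inner_join(mesh_records, chebi_records):
    """Return sorted (mesh_id, chebi_id) pairs whose normalized labels match."""
    mesh_index = index_by_label(mesh_records)
    chebi_index = index_by_label(chebi_records)
    # Inner join: keep only the labels present in both resources.
    shared = mesh_index.keys() & chebi_index.keys()
    return sorted(
        (mesh_id, chebi_id)
        for key in shared
        for mesh_id in mesh_index[key]
        for chebi_id in chebi_index[key]
    )


# Toy rows invented for illustration; these are not real MeSH or ChEBI records.
mesh_rows = [
    ("MESH:X1", "Acetyl-salicylic Acid"),
    ("MESH:X2", "Water"),
    ("MESH:X3", "Caffeine"),
]
chebi_rows = [
    ("CHEBI:Y1", "acetylsalicylic acid"),
    ("CHEBI:Y2", "water"),
    ("CHEBI:Y3", "ethanol"),
]

mappings = lexical_inner_join(mesh_rows, chebi_rows)
print(mappings)
```

Only the two labels that appear in both toy lists survive the join. Rows with no counterpart on the other side are dropped, which is what makes it an inner join.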

In a Nutshell: We download the mesh2021.nt data file directly from MeSH and the Flat_file_tab_delimited/names.tsv.gz file directly from ChEBI. Using these files, we have recapitulated the LOOM algorithm implemented by BioPortal when creating mappings between these resources. The procedure is relatively straightforward and utilizes the following information from each resource:

Details and a description are in the notebook here, under ChEBI Identifiers, as well as in the scripted version of this notebook (lines 496-628, here).