RobokopU24 / NewSourceProposals

New Knowledge Providers (KPs) for the Data Management Oversight Group (DMOG) to review

Add in Rare Disease information from MONDO #24

Open DnlRKorn opened 11 months ago

DnlRKorn commented 11 months ago

There was a blog post from Monarch detailing how to parse the MONDO data to get the "rare" disease classification: https://mondo.monarchinitiative.org/pages/analysis/

Jim Balhoff rewrote this blog post as a series of SPARQL queries for Ubergraph:

  1. https://api.triplydb.com/s/TZb0JA_r2
  2. https://api.triplydb.com/s/UJxL_XfoR
  3. https://api.triplydb.com/s/-aEWFIHZH
  4. https://api.triplydb.com/s/DdeJYT_YL

We need to figure out a smart way to run SPARQL queries when running ORION and to integrate this info into the final graph.

There is also an open question of where this should live: as part of MONDO parsing or elsewhere.
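One way the query-running step could look is sketched below. This is only a rough sketch, not ORION code: the endpoint URL is assumed to be the public Ubergraph SPARQL endpoint, and `run_query`, `parse_sparql_json`, and `iri_to_curie` are hypothetical helper names. It sends a query using the standard SPARQL protocol and reads the standard SPARQL JSON results format, so any of the four queries above could be passed in as the `query` string.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Ubergraph SPARQL endpoint. Verify before relying on it.
UBERGRAPH_ENDPOINT = "https://ubergraph.apps.renci.org/sparql"


def iri_to_curie(iri):
    """Turn an OBO PURL like .../obo/MONDO_0021136 into MONDO:0021136.

    IRIs whose last path segment has no underscore are returned unchanged.
    """
    local = iri.rsplit("/", 1)[-1]
    prefix, sep, local_id = local.partition("_")
    return f"{prefix}:{local_id}" if sep else iri


def parse_sparql_json(payload, variable):
    """Collect the values bound to one variable in a SPARQL JSON results document."""
    bindings = payload["results"]["bindings"]
    return [iri_to_curie(row[variable]["value"]) for row in bindings if variable in row]


def run_query(query, variable, endpoint=UBERGRAPH_ENDPOINT):
    """POST a SPARQL query as form data and return CURIEs bound to `variable`."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return parse_sparql_json(json.load(response), variable)
```

Keeping the parsing separate from the network call means the conversion logic can be tested without hitting the live endpoint.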

eKathleenCarter commented 2 months ago

Review at DMOG 4/17

EvanDietzMorris commented 2 months ago

We don't currently have a real MONDO parser; everything we get from MONDO comes from Ubergraph.

The only thing we do get from MONDO is node properties. The idea was to include exactly the kind of thing Dan wanted, but the parser really needs to be completely refactored because:

  1. The MONDOProps parser was written before we completely refactored the other Ubergraph parsers. It's very inefficient, and it doesn't have real Ubergraph versioning, which is available in the real Ubergraph parser/utils; it just uses a modify date.

  2. We need to review the syntax of these properties, because right now it's a made-up format where we take the MONDO designation and turn it into something like this: {"MONDO_SUPERCLASS_rare": True}.

  3. From a quick glance it looks like we only get one "rare" disease (MONDO:0021136). This seems odd and might be due to a bug in the current parsing technique.

We do actually have other SPARQL queries in the Ubergraph tools, and we could probably use the queries mentioned above to do this much more efficiently. We should also consider whether the MONDO properties should just be a part of Ubergraph (but this means you couldn't easily apply them to other graphs) and whether we want other data from MONDO.

eKathleenCarter commented 2 months ago

  1. Improve the MONDO parser.
  2. Talk to Jim.
  3. The "rare" disease designation is in a custom format, not a Biolink term.

EvanDietzMorris commented 2 months ago

More specifically we should:

  1. Refactor the MONDO parser completely to use the Ubergraph utils associated with the Ubergraph parsers, and probably use SPARQL queries to extract the MONDO properties instead of iterating through everything. Using SPARQL queries on Jim's live deployment is pretty different from downloading and using our own instance, which is typically the ORION way, but it avoids massive memory consumption. Currently we do attach these properties onto nodes and all of their ancestors as designated in Ubergraph, so we need to make sure that still works.
  2. Confirm with Jim that all of the MONDO disease classifications we want are actually in Ubergraph. Otherwise we might need to go to MONDO directly anyway.
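
To make the property-attachment requirement concrete, here is a minimal sketch of one possible reading of it. It is not the existing MONDOProps code. It assumes a node should get a flag whenever a designation class appears among its ancestors, that ancestor sets have already been pulled from Ubergraph, and that MONDO:0021136 maps to the name "rare" as the property example in this thread suggests. `DESIGNATIONS` and `superclass_properties` are hypothetical names, and the `EX:` node IDs are placeholders.

```python
# Assumption: designation class CURIE -> short name used in the property key.
# Only MONDO:0021136 ("rare") is mentioned in this thread; others would be added here.
DESIGNATIONS = {"MONDO:0021136": "rare"}


def superclass_properties(ancestors_by_node, designations=DESIGNATIONS):
    """Build {node: {"MONDO_SUPERCLASS_<name>": True}} from per-node ancestor sets.

    Nodes with no designation class among their ancestors are left out.
    """
    properties = {}
    for node, ancestors in ancestors_by_node.items():
        flags = {
            f"MONDO_SUPERCLASS_{name}": True
            for curie, name in designations.items()
            if curie in ancestors
        }
        if flags:
            properties[node] = flags
    return properties


# Placeholder IDs, not real MONDO terms.
example = {
    "EX:1": {"MONDO:0021136", "EX:0"},
    "EX:2": {"EX:0"},
}
print(superclass_properties(example))
```

If the property syntax changes after review, only the key-building expression would need to change; the ancestor lookup stays the same.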