RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Create a meta KG for ARAX that is a union of its KPs' meta KGs #1879

Open amykglen opened 2 years ago

amykglen commented 2 years ago

TRAPI 1.3 recommends but does not require that we provide a meta_knowledge_graph for ARAX that is a union of all of its KPs' meta_knowledge_graphs: https://github.com/NCATSTranslator/ReasonerAPI/commit/e2ed87aa4f02dac55dcbd8eac7e190b8c188fbdd

maybe this union meta KG should be automatically generated and periodically updated by the KPSelector module, since that's already where we pull all KPs' meta KGs and cache info from them.

we're also supposed to indicate in the meta KG for which meta edges ARAX can answer knowledge_type: 'inferred' queries: https://github.com/NCATSTranslator/ReasonerAPI/pull/333/files

amykglen commented 2 years ago

leaving some notes for @rcpeene here since I'm going to be out for a bit (back Aug. 31) - things to do/figure out for this issue:

edeutsch commented 2 years ago

since our KPs now differ depending on maturity level, does each maturity/branch need its own meta KG? (@edeutsch can likely answer this)

Yes, I suppose so. But I was sort of under the impression that each instance/endpoint has its own meta KG? When I restart an ARAX endpoint and do a test query, it seems to go through a process of checking the meta KGs of all its KPs. I was sort of thinking that the output of that process would/could be a merged meta KG?

So perhaps the meta KG should be computed dynamically by each endpoint as it manages its KPs as it already does? And thus maturity/branch is not really relevant except insofar as different endpoints will access different KP endpoints because of that?

I may not be understanding the situation well.

rcpeene commented 2 years ago

Yes, I think that is correct @edeutsch. Unless we decide otherwise, the meta KGs that are examined and used will be based on the KP endpoints that our instance decides to access (which will be constrained on the basis of version and maturity). I believe KPSelector is the class that checks each KP's meta KG in the way you're referring to. I intend to put the logic that creates an ARAX meta-KG there. The result would be that each ARAX instance has a different meta-kg.json stored somewhere that we decide.

edeutsch commented 2 years ago

this may lead to different endpoints that should have the same meta kg having somewhat different ones. But such are the risks of such a distributed system. I think it's the best way. it would be good to document the caching strategy so we're all aware of it

rcpeene commented 2 years ago

I have the logic implemented which fetches each available KP's meta-KG and makes a large super-meta-KG for ARAX, and stores it in a file meta_kg.json, in ARAX/ARAXQuery/Expand. This logic merges meta-nodes that have the same node key by producing a set union of their id_prefixes property, and creates a concatenated list of their respective attributes lists. For meta-edges, it combines ones that have the same subject--object--predicate triple by creating a concatenated list of their respective attributes lists. There is not yet logic to handle the knowledge_types property of meta-edges.
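For readers following along, the merge rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the function name `merge_meta_kgs` is hypothetical, and it assumes each KP's meta KG is a plain dict with a `nodes` mapping (keyed by category) and an `edges` list, as in the TRAPI meta knowledge graph layout.

```python
from typing import Any, Dict, List, Tuple

MetaKG = Dict[str, Any]


def merge_meta_kgs(kp_meta_kgs: List[MetaKG]) -> MetaKG:
    """Union several meta KGs (hypothetical helper, not ARAX's actual API)."""
    merged_nodes: Dict[str, Dict[str, Any]] = {}
    merged_edges: Dict[Tuple[str, str, str], Dict[str, Any]] = {}

    for meta_kg in kp_meta_kgs:
        # Meta-nodes with the same key: set union of id_prefixes,
        # concatenation of attributes lists.
        for category, meta_node in (meta_kg.get("nodes") or {}).items():
            target = merged_nodes.setdefault(
                category, {"id_prefixes": set(), "attributes": []}
            )
            target["id_prefixes"].update(meta_node.get("id_prefixes") or [])
            target["attributes"].extend(meta_node.get("attributes") or [])

        # Meta-edges with the same subject/predicate/object triple:
        # concatenation of attributes lists.
        for meta_edge in meta_kg.get("edges") or []:
            triple = (meta_edge["subject"], meta_edge["predicate"], meta_edge["object"])
            target = merged_edges.setdefault(
                triple,
                {
                    "subject": triple[0],
                    "predicate": triple[1],
                    "object": triple[2],
                    "attributes": [],
                },
            )
            target["attributes"].extend(meta_edge.get("attributes") or [])

    # Sets are not JSON-serializable, so convert them to sorted lists.
    for meta_node in merged_nodes.values():
        meta_node["id_prefixes"] = sorted(meta_node["id_prefixes"])

    return {"nodes": merged_nodes, "edges": list(merged_edges.values())}
```

Keying the edges by triple in a dict keeps the merge linear in the total number of meta-edges, which matters when one KP alone contributes tens of thousands of them.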

It's worth noting that the resulting ARAX meta-KG is very large; 76 meta-nodes and 48,526 meta-edges.

rcpeene commented 2 years ago

To make explicit the caching mechanism that this system currently uses: it piggybacks off the mechanism that KPSelector uses to load the "meta-map". A new meta-kg.json is made and written any time the meta-map is refreshed. This happens if the meta-map hasn't been refreshed for more than 24 hours, or if the existing meta-map doesn't contain some KPs that were found from Smart API at the onset of the query. In other words, the meta-kg is recreated at least every 24 hours, and also any time a new valid KP is found in the Smart API registry.
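As a rough illustration of that refresh rule (a sketch only; `needs_refresh` and its parameters are hypothetical names, not KPSelector's actual interface), the decision boils down to a staleness check plus a set difference:

```python
import time
from typing import Iterable, Optional

REFRESH_INTERVAL_SECONDS = 24 * 60 * 60  # the 24-hour window described above


def needs_refresh(
    last_refresh_time: Optional[float],
    cached_kps: Iterable[str],
    registry_kps: Iterable[str],
    now: Optional[float] = None,
) -> bool:
    """Return True if the meta-map (and thus the meta KG) should be rebuilt."""
    if last_refresh_time is None:
        # Nothing has been cached yet.
        return True
    current_time = time.time() if now is None else now
    is_stale = (current_time - last_refresh_time) > REFRESH_INTERVAL_SECONDS
    # Any KP listed in the registry but absent from the cache forces a rebuild.
    has_new_kps = bool(set(registry_kps) - set(cached_kps))
    return is_stale or has_new_kps
```

The `now` parameter exists only so the rule can be exercised in tests without waiting a day.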

Two additional notes:

edeutsch commented 2 years ago

thanks, this is a good explanation. What is the performance impact of this? i.e. how long does it take to do this rebuilding?

edeutsch commented 2 years ago

It's worth noting that the resulting ARAX meta-KG is very large; 76 meta-nodes and 48,526 meta-edges.

What is the size of RTX-KG2 metaKG alone?

rcpeene commented 2 years ago

RTX-KG2's meta-KG contains 57 meta-nodes and 45,2813 meta-edges, as of my last check, making up a significant majority of ARAX's meta-KG. As for time performance, the duration of the rebuilding process is highly variable, since it depends on many requests to KPs. Building the meta-KG seems to take about 10 seconds, and refreshing the meta-map takes a similar 10 seconds.

rcpeene commented 2 years ago

I've added a bit more logic to remove 'null' properties that don't need to exist in the Meta-KG, and to properly assign values to the knowledge_types property of meta-edges. After my meeting with @dkoslicki, my understanding is that most meta-edges in ARAX should have only 'lookup' (the default) as their knowledge_types value. The exceptions are meta-edges that have a subject--predicate--object triple of one of the following forms: ChemicalMixture--ameliorates--DiseaseOrPhenotypicTrait, or DiseaseOrPhenotypicTrait--is_ameliorated_by--ChemicalMixture. This includes subject, object, and predicate values that are descended from the ones I just mentioned. Meta-edges with these triples have both 'lookup' and 'inferred' as their knowledge_types. Unless someone else has feedback, I think the Meta-KG is complete with this implementation.
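A small sketch of how that assignment could look, for illustration only. `InferredRule` and `assign_knowledge_types` are hypothetical names, and the sets of descendant categories and predicates are assumed to be supplied by the caller (for example, looked up from the Biolink Model) rather than computed here:

```python
from typing import Any, Dict, FrozenSet, List, NamedTuple


class InferredRule(NamedTuple):
    """Triples for which 'inferred' queries are answerable (hypothetical type).

    Each field holds a category or predicate plus all of its descendants.
    """
    subjects: FrozenSet[str]
    predicates: FrozenSet[str]
    objects: FrozenSet[str]


def assign_knowledge_types(
    meta_edges: List[Dict[str, Any]], rules: List[InferredRule]
) -> List[Dict[str, Any]]:
    """Set knowledge_types on each meta-edge in place and return the list."""
    for edge in meta_edges:
        is_inferable = any(
            edge["subject"] in rule.subjects
            and edge["predicate"] in rule.predicates
            and edge["object"] in rule.objects
            for rule in rules
        )
        # 'lookup' is the default; 'inferred' is added only for matching triples.
        edge["knowledge_types"] = ["lookup", "inferred"] if is_inferable else ["lookup"]
    return meta_edges
```

Usage would be along these lines, with the names taken from the comment above and the descendant sets left as placeholders:

```python
rules = [
    InferredRule(
        subjects=frozenset({"ChemicalMixture"}),            # plus descendants
        predicates=frozenset({"ameliorates"}),              # plus descendants
        objects=frozenset({"DiseaseOrPhenotypicTrait"}),    # plus descendants
    ),
    InferredRule(
        subjects=frozenset({"DiseaseOrPhenotypicTrait"}),
        predicates=frozenset({"is_ameliorated_by"}),
        objects=frozenset({"ChemicalMixture"}),
    ),
]
```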

rcpeene commented 2 years ago

Further discussion: it looks like the Meta-KG creation process takes much longer than I originally estimated. It takes >30 seconds in most of the tests I ran. After talking to @amykglen, we decided to change the caching mechanism so that the Meta-KG only gets updated once every 24 hours.

rcpeene commented 2 years ago

Code has been tested and pushed. A PR has been issued.

saramsey commented 2 years ago

Should this be marked high priority? Just wondering what else might be gating on this issue.

amykglen commented 2 years ago

it's just waiting on me to review/merge. though it's not high priority, since it's an optional feature. I'll merge it soon

amykglen commented 2 years ago

one thing I think we should do before merging issue1879 is combine the two requests to get KPs' meta KGs. it looks like right now there's one request to get info for Expand's 'meta map' and another to build this new meta KG, even though the building of those two things is paired; I think we should combine these, since it's somewhat time-consuming to get all KPs' meta KGs.
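to illustrate the idea (a sketch under assumptions, not the actual Expand code): fetch each KP's meta KG once, then derive both structures from the same responses. `fetch_once_and_build` and the injected `fetch_meta_kg` callable are hypothetical, and the meta map shape used here (KP name, then subject category, then object category, then a set of predicates) is only an assumed simplification.

```python
from typing import Any, Callable, Dict, Iterable, Set, Tuple

MetaKG = Dict[str, Any]
MetaMap = Dict[str, Dict[str, Dict[str, Set[str]]]]


def fetch_once_and_build(
    kp_names: Iterable[str],
    fetch_meta_kg: Callable[[str], MetaKG],
) -> Tuple[MetaMap, Dict[str, MetaKG]]:
    """Issue one meta KG request per KP and reuse the responses for both outputs."""
    # Single round of requests; everything below works from these cached responses.
    responses = {kp_name: fetch_meta_kg(kp_name) for kp_name in kp_names}

    meta_map: MetaMap = {}
    for kp_name, meta_kg in responses.items():
        kp_entry = meta_map.setdefault(kp_name, {})
        for edge in meta_kg.get("edges") or []:
            by_object = kp_entry.setdefault(edge["subject"], {})
            by_object.setdefault(edge["object"], set()).add(edge["predicate"])

    # The raw responses are returned so the union meta KG can be built from
    # them without contacting the KPs a second time.
    return meta_map, responses
```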