Rothamsted / knetbuilder

KnetBuilder data integration platform for building knowledge graphs. Previously known as ondex.
https://knetminer.com
MIT License

Specification for an offline Graph Traverser and Semantic Motif Initialiser #55

Closed marco-brandizi closed 1 year ago

marco-brandizi commented 2 years ago

The Semantic Motif Initialiser

Summary

Introduction

The KnetMiner Gene Explorer is based on knowledge graphs, that is, datasets that are created and managed by means of our KnetBuilder framework, also named Ondex. This framework has two major component types:

In addition to loading a (configured) OXL, the web app does much additional initialisation work, which mostly consists of:

As said above, both the Lucene indexing and the traverser are currently invoked against a given dataset when the KnetMiner web application starts, i.e., when its Docker container starts, which triggers the start of its Tomcat server, which in turn starts the API/WS .war application (see the KnetMiner wiki for details).

After the traversal stage, the traverser output is saved on disk, and the web application avoids redoing the whole operation at each restart if these output files are found in a configured location. This is possible because the result of the traversal operation is always the same for a given dataset/graph, and the web application uses these data in read-only mode. Similarly, the Lucene indexing is skipped if the corresponding Lucene directory is found under a configured path.
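The skip-if-already-present behaviour described above can be sketched as follows. This is only an illustration of the idea, not the KnetMiner code: the class name, the method names, the file name and the use of a plain string as "traverser output" are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names come from KnetMiner.
public class TraversalCacheSketch
{
  /**
   * Returns the saved traversal output if the file exists, else runs the
   * traversal, saves its output and returns it.
   */
  public static String loadOrCompute ( Path cacheFile, Supplier<String> traversal ) throws IOException
  {
    // The output is deterministic for a given graph, so a saved copy can be reused
    if ( Files.exists ( cacheFile ) ) return Files.readString ( cacheFile );

    String result = traversal.get ();
    Files.createDirectories ( cacheFile.toAbsolutePath ().getParent () );
    Files.writeString ( cacheFile, result );
    return result;
  }

  /** Indexing is needed only if the index directory isn't there yet. */
  public static boolean needsIndexing ( Path luceneDir )
  {
    return !Files.isDirectory ( luceneDir );
  }

  public static void main ( String[] args ) throws IOException
  {
    Path dataDir = Files.createTempDirectory ( "knet-sketch" );
    Path cacheFile = dataDir.resolve ( "traverser-output.txt" );

    // Stand-in for the expensive traversal, counting how many times it runs
    int[] runs = { 0 };
    Supplier<String> traversal = () -> { runs [ 0 ]++; return "gene1 -> concept42"; };

    loadOrCompute ( cacheFile, traversal ); // computes and saves
    loadOrCompute ( cacheFile, traversal ); // reads the saved file

    System.out.println ( "Traversal runs: " + runs [ 0 ] );
    System.out.println ( "Needs indexing: " + needsIndexing ( dataDir.resolve ( "index" ) ) );
  }
}
```

Moving the initialisation offline means that the "compute and save" branch would run once, in the dataset-building pipeline, and the web application would only ever take the "read" branch.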

Despite the latter optimisations, the web application spends a lot of time in this initialisation stage. This could be moved offline, so that we would be able to create the traversal data once and for all, during the creation of the dataset by means of the KnetBuilder workflow system (aka Ondex Mini).

So, the purpose of this document is to move the traversal-invoking code that is currently inside the KnetMiner web service to KnetBuilder, by developing proper wrappers to invoke it from the KnetBuilder framework (details below).

Some details about the traversing

This is to give a better understanding of the context for the present task; it is not strictly needed here.
Technically, we have a generic GraphTraverser interface and a (still) default implementation, which is based on the state machine model. Recently, we have started migrating to a Cypher-based implementation, which relies both on in-memory data encoded via Ondex components and on the same data stored in a Neo4j database. Which traverser flavour to use is decided via the KnetMiner configuration.

Requirements

We need a component based on the architecture of many Ondex plug-ins, that is:

For the moment, do not create a direct dependency on the Cypher traverser artifact. Instead, the component should allow for the choice of this specific traverser (or the default) by means of a string representing its FQN. We'll look at how to organise this dependency later (e.g., optional download in the workflow binary).
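A minimal sketch of selecting an implementation by its FQN, with no compile-time dependency on it, is given below. The nested GraphTraverser interface and DefaultTraverser class are hypothetical stand-ins defined only to make the example self-contained; the real interface and its implementations have different members.

```java
// Hypothetical sketch: the types below are stand-ins, not the Ondex/KnetMiner ones.
public class TraverserFactorySketch
{
  /** Stand-in for the generic traverser interface. */
  public interface GraphTraverser
  {
    String describe ();
  }

  /** Stand-in for the default implementation. */
  public static class DefaultTraverser implements GraphTraverser
  {
    @Override
    public String describe () { return "default"; }
  }

  /**
   * Instantiates a traverser from its fully qualified class name, using
   * reflection and the class's no-argument constructor.
   */
  public static GraphTraverser create ( String traverserFqn )
  {
    try
    {
      Class<?> cls = Class.forName ( traverserFqn );
      // asSubclass() fails early if the class isn't a GraphTraverser
      return cls.asSubclass ( GraphTraverser.class )
        .getDeclaredConstructor ()
        .newInstance ();
    }
    catch ( ReflectiveOperationException | ClassCastException ex )
    {
      throw new IllegalArgumentException (
        "Cannot instantiate the traverser '" + traverserFqn + "'", ex
      );
    }
  }

  public static void main ( String[] args )
  {
    // In the real component, this string would come from a plug-in/CLI option
    String fqn = DefaultTraverser.class.getName ();
    System.out.println ( create ( fqn ).describe () );
  }
}
```

With this approach, the Cypher traverser only needs to be on the classpath at runtime, when its FQN is actually requested.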

A good (and recent) reference for the architecture outlined above is the graph descriptor component. In particular, note the details:

The KnetMiner code

What the new component has to do can be seen in (and largely copy-pasted from) the current KnetMiner code. Namely:

The initialiser options

The new component needs the same options/parameters that KnetMiner uses in the code described in the previous section. So, our new SemanticMotifInitializer will have this:

marco-brandizi commented 2 years ago

Skeleton for the core component (first step)

I've added this new module, which contains a skeleton of what we have to do and how we can arrange it.

Start from the core component. This can be completed by copy-pasting much code from KnetMiner, as explained above. Once a first version is ready, try it with the existing test and add assertions to the latter to check that things work.

Next steps

We'll work on these later, by first adding skeleton code to the project above and then completing them.

jojicunnunni commented 2 years ago

Remove the Lucene indexing from Knetminer

OndexServiceProvider.java

Inside the public void initData ( String configXmlPath )

Comment out these 2 lines. They have been moved to KnetMinerInitializer.java, in the knetminer-initializer module under the Knetbuilder project:
//this.searchService.indexOndexGraph ();
//this.semanticMotifDataService.initSemanticMotifData ();

@marco-brandizi Please review it and let me know if anything more needs to be handled here.

marco-brandizi commented 2 years ago

@jojicunnunni Yes, this is part of what needs to be replaced. ~exportGraphStats(), which you asked me about today, is to be left on the Knetminer side~ (good catch, thanks) see below, we need it on KnetminerInitializer.

See a few notes in the following about the main things to do. Please see the TODOs and try to have a look on your own to complete them.

Notes on removing the Knetminer code that was migrated to the initialiser component

Having an initialiser reference

OndexServiceProvider.initData ()

Bridging methods

marco-brandizi commented 2 years ago

@jojicunnunni, I've reviewed exportGraphStats(): we need to move it into the initialiser, see above.

marco-brandizi commented 2 years ago

@jojicunnunni See the code, I've added various comments (with prefix TODO:newConfig) regarding how things should be aligned to the new configuration system. I think I touched all major points, but I might have missed some (eg, I didn't go through the command line package, but the changes needed there are quite clearly implied by the plugin it invokes).

Mostly, it's about replacing the old way of getting the options, based on a Java Map, with the new way, based on the KnetminerConfig object. Be aware that the traverser used to take its pertinent options directly from the old Map options (as a subset of them); now it needs to receive KnetminerConfig.getGraphTraverserOptions(), which is still a Map. The Knetminer code base already has the new version of things like this.
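To illustrate the kind of change involved, here is a rough sketch. Only the names KnetminerConfig and getGraphTraverserOptions() come from the discussion above; the classes below are simplified stand-ins written for this example, and the Traverser class, its setOptions() method and the option key are hypothetical.

```java
import java.util.Map;

// Hypothetical sketch: simplified stand-ins, not the real Knetminer classes.
public class ConfigMigrationSketch
{
  /** Stand-in for the new configuration object. */
  public static class KnetminerConfig
  {
    private final Map<String, Object> graphTraverserOptions;

    public KnetminerConfig ( Map<String, Object> graphTraverserOptions )
    {
      this.graphTraverserOptions = Map.copyOf ( graphTraverserOptions );
    }

    /** The traverser-specific options, still exposed as a Map. */
    public Map<String, Object> getGraphTraverserOptions ()
    {
      return graphTraverserOptions;
    }
  }

  /** Stand-in traverser, which just keeps the options it receives. */
  public static class Traverser
  {
    private Map<String, Object> options = Map.of ();

    public void setOptions ( Map<String, Object> options ) { this.options = Map.copyOf ( options ); }
    public Map<String, Object> getOptions () { return options; }
  }

  /** Old style: the traverser gets the whole options Map and picks what it needs. */
  public static Traverser initOldStyle ( Map<String, Object> allOptions )
  {
    Traverser traverser = new Traverser ();
    traverser.setOptions ( allOptions );
    return traverser;
  }

  /** New style: the traverser gets only its own options, taken from the config object. */
  public static Traverser initNewStyle ( KnetminerConfig config )
  {
    Traverser traverser = new Traverser ();
    traverser.setOptions ( config.getGraphTraverserOptions () );
    return traverser;
  }

  public static void main ( String[] args )
  {
    // "hypotheticalTraverserOption" is a made-up key, for illustration only
    KnetminerConfig config = new KnetminerConfig ( Map.of ( "hypotheticalTraverserOption", "value" ) );
    System.out.println ( initNewStyle ( config ).getOptions () );
  }
}
```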

marco-brandizi commented 1 year ago

I cleaned and completed this, and then moved it to Knetminer. We had circular dependency issues with keeping this in Ondex, so we have also had to drop the wrapper to invoke this component as an Ondex plug-in. It will only be available from the web app (via programmatic invocation) or as a command line tool (which reloads the OXL to work with).

marco-brandizi commented 1 year ago

This is not closed yet: the web application still needs to use this component and be cleaned of duplicated code. Also, we need to start using the CLI on real ETL pipelines.

jojicunnunni commented 1 year ago

@marco-brandizi Could you please outline the changes needed to use this component in the web application?

Arnedeklerk commented 1 year ago

I believe this is done? Please close if so @marco-brandizi