marco-brandizi commented 2 years ago

The Semantic Motif Initialiser

Summary

Introduction
Requirements
- The KnetMiner code
- The initialiser options

Introduction

The KnetMiner Gene Explorer is based on knowledge graphs, that is, datasets that are created and managed by means of our KnetBuilder framework, also named Ondex. This framework has two major component types:

a set of core components that allow for dealing with knowledge graphs (details below). Namely, ONDEXGraph keeps in RAM a graph of ONDEXConcept(s) (ie, nodes), linked together by binary Relation(s) (ie, edges). Concepts and relations have a set of properties, such as their data source or a list of key/value attributes.
a plug-in system, which can be used to define and run data workflows to read various data sources (eg, CSV tables, XML files, web APIs in JSON format) and translate these data into the above graph components. The typical workflow starts with initialising an empty graph, then multiple data parser/loader plug-ins populate the graph, and finally the graph is saved from memory to an OXL file (ie, an XML format based on our own schema, which reflects the graph components above). OXL files are then loaded into the KnetMiner web app (using an OXL parser), to serve the application functionality.
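
To make the data model above more concrete, here is a minimal sketch of an in-memory graph of concepts and relations. The classes ToyGraph, ToyConcept and ToyRelation are hypothetical stand-ins written for this note, they are not the actual Ondex API, and the data source names are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a concept (node); NOT the real ONDEXConcept API.
class ToyConcept {
    final String id;
    final String dataSource;
    // Key/value attributes attached to the concept.
    final Map<String, String> attributes = new LinkedHashMap<>();

    ToyConcept(String id, String dataSource) {
        this.id = id;
        this.dataSource = dataSource;
    }
}

// Hypothetical stand-in for a binary relation (edge) between two concepts.
class ToyRelation {
    final ToyConcept from;
    final ToyConcept to;
    final String type;

    ToyRelation(ToyConcept from, ToyConcept to, String type) {
        this.from = from;
        this.to = to;
        this.type = type;
    }
}

// Hypothetical in-memory graph, which only stores and counts its elements.
class ToyGraph {
    private final Map<String, ToyConcept> concepts = new LinkedHashMap<>();
    private final List<ToyRelation> relations = new ArrayList<>();

    ToyConcept createConcept(String id, String dataSource) {
        ToyConcept concept = new ToyConcept(id, dataSource);
        concepts.put(id, concept);
        return concept;
    }

    ToyRelation createRelation(ToyConcept from, ToyConcept to, String type) {
        ToyRelation relation = new ToyRelation(from, to, type);
        relations.add(relation);
        return relation;
    }

    int conceptCount() { return concepts.size(); }

    int relationCount() { return relations.size(); }
}
```

A parser plug-in would play the role of the code calling createConcept() and createRelation() while reading its data source.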

In addition to loading a (configured) OXL, the web app does much additional initialisation work, which mostly consists of:

A Lucene-based index is created on-disk to ease keyword-based searches over the OXL graph elements (eg, searching nodes by name or identifier)
A graph traversal step, which employs semantic motifs, ie, graph patterns, to navigate graph paths from genes to relevant entities. Details can be found here.
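
As a rough illustration of the traversal idea, the sketch below treats a semantic motif as a plain sequence of relation types and follows it from a gene to the entities it reaches. This is a deliberately simplified, hypothetical model (ToyMotifTraverser and the edge/relation names are invented here); the real traversers use a state machine or Cypher queries, as explained later.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy traverser: a motif is just an ordered list of relation types.
class ToyMotifTraverser {

    /**
     * Follows the motif from the given gene and returns the IDs reached at the end.
     * Each edge is an array of the form {fromId, relationType, toId}.
     */
    static Set<String> follow(List<String[]> edges, String geneId, List<String> motif) {
        Set<String> frontier = new LinkedHashSet<>();
        frontier.add(geneId);

        // One hop per relation type in the motif.
        for (String relationType : motif) {
            Set<String> next = new LinkedHashSet<>();
            for (String[] edge : edges) {
                if (frontier.contains(edge[0]) && edge[1].equals(relationType)) {
                    next.add(edge[2]);
                }
            }
            frontier = next;
        }
        return frontier;
    }
}
```

For instance, with the edges gene1 -encodes-> prot1 and prot1 -participates_in-> pathway1, the motif (encodes, participates_in) leads from gene1 to pathway1.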

As said above, currently, both the Lucene indexing and the traverser are invoked against a given dataset when the KnetMiner web application is started, ie, when its Docker container is started, which triggers the start of its Tomcat server, which starts the API/WS .war application (see the KnetMiner wiki for details).

After the traversal stage, the traverser output is saved on disk, and the web application skips the whole operation at later restarts if these output files are found in a configured location. This is possible because the result of the traversal operation is always the same for a given dataset/graph, and the web application uses these data in read-only mode. Similarly, the Lucene indexing is skipped if the corresponding Lucene directory is found under a configured path.
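
The "compute once, reuse at restart" behaviour described above can be sketched as follows. This is only an illustration: TraversalCache and the output file name are hypothetical, and the result is stored as plain text, whereas the actual code saves serialised Java objects.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Supplier;

// Hypothetical helper showing the skip-if-already-on-disk logic.
class TraversalCache {
    // Made-up file name, the real output files are named differently.
    static final String OUTPUT_FILE = "traversal-output.txt";

    /** Returns the saved result if it exists, else runs the traversal and saves its result. */
    static String loadOrCompute(Path dataDir, Supplier<String> traversal) {
        Path out = dataDir.resolve(OUTPUT_FILE);
        try {
            if (Files.exists(out)) {
                // Same dataset, same result: safe to reuse in read-only mode.
                return Files.readString(out);
            }
            String result = traversal.get();
            Files.createDirectories(dataDir);
            Files.writeString(out, result);
            return result;
        } catch (IOException ex) {
            throw new UncheckedIOException("Error while accessing " + out, ex);
        }
    }
}
```

Note that this is only valid as long as the saved files belong to the dataset being loaded, which is why the location is part of the configuration.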

Despite these optimisations, the web application spends a lot of time in this initialisation stage, which could be moved offline, so that we can create the traversal data once and for all, during the creation of the dataset by means of the KnetBuilder workflow system (aka, Ondex Mini).

So, the purpose of this document is to specify how to move the traversal-invoking code, which is currently inside the KnetMiner web service, to KnetBuilder, by developing proper wrappers to invoke it from the KnetBuilder framework (details below).

Some details about the traversing

This is to give a better understanding of the context for this task; it isn't strictly needed here.
Technically, we have a generic GraphTraverser interface and a (still) default implementation, which is based on the state machine model. Recently, we have started migrating to a Cypher-based implementation, which relies both on in-memory data encoded via Ondex components and on the same data stored in a Neo4j database. Which traverser flavour to use is decided via KnetMiner configuration.

Requirements

We need a component based on the architecture of many Ondex plug-ins, that is:

A core component, containing the functionality to invoke the traverser. This should contain the bare minimum to execute this functionality; other components should possibly stay elsewhere.
An Ondex plug-in wrapper, which defines the available options for the traverser (arguments, in the jargon of the Ondex plug-ins), and invokes the core component above. This will be used in Ondex Mini workflows, presumably with a graph that was built by previous steps (ie, other plug-ins) in a workflow.
A command-line (CLI) interface, which is another wrapper to the core. This should load an OXL file from CLI parameters and then pass it to the core traverser component above.

For the moment, do not create a direct dependency on the Cypher traverser artifact. The component should allow for the choice of this specific traverser (or the default) by means of a string representing its FQN (fully-qualified class name). We'll look at how to organise this dependency later (eg, optional download in the workflow binary).
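
Selecting an implementation through its FQN, without a compile-time dependency, is typically done via reflection. Below is a minimal sketch of that idea; ToyTraverser, DefaultToyTraverser and TraverserFactory are hypothetical names used for illustration, and it assumes the chosen class has a no-argument constructor.

```java
import java.util.List;

// Hypothetical minimal interface, standing in for the real GraphTraverser.
interface ToyTraverser {
    List<String> traverse(String geneId);
}

// Hypothetical default implementation, available on the classpath.
class DefaultToyTraverser implements ToyTraverser {
    @Override
    public List<String> traverse(String geneId) {
        return List.of(geneId + "->default");
    }
}

class TraverserFactory {
    /**
     * Instantiates the traverser named by the FQN. The class is resolved at runtime,
     * so the caller's module needs no build-time dependency on it.
     */
    static ToyTraverser create(String fqn) {
        try {
            Class<?> cls = Class.forName(fqn);
            // Assumes a no-argument constructor.
            return (ToyTraverser) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException ex) {
            throw new IllegalArgumentException("Cannot create traverser '" + fqn + "'", ex);
        }
    }
}
```

With this approach, a missing Cypher traverser .jar only causes an error when that specific FQN is requested, not when the component is loaded.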

A good (and recent) reference for the architecture outlined above is the graph descriptor component. In particular, note the details:

the core component (tests and usage examples). The shape of the new traversal component needs to be agreed, eg, SemanticMotifInitializer plus a method like init() (note I use American spelling when naming code units).
the plug-in. The new plug-in should subclass ONDEXExport, since this is the one closest to the meaning of the component at issue. Its implementation won't be much different from the descriptor example anyway.
the CLI interface
- Note that recently we started using the picocli library, which is one of the best around.
  - Also, note that the CLI component is a separate Maven project, since this has to include much stuff that isn't needed in the core package.
  - Moreover, see the CLI POM and the Maven Assembly descriptor file for a reference on how to organise the build of the final command line tool binary (this is a CLI package containing multiple tools/commands, not just the graph descriptor tool). The two components produce a .zip that, among other files, contains the /lib directory with all the needed runtime .jars, and one or more .sh wrappers to invoke the corresponding CLI class (copy-paste from oxl-descriptor.sh for the new .sh). See here for info on how these Ondex clients are arranged.

The KnetMiner code

What the new component has to do can be seen (and widely copy-pasted) from the current KnetMiner code. Namely:

The current entry point is OndexServiceProvider.initData( <path> )
This loads all application and traverser options from the (XML) property file it gets passed. This path will be a parameter for the new component (see below).
After the options loading, DataService.initGraph() is invoked. This loads the OXL dataset in memory, in an ONDEXGraph field. In the new component, this graph will be a class field (see below).
- See the loadGraph() code to get an idea of how Parser.loadOXL() is used for this.
- UIUtils.removeOldGraphAttributes ( graph ) is a KnetMiner-specific operation, not relevant here.
- The init of genomeGenesCount is instead needed (see below), so the new component will need to do this, reusing the current code.
Then, SearchService.indexOndexGraph() is invoked. This creates a Lucene index of many parts of the OXL graph, which is then used by KnetMiner to perform fast keyword-based searches. The new component needs to do the same. As you can see, this method gets info from the graph field and the data path defined in the options file.
After the indexing, we have semanticMotifDataService.initSemanticMotifData (). This initialises the traverser, using options above, and then invokes it. After that, results are saved into files, in the form of serialised Java objects. We need to replicate/move all the code you see in this method.
- We also need to support the doReset flag. The best choice I see for this is to replicate the same public initSemanticMotifData( doReset ) method in the main class SemanticMotifInitializer for the new component. This is because KnetMiner has a CypherDebugger component, used for testing and debugging purposes, which needs to trigger the data reinitialisation from this step only.
Finally, we have exportService.exportGraphStats(), which saves an XML file of statistics, obtained from both the OXL and the traversal results (which are used for visualisations like this). This has to be replicated to the new component too.
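
Putting the steps above in order, the new component could look roughly like the outline below. This is a hypothetical skeleton: every step only records its name, so that the sequence is visible, and it assumes that doReset means "redo the traversal even if saved results exist".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical outline of the initialisation sequence; steps are stubs that log their name.
class InitializerOutline {
    final List<String> log = new ArrayList<>();
    // Stands in for "traversal output files found in the data directory".
    private boolean traversalDataOnDisk = false;

    void initData(String optionsPath) {
        log.add("loadOptions:" + optionsPath);
        log.add("loadGraph");
        log.add("countGenomeGenes");
        log.add("indexGraph");
        initSemanticMotifData(false);
        log.add("exportGraphStats");
    }

    /** Public on purpose, so that a debugging tool can restart from this step only. */
    void initSemanticMotifData(boolean doReset) {
        if (traversalDataOnDisk && !doReset) {
            log.add("reuseTraversalData");
            return;
        }
        log.add("traverse");
        log.add("saveTraversalData");
        traversalDataOnDisk = true;
    }
}
```

Calling initSemanticMotifData( true ) after a first initData() run repeats the traversal and the saving, without reloading the graph or rebuilding the index.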

The initialiser options

The new component needs the same options/parameters that KnetMiner uses in the code described in the previous section. So, our new SemanticMotifInitializer will have this:

The ONDEXGraph to work with. The simplest solution is to make the component stateful and maintain a graph field for this, together with other stuff (eg, genomeGenesCount mentioned above).
The path to an options file. This contains several details needed to run the traverser (and KnetMiner), some are generic, some are traverser-specific. See an example in this template. In the core component, this will be a parameter of the init() method. In the plug-in, this will be an argument of type FileArgument.
- In the core component, define the init() implementation as a private bare method, accepting the parameter options of type OptionsMap. Then define the public wrapper using the options file path.
The input OXL to work with. In the core component, this is a parameter of type ONDEXGraph. The plug-in already has the graph field for this, so no other addition is needed. In the case of the CLI, this should be the -i/--oxl option. This should be an optional parameter, which, when defined, should override the DataFile option found in the options file.
The path of the output directory. This is where the traverser results have to be stored. Similarly to the input path, this should override the DataPath option in the options file. In the plug-in, it should be a FileArgument (with isDirectory == true) and in the CLI it should correspond to the -o/--data-dir option.
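
The override rule for the input OXL and the output directory can be sketched as below. The option names DataFile and DataPath come from the options file described above; the OptionOverrides class is hypothetical, and a plain Map is used here in place of OptionsMap.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: explicit parameters win over the values in the options file.
class OptionOverrides {

    /**
     * Returns a copy of the file options where DataFile and DataPath are replaced
     * by the explicit values, when these are given (ie, non-null).
     */
    static Map<String, String> apply(Map<String, String> fileOptions, String oxlPath, String dataDir) {
        Map<String, String> result = new HashMap<>(fileOptions);
        if (oxlPath != null) {
            result.put("DataFile", oxlPath); // eg, from -i/--oxl
        }
        if (dataDir != null) {
            result.put("DataPath", dataDir); // eg, from -o/--data-dir
        }
        return result;
    }
}
```

Working on a copy keeps the options read from the file intact, which makes it easier to log what was overridden.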

marco-brandizi commented 2 years ago

Skeleton for the core component (first step)

I've added this new module, which contains a skeleton of what we have to do and how we can arrange it.

Start from the core component. This can be completed by copy-pasting much code from KnetMiner, as explained above. Once a first version is ready, try it with the existing test and add assertions to the latter to check that things work.

Next steps

We'll work on these later, by first adding skeleton code to the project above and then completing them.

Write the plug-in wrapper
Write the command line client.

jojicunnunni commented 2 years ago

Remove the Lucene indexing from Knetminer

OndexServiceProvider.java

Inside the public void initData ( String configXmlPath )

Comment out these two lines, which have moved to KnetMinerInitializer.java in the knetminer-initializer module of the KnetBuilder project:

//this.searchService.indexOndexGraph ();
//this.semanticMotifDataService.initSemanticMotifData ();

@marco-brandizi Please review it and let me know if anything more needs to be handled here.

marco-brandizi commented 2 years ago

@jojicunnunni Yes, this is part of what needs to be replaced. ~exportGraphStats(), which you asked me about today, is to be left on the Knetminer side~ (good catch, thanks) see below, we need it on KnetminerInitializer.

See the notes that follow about the main things to do. Please, see the TODOs and try to have a look on your own to complete them.

Notes on removing the Knetminer code that was migrated to the initialiser component

Having an initialiser reference

We need an instance of KnetminerInitializer as a Spring @Component field attached to DataService
Name this like knetminerInitializer (KMI is an abbreviation for this field/component I use below)
TODO: which visibility? Probably private, with just a few delegating methods exposed in DataService

`OndexServiceProvider.initData ()`

The try/finally block (which coordinates the parallel initialisation, via isInitializingData), is to be kept
The code inside is to be replaced by KnetMinerInitializer.initKnetMinerData()
~except exportGraphStats(), which does a few Knetminer-specific things (release notes and alike)~
We need to migrate exportGraphStats() to KnetMinerInitializer, because from time to time, we have the need to export these statistics (in XML format) outside Knetminer.
- We need two flavours of this method: one that returns a string, another that saves the string into a file (saveGraphStats( path )).
- initKnetMinerData( opts ) should have a new option like graphStatsPath and this should be used if not null. It's fine to leave the plug-in and the CLI wrapper as-is (they can setup any option via options)
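
The two flavours of the statistics export, plus the optional path, could be arranged as in this sketch. Names follow the notes above where possible; the XML element names, the constructor and the counts are invented for illustration, and the real statistics are much richer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the stats export methods.
class GraphStatsSketch {
    private final int conceptCount;
    private final int relationCount;

    GraphStatsSketch(int conceptCount, int relationCount) {
        this.conceptCount = conceptCount;
        this.relationCount = relationCount;
    }

    /** Flavour 1: returns the statistics as an XML string (made-up element names). */
    String exportGraphStats() {
        return "<stats>\n"
            + "  <concepts>" + conceptCount + "</concepts>\n"
            + "  <relations>" + relationCount + "</relations>\n"
            + "</stats>\n";
    }

    /** Flavour 2: saves the same string into a file. */
    void saveGraphStats(Path path) {
        try {
            Files.writeString(path, exportGraphStats());
        } catch (IOException ex) {
            throw new UncheckedIOException("Cannot write graph stats to " + path, ex);
        }
    }

    /** The stats are saved only if the optional path is set. */
    void initKnetMinerData(Path graphStatsPath) {
        // ... graph loading, indexing and traversal would happen here ...
        if (graphStatsPath != null) {
            saveGraphStats(graphStatsPath);
        }
    }
}
```

Keeping the string-returning method as the base one lets Knetminer use the XML directly, while external pipelines use the file flavour.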

Bridging methods

We need to bridge (ie, write delegates) methods in DataService, OndexServiceProvider and related components with equivalents coming from the initialiser, eg
dataService.getGraph() as a delegate of knetminerInitializer.getGraph()
dataService.getOptions() same method in the KMI
Methods like getOXLPath(), which are already wrappers, redirected to the KMI
TODO: find what else needs the same treatment
Most of this is temporary, to be replaced by the new initialisation mechanism. That's why delegates are probably a quick-and-dirty solution that, for now, is better than exposing the KMI and changing all the invocations of these methods into invocations of the KMI.
For a few methods that don't have equivalents on KMI, eg, isReferenceGenome(), just keep the current code, replace this.options.* with knetminerInitializer.getOptions().*
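
The delegation approach described above could look like this sketch. ToyInitializer and ToyDataService are hypothetical stand-ins for the KMI and DataService, the graph is replaced by a string placeholder, the option keys are assumed for illustration, and Spring wiring is omitted.

```java
import java.util.Map;

// Hypothetical stand-in for the initialiser (KMI).
class ToyInitializer {
    private final String graph; // placeholder for the ONDEXGraph
    private final Map<String, String> options;

    ToyInitializer(String graph, Map<String, String> options) {
        this.graph = graph;
        this.options = options;
    }

    String getGraph() { return graph; }

    Map<String, String> getOptions() { return options; }
}

// Hypothetical stand-in for DataService, which keeps its API and delegates to the KMI.
class ToyDataService {
    // Private: callers keep using the DataService methods.
    private final ToyInitializer knetminerInitializer;

    ToyDataService(ToyInitializer knetminerInitializer) {
        this.knetminerInitializer = knetminerInitializer;
    }

    // Plain delegates.
    String getGraph() { return knetminerInitializer.getGraph(); }

    Map<String, String> getOptions() { return knetminerInitializer.getOptions(); }

    // Already a wrapper, now redirected to the KMI options (the key is an assumption).
    String getOXLPath() { return getOptions().get("DataFile"); }

    // No KMI equivalent: current logic kept, options now come from the KMI (key assumed).
    boolean isReferenceGenome() {
        return Boolean.parseBoolean(
            knetminerInitializer.getOptions().getOrDefault("reference_genome", "false"));
    }
}
```

Existing callers of dataService.getGraph() and similar don't change, which is the point of doing it this way for now.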

marco-brandizi commented 2 years ago

@jojicunnunni, on exportGraphStats() I've reviewed this, we need to move it into the initialiser, see above.

marco-brandizi commented 2 years ago

@jojicunnunni See the code, I've added various comments (with prefix TODO:newConfig) regarding how things should be aligned to the new configuration system. I think I touched all major points, but I might have missed some (eg, I didn't go through the command line package, but the changes needed there are quite clearly implied by the plugin it invokes).

Mostly, it's about replacing the old way of getting the options, based on a Java Map, with the new way, based on the KnetminerConfig object. Be aware that the traverser used to take its pertinent options directly from the old Map options (as a subset of them). Now it needs to receive KnetminerConfig.getGraphTraverserOptions(), which is still a Map. The Knetminer code base already has the new version of things like this.
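
To illustrate the change, here is a sketch where the traverser receives only its own options from a configuration object. ToyKnetminerConfig and ToyConfiguredTraverser are hypothetical, as are the option keys; only the getGraphTraverserOptions() name mirrors the real KnetminerConfig method mentioned above.

```java
import java.util.Map;

// Hypothetical stand-in for the new configuration object.
class ToyKnetminerConfig {
    private final String dataDirPath;
    private final Map<String, String> graphTraverserOptions;

    ToyKnetminerConfig(String dataDirPath, Map<String, String> graphTraverserOptions) {
        this.dataDirPath = dataDirPath;
        this.graphTraverserOptions = Map.copyOf(graphTraverserOptions);
    }

    // General options are exposed as typed getters...
    String getDataDirPath() { return dataDirPath; }

    // ...while traverser-specific options are still a Map.
    Map<String, String> getGraphTraverserOptions() { return graphTraverserOptions; }
}

// Hypothetical traverser, which only sees its own options.
class ToyConfiguredTraverser {
    private final Map<String, String> options;

    ToyConfiguredTraverser(Map<String, String> options) {
        this.options = options;
    }

    String option(String key, String defaultValue) {
        return options.getOrDefault(key, defaultValue);
    }
}
```

So, where the old code passed the whole options Map to the traverser, the new code would pass config.getGraphTraverserOptions().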

marco-brandizi commented 1 year ago

I cleaned and completed this, and then moved it to Knetminer. We had circular dependency issues with keeping this in Ondex, so we also had to drop the wrapper that invokes this component as an Ondex plug-in. It will only be available from the web app (via programmatic invocation) or as a command line tool (which reloads the OXL to work with).

marco-brandizi commented 1 year ago

This is not closed, the web application needs to use this component and be cleaned of duplicated code. Also, we need to start using the CLI on real ETL pipelines.

jojicunnunni commented 1 year ago

@marco-brandizi Could you please outline the changes needed to use this component in the web application?

Arnedeklerk commented 1 year ago

I believe this is done? Please close if so @marco-brandizi

Rothamsted / knetbuilder

Specification for an offline Graph Traverser and Semantic Motif Initialiser #55