Closed by marco-brandizi 1 year ago
I've added this new module, which contains a skeleton of what we have to do and how we can arrange it.
Start from the core component. This can be completed by copy-pasting much code from KnetMiner, as explained above. Once a first version is ready, try it with the existing test and add assertions to the latter to check that things work.
We'll work on these later, by first adding skeleton code to the project above and then completing them.
Remove the Lucene indexing from Knetminer

In `OndexServiceProvider.java`, inside `public void initData ( String configXmlPath )`, comment out these two lines. They have been moved to `KnetMinerInitializer.java`, in the knetminer-initializer module under the Knetbuilder project:

`//this.searchService.indexOndexGraph ();`
`//this.semanticMotifDataService.initSemanticMotifData ();`

@marco-brandizi Please review it and let me know if anything more needs to be handled here.
@jojicunnunni Yes, this is part of what needs to be replaced. ~`exportGraphStats()`, which you asked me about today, is to be left on the Knetminer side~ (good catch, thanks) see below, we need it on `KnetminerInitializer`.
See a few notes below about the main things to do. Please see the TODOs and try to have a look on your own to complete them.
- `KnetminerInitializer` as a Spring @Component field attached to DataService
- `knetminerInitializer` (KMI is an abbreviation for this field/component I use below)
- `DataService`
- `OndexServiceProvider.initData ()`
- `isInitializingData`), is to be kept
- `KnetMinerInitializer.initKnetMinerData()`
- ~`exportGraphStats()`, which does a few Knetminer-specific things (release notes and alike)~
- `exportGraphStats()` to `KnetMinerInitializer`, because from time to time, we have the need to export these statistics (in XML format) outside Knetminer.
- `saveGraphStats( path )`). `initKnetMinerData( opts )` should have a new option like `graphStatsPath` and this should be used if not null. It's fine to leave the plug-in and the CLI wrapper as-is (they can set up any option via `options`)
- `DataService`, `OndexServiceProvider` and related components with equivalents coming from the initialiser, eg
  - `dataService.getGraph()` as a delegate of `knetminerInitializer.getGraph()`
  - `dataService.getOptions()` same method in the KMI
  - `getOXLPath()`, which are already wrappers, redirected to the KMI
  - `isReferenceGenome()`, just keep the current code, replace `this.options.*` with `knetminerInitializer.getOptions().*`
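The delegation outlined in these notes could look roughly like the sketch below. It is only an illustration under assumptions: `Graph` is a stand-in for `ONDEXGraph`, options are a plain `Map`, the option key `reference_genome` is hypothetical, and the real components would be wired by Spring rather than by constructors.

```java
import java.util.Map;

// Stand-in for ONDEXGraph, only here to keep the sketch self-contained.
class Graph {}

// Simplified stand-in for the initialiser component (KMI).
class KnetminerInitializer
{
  private final Graph graph = new Graph ();
  private final Map<String, Object> options;

  KnetminerInitializer ( Map<String, Object> options ) {
    this.options = options;
  }

  Graph getGraph () { return graph; }
  Map<String, Object> getOptions () { return options; }
}

// In the web app this would be a Spring component with the KMI injected as a field.
class DataService
{
  private final KnetminerInitializer knetminerInitializer;

  DataService ( KnetminerInitializer knetminerInitializer ) {
    this.knetminerInitializer = knetminerInitializer;
  }

  // Pure delegates: the state lives in the initialiser only.
  Graph getGraph () { return knetminerInitializer.getGraph (); }
  Map<String, Object> getOptions () { return knetminerInitializer.getOptions (); }

  // Keeps its own logic, but reads the options via the KMI.
  // The "reference_genome" key is a hypothetical name for this sketch.
  boolean isReferenceGenome () {
    return Boolean.TRUE.equals ( knetminerInitializer.getOptions ().get ( "reference_genome" ) );
  }
}

class DelegationDemo
{
  public static void main ( String[] args )
  {
    KnetminerInitializer kmi = new KnetminerInitializer ( Map.of ( "reference_genome", true ) );
    DataService dataService = new DataService ( kmi );
    System.out.println ( "Same graph instance: " + ( dataService.getGraph () == kmi.getGraph () ) );
    System.out.println ( "Reference genome: " + dataService.isReferenceGenome () );
  }
}
```

The point of the arrangement is that `DataService` no longer owns a graph or an options object, so there is a single place where they are initialised.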
@jojicunnunni, on `exportGraphStats()`: I've reviewed this, we need to move it into the initialiser, see above.
@jojicunnunni See the code, I've added various comments (with prefix `TODO:newConfig`) regarding how things should be aligned to the new configuration system. I think I touched all major points, but I might have missed some (eg, I didn't go through the command line package, but the changes needed there are quite clearly implied by the plugin it invokes).

Mostly, it's about replacing the old way of getting the options, based on a Java `Map`, with the new way, based on the `KnetminerConfig` object. Be aware that the traverser used to take its pertinent options directly from the old `Map` `options` (as a subset of them); now it needs to receive `KnetminerConfig.getGraphTraverserOptions()`, which is still a `Map`. The Knetminer code base already has the new version of things like this.
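As a rough illustration of that change, the sketch below passes the traverser only the map returned by `getGraphTraverserOptions()`. The `KnetminerConfig` and `TraverserStub` classes here are hypothetical stand-ins written for this example, and the option name `exampleOption` is made up; only the method name `getGraphTraverserOptions()` comes from the comment above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the real KnetminerConfig object.
class KnetminerConfig
{
  private final Map<String, Object> graphTraverserOptions = new HashMap<> ();

  KnetminerConfig ( Map<String, Object> traverserOptions ) {
    graphTraverserOptions.putAll ( traverserOptions );
  }

  // The traverser-specific subset, still exposed as a Map.
  Map<String, Object> getGraphTraverserOptions () { return graphTraverserOptions; }
}

// Hypothetical traverser stub, which only records the options it receives.
class TraverserStub
{
  private Map<String, Object> options = Map.of ();

  void setOptions ( Map<String, Object> options ) { this.options = options; }
  Map<String, Object> getOptions () { return options; }
}

class ConfigMigrationDemo
{
  // Old style (not shown): the traverser was handed the global options Map.
  // New style: it is handed the dedicated subset from the config object.
  static TraverserStub configureTraverser ( KnetminerConfig config )
  {
    TraverserStub traverser = new TraverserStub ();
    traverser.setOptions ( config.getGraphTraverserOptions () );
    return traverser;
  }

  public static void main ( String[] args )
  {
    KnetminerConfig config = new KnetminerConfig ( Map.of ( "exampleOption", "exampleValue" ) );
    TraverserStub traverser = configureTraverser ( config );
    System.out.println ( traverser.getOptions () );
  }
}
```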
I cleaned and completed this, and then moved it to Knetminer. We had circular dependency issues with keeping this in Ondex, so we have also had to drop the wrapper that invoked this component as an Ondex plug-in. It will only be available from the web app (via programmatic invocation) or as a command line tool (which reloads the OXL to work with).
This is not closed, the web application needs to use this component and be cleaned of duplicated code. Also, we need to start using the CLI on real ETL pipelines.
@marco-brandizi Could you please outline the changes needed to use this component in the web application?
I believe this is done? Please close if so @marco-brandizi
The Semantic Motif Initialiser
Summary
Introduction
The KnetMiner Gene Explorer is based on knowledge graphs, that is, datasets that are created and managed by means of our KnetBuilder framework, also named Ondex. This framework has two major component types:
- `ONDEXGraph` allows for keeping in RAM a graph of `ONDEXConcept`(s) (ie, nodes), linked together by binary `Relation`(s) (ie, edges). Concepts and relations have a set of properties, such as their data source or a list of key/value attributes.

In addition to loading a (configured) OXL, the web app does much additional initialisation work, which mostly consists of:
As said above, currently, both the Lucene indexing and the traverser are invoked against a given dataset when the KnetMiner web application is started, ie, when its Docker container is started, which triggers the start of its Tomcat server, which starts the API/WS `.war` application (see the KnetMiner wiki for details).

After the traversal stage, the traverser output is saved on disk and the web application avoids redoing the whole operation at each restart, if these output files are found in a configured location. This is possible because the result of the traversal operation is always the same for a given dataset/graph, and the web application uses these data in read-only mode. Similarly, the Lucene indexing is skipped if the corresponding Lucene directory is found under a configured path.
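The "skip if already on disk" behaviour can be sketched as follows. This is not the KnetMiner code: the class name, the `Map<String, Set<String>>` result shape and the `concepts2Genes.ser` file name are assumptions chosen for the example. It only shows the pattern of reusing a serialised Java object when it exists and computing and saving it otherwise.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

class TraversalCache
{
  /**
   * Returns the traversal result stored in cacheFile, if that exists.
   * Otherwise runs the (expensive) traversal and saves its result for the next start.
   */
  @SuppressWarnings ( "unchecked" )
  static Map<String, Set<String>> loadOrCompute (
    Path cacheFile, Supplier<Map<String, Set<String>>> traversal
  ) throws IOException, ClassNotFoundException
  {
    if ( Files.exists ( cacheFile ) )
    {
      // The data are read-only and deterministic for a given graph, so reuse is safe.
      try ( ObjectInputStream in = new ObjectInputStream ( Files.newInputStream ( cacheFile ) ) ) {
        return (Map<String, Set<String>>) in.readObject ();
      }
    }

    Map<String, Set<String>> result = traversal.get ();
    try ( ObjectOutputStream out = new ObjectOutputStream ( Files.newOutputStream ( cacheFile ) ) ) {
      // Copy into a HashMap, to be sure the top-level container is serialisable.
      out.writeObject ( new HashMap<> ( result ) );
    }
    return result;
  }

  public static void main ( String[] args ) throws Exception
  {
    Path dir = Files.createTempDirectory ( "traversal-demo" );
    Path cacheFile = dir.resolve ( "concepts2Genes.ser" ); // hypothetical file name

    Supplier<Map<String, Set<String>>> traversal = () -> {
      System.out.println ( "Running the traversal..." );
      Map<String, Set<String>> m = new HashMap<> ();
      m.put ( "concept1", new HashSet<> ( List.of ( "geneA", "geneB" ) ) );
      return m;
    };

    loadOrCompute ( cacheFile, traversal ); // computes and saves
    loadOrCompute ( cacheFile, traversal ); // reloads from disk, no traversal
  }
}
```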
Despite the latter optimisations, the web application spends a lot of time in this initialisation stage, which could be moved offline, so that we could create the traversal data once and for all, during the creation of the dataset by means of the KnetBuilder workflow system (aka, Ondex Mini).
So, the purpose of this document is to move the traversal-invoking code that is currently inside the KnetMiner web service to KnetBuilder, by developing proper wrappers to invoke it from the KnetBuilder framework (details below).
Some details about the traversing
This is to get a better understanding of the context for this task, not strictly needed here.
Technically, we have a generic GraphTraverser interface and a (still) default implementation, which is based on the state machine model. Recently, we have started migrating to a Cypher-based implementation, which relies both on in-memory data encoded via Ondex components and on the same data stored in a Neo4j database. Which traverser flavour to use is decided via KnetMiner configuration.
Requirements
We need a component based on the architecture of many Ondex plug-ins, that is:
For the moment, do not create a direct dependency on the Cypher traverser artifact. The component should allow for the choice of this specific traverser (or the default) by means of a string representing its FQN. We'll look at how to organise this dependency later (eg, optional download in the workflow binary).
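Choosing the traverser by its FQN, without a compile-time dependency on the implementation, can be done with plain Java reflection. The following is a minimal sketch: the `GraphTraverser` interface and `DefaultTraverser` class below are stubs defined for this example, not the real Ondex/KnetMiner types, and the factory assumes that implementations have a no-argument constructor.

```java
// Stub of the traverser interface, only for this example.
interface GraphTraverser
{
  String describe ();
}

// Stub default implementation, standing in for the state machine traverser.
class DefaultTraverser implements GraphTraverser
{
  public String describe () { return "default"; }
}

class TraverserFactory
{
  /**
   * Instantiates the traverser named by fqn. The class only needs to be
   * on the classpath at runtime, there is no compile-time dependency on it.
   */
  static GraphTraverser create ( String fqn ) throws ReflectiveOperationException
  {
    Class<?> cls = Class.forName ( fqn );
    // asSubclass() fails with a ClassCastException if fqn isn't a GraphTraverser
    return cls.asSubclass ( GraphTraverser.class )
      .getDeclaredConstructor ()
      .newInstance ();
  }

  public static void main ( String[] args ) throws Exception
  {
    // In real use, the FQN would come from the options file
    GraphTraverser traverser = create ( DefaultTraverser.class.getName () );
    System.out.println ( traverser.describe () );
  }
}
```

With this approach, the Cypher traverser jar can be added to the runtime `/lib` directory later, and selected by changing only the configured class name.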
A good (and recent) reference for the architecture outlined above is the graph descriptor component. In particular, note the details:
- `SemanticMotifInitializer` plus a method like `init()` (note I use American spelling when naming code units).
- `ONDEXExport`, since this is the one closest to the meaning of the component at issue. Its implementation won't be much different from the descriptor example anyway.
- `.zip` that, among other files, contains the `/lib` directory with all the needed runtime `.jar`s, and one or more `.sh` wrappers to invoke the corresponding CLI class (copy-paste from `oxl-descriptor.sh` for the new .sh). See here for info on how these Ondex clients are arranged.

The KnetMiner code
What the new component has to do can be seen (and widely copy-pasted) from the current KnetMiner code. Namely:
- `OndexServiceProvider.initData( <path> )`
- `DataService.initGraph()` is invoked. This loads the OXL dataset in memory, in an `ONDEXGraph` field. In the new component, this graph will be a class field (see below).
- `loadGraph()` code to get an idea of how `Parser.loadOXL()` is used for this.
- `UIUtils.removeOldGraphAttributes ( graph )` is a KnetMiner-specific operation, not relevant here.
- `genomeGenesCount` is instead needed (see below), so the new component will need to do this, reusing the current code.
- `SearchService.indexOndexGraph()` is invoked. This creates a Lucene index of many parts of the OXL graph, which is then used by KnetMiner to perform fast keyword-based searches. The new component needs to do the same. As you can see, this method gets info from the `graph` field and the data path defined in the options file.
- `semanticMotifDataService.initSemanticMotifData ()`. This initialises the traverser, using options above, and then invokes it. After that, results are saved into files, in the form of serialised Java objects. We need to replicate/move all the code you see in this method.
- `doReset` flag. Best choice I see for this is to replicate the same public `initSemanticMotifData( doReset )` method in the main class `SemanticMotifInitializer` for the new component. This is because KnetMiner has a CypherDebugger component, used for testing and debugging purposes, which needs to trigger the data reinitialisation from this step only.
- `exportService.exportGraphStats()`, which saves an XML file of statistics, obtained from both the OXL and the traversal results (which are used for visualisations like this). This has to be replicated to the new component too.

The initialiser options
The new component needs the same options/parameters that KnetMiner uses in the code described in the previous section. So, our new `SemanticMotifInitializer` will have this:

- `ONDEXGraph` to work with. Simplest solution is to make the component stateful and maintain a `graph` field for this, together with other stuff (eg, `genomeGenesCount` mentioned above).
- `init()` method. In the plug-in, this will be an argument of type `FileArgument`.
- `init()` implementation as a private bare method, accepting the parameter `options` of type `OptionsMap`. Then define the public wrapper using the option's file path.
- `OndexGraph`. The plug-in already has the graph field for this, so no other addition is needed. In the case of the CLI, this should be the `-i`/`--oxl` option. This should be an optional parameter, which, when defined, should override the `DataFile` option, found in the options file.
- `DataPath` option in the options file. In the plug-in, it should be a `FileArgument` (with `isDirectory` == true) and in the CLI it should correspond to the `-o`/`--data-dir` option.
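To make the shape of these options more concrete, here is a hedged skeleton of such a stateful component. Several things are assumptions made for the sake of a runnable sketch: the options file is read as a Java `.properties` file (the real format may differ), options are held in a plain `Map` instead of `OptionsMap`, graph loading and indexing are reduced to comments, and the `doReset` logic uses an in-memory flag rather than the files on disk.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

class SemanticMotifInitializerSketch
{
  // State kept by the component (the real one would also hold the graph,
  // genomeGenesCount, etc).
  private String oxlPath;
  private String dataPath;
  private boolean traversalDone = false;
  private int traversalRuns = 0;

  /**
   * Public wrapper: reads the options file, then applies the optional overrides
   * (ie, what -i/--oxl and -o/--data-dir would pass from the CLI; null = not given).
   */
  public void init ( Path optionsFile, String oxlOverride, String dataDirOverride ) throws IOException
  {
    Properties props = new Properties ();
    try ( Reader reader = Files.newBufferedReader ( optionsFile ) ) {
      props.load ( reader );
    }

    Map<String, String> opts = new HashMap<> ();
    for ( String name: props.stringPropertyNames () )
      opts.put ( name, props.getProperty ( name ) );

    if ( oxlOverride != null ) opts.put ( "DataFile", oxlOverride );
    if ( dataDirOverride != null ) opts.put ( "DataPath", dataDirOverride );

    init ( opts );
  }

  /** Private bare method, working on already-loaded options. */
  private void init ( Map<String, String> opts )
  {
    this.oxlPath = opts.get ( "DataFile" );
    this.dataPath = opts.get ( "DataPath" );

    // Here the real component would load the OXL and build the Lucene index.
    initSemanticMotifData ( false );
  }

  /**
   * Kept public, so that a caller can re-trigger the traversal step only.
   * With doReset == false, existing results are reused.
   */
  public void initSemanticMotifData ( boolean doReset )
  {
    if ( traversalDone && !doReset ) return;
    traversalRuns++; // placeholder for the actual traversal and saving of results
    traversalDone = true;
  }

  public String getOxlPath () { return oxlPath; }
  public String getDataPath () { return dataPath; }
  public int getTraversalRuns () { return traversalRuns; }
}
```

Keeping the bare `init()` private and exposing only the file-based wrapper means that both the plug-in and the CLI go through the same override rules for `DataFile` and `DataPath`.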