Rothamsted / knetbuilder

KnetBuilder data integration platform for building knowledge graphs. Previously known as ondex.
https://knetminer.com
MIT License

Accession based mapper fails with NullPointerException #29

Closed: josephhearnshaw closed this issue 3 years ago

josephhearnshaw commented 4 years ago

The accession-based mapper fails with the following error:

Error: Could not open and parse the workflow file. See stack trace for details.
Expanded workflow can be found at: /tmp/ontologies.xml1459162435658455958expanded
java.lang.NullPointerException: Query must not be null
        at java.util.Objects.requireNonNull(Objects.java:228)
        at org.apache.lucene.search.BooleanClause.<init>(BooleanClause.java:60)
        at org.apache.lucene.search.BooleanQuery$Builder.add(BooleanQuery.java:138)
        at net.sourceforge.ondex.core.searchable.LuceneQueryBuilder.searchConceptByConceptAccessionExact(LuceneQueryBuilder.java:607)
        at net.sourceforge.ondex.core.searchable.LuceneQueryBuilder.searchConceptByConceptAccessionExact(LuceneQueryBuilder.java:563)
        at net.sourceforge.ondex.mapping.lowmemoryaccessionbased.Mapping.start(Mapping.java:239)
        at net.sourceforge.ondex.workflow.engine.Engine.runMapping(Engine.java:396)
        at net.sourceforge.ondex.workflow.engine.PluginProcessor$4.run(PluginProcessor.java:128)
        at net.sourceforge.ondex.workflow.engine.PluginProcessor$4.run(PluginProcessor.java:126)
        at net.sourceforge.ondex.workflow.engine.PluginProcessor.execute(PluginProcessor.java:83)
        at net.sourceforge.ondex.workflow.engine.BasicJobImpl.run(BasicJobImpl.java:110)
        at net.sourceforge.ondex.WorkflowMain.main(WorkflowMain.java:216)
        at net.sourceforge.ondex.OndexMiniMain.main(OndexMiniMain.java:7)

This happens with the Ondex Mini 3.0 release, using a particular OXL file as input.

Test workflow and data are in the knetminer share under test/mapping_bug/git_issue_29/

marco-brandizi commented 4 years ago

The problem is that Lucene doesn't like accessions like "-", or other strings made only of punctuation marks or spaces. I've added a better error message, which reports the field and the searched string. This way the collapser still fails, but at least with more details. I could make it carry on (ignoring the bad accession), but I think failing is better, because in 99% of cases accessions with such values originate from some error and it is safer to fix them. If '-' is meant to say 'no accession', the entry shouldn't be there at all.
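For illustration, a fail-fast check of the kind described above could look roughly like the sketch below. It is not the code that was actually committed: `AccessionCheck`, `isSearchable` and `requireSearchable` are hypothetical names, and the rule "at least one letter or digit" is an assumption about which accessions are problematic.

```java
// Hypothetical helper, not part of the Ondex code base.
final class AccessionCheck {
    private AccessionCheck() {}

    /**
     * Assumption: an accession is searchable only if it contains at least
     * one letter or digit, so "-", "..." or "   " are rejected.
     */
    static boolean isSearchable(String accession) {
        return accession != null
            && accession.codePoints().anyMatch(Character::isLetterOrDigit);
    }

    /**
     * Fails with a message that names the field and the searched string,
     * instead of letting a null query surface later as a NullPointerException.
     */
    static String requireSearchable(String field, String accession) {
        if (!isSearchable(accession)) {
            throw new IllegalArgumentException(
                "Cannot search field '" + field + "' for accession '" + accession
                + "': the value contains no letters or digits");
        }
        return accession;
    }
}
```

Calling `AccessionCheck.requireSearchable("conceptAccession", "-")` before building the query would then stop the workflow with the offending field and value in the message, which should make the bad entry easier to find in the input data.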

(the following ones are internal notes for Knetminer developers)

The new code is under (...)/software/ondex-desktop and requires Java 11 (the master branch requires that now). It's not worth retrofitting this little change into 3.0, since the problem is mainly in the data.

I've defined a new launch script in (...)/test/mapping_bug/git_issue_29/launch.sbatch. @josephhearnshaw, have a look at it for tips on how to improve such scripts, e.g., the use of relative paths.

By the way, I don't think setting -Xmx to the same value as #SBATCH --mem can work: the total memory for the submitted process needs room for both the JVM and its heap (plus the bytecode and other areas I don't remember), i.e., the limit passed to the JVM has to be smaller, or you risk SLURM killing the process.
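To make the arithmetic concrete, here is a small sketch that derives a heap limit from the SLURM allocation. `HeapBudget` and `suggestedXmxMb` are hypothetical names, and the headroom percentage is an assumption to be tuned per job, not a recommended value.

```java
// Hypothetical helper, only meant to illustrate the memory budget.
final class HeapBudget {
    private HeapBudget() {}

    /**
     * Returns a heap size in MB for -Xmx that leaves headroomPercent of the
     * SLURM allocation (the --mem value, in MB) for non-heap JVM memory.
     */
    static long suggestedXmxMb(long slurmMemMb, int headroomPercent) {
        if (slurmMemMb <= 0) {
            throw new IllegalArgumentException("slurmMemMb must be positive");
        }
        if (headroomPercent <= 0 || headroomPercent >= 100) {
            throw new IllegalArgumentException("headroomPercent must be between 1 and 99");
        }
        // Integer arithmetic rounds down, which errs on the safe side.
        return slurmMemMb * (100 - headroomPercent) / 100;
    }
}
```

For example, with `#SBATCH --mem=16G` (16384 MB) and an assumed 20% headroom, `HeapBudget.suggestedXmxMb(16384, 20)` gives 13107, so the script would pass `-Xmx13107m` to the JVM rather than `-Xmx16g`.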

As a general consideration, the old Lucene code in Ondex is a pain. I'm tempted to rewrite plug-ins like the mapper on top of simpler hashmaps (the mapper uses Lucene just to search accessions by identity). I'll see if these errors keep happening.
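A hashmap-based lookup along those lines might be sketched as follows. This is only an illustration of the idea, not a proposed design: `AccessionIndex` is a hypothetical class, concepts are reduced to integer IDs, and a real mapper would probably need to key on more than the bare accession string.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: exact-match accession index without Lucene.
final class AccessionIndex {
    // accession string -> IDs of the concepts that carry it
    private final Map<String, Set<Integer>> index = new HashMap<>();

    /** Registers a concept ID under the given accession. */
    void add(String accession, int conceptId) {
        index.computeIfAbsent(accession, k -> new LinkedHashSet<>()).add(conceptId);
    }

    /** Returns the concepts with exactly this accession (possibly empty). */
    Set<Integer> find(String accession) {
        return Collections.unmodifiableSet(
            index.getOrDefault(accession, Collections.emptySet()));
    }
}
```

Since the lookup is plain string equality, a value such as "-" is just another key here. Whether such accessions should be accepted at all is still a data-quality question, as discussed above.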