ge-high-assurance / RACK

DARPA's Automated Rapid Certification of Software (ARCOS) project called Rapid Assurance Curation Kit (RACK)
BSD 3-Clause "New" or "Revised" License

LM "string query" is very slow - Optimize #718

Closed cuddihyge closed 2 years ago

cuddihyge commented 2 years ago

RE: query_certgate_TopLevel-Claims-from-AcertReqModels in v10.2 against all three data graphs

        FROM <http://rack001/mitre-cwe>
        FROM <http://rack001/nist-800-53>
        FROM <http://rack001/model>

Separate from the fact that the query is bad, it is slow. 1) Can we further investigate the Jena query optimizer? 2) Are there any changes to SPARQL generation that would help?

cuddihyge commented 2 years ago

@ptival I could use another pair of eyes on the optimizer, as I haven't been able to convince myself it does anything.

I have downloaded apache-jena-4.4.0 and gotten the .bat files to work. (can't get the bash versions to resolve jars!)

I've done this:

cd whatever/run/databases/RACK
C:\Users\200001934\apache-jena-4.4.0\bat\tdbstats --loc=. --graph urn:x-arq:UnionGraph > temp.opt
move temp.opt stats.opt

And I've run different versions of the query with different optimizations based on the tdbstats quickstart docs.

But I can't convince myself that anything slows down or speeds up the query. Are you willing to take an independent shot at this?

cuddihyge commented 2 years ago

I do not believe SemTK should do a lot of optimization in its query generator since triplestores are supposed to do this and they all do it differently.

I have found that:

cuddihyge commented 2 years ago

Also see #717: I've asked @tuxji to help understand why the query is slow (60 seconds) on my machine but never finishes for Abha and Kit in their Docker containers. I'll mention @glguy because I'll bet you're curious about this whole issue.

cuddihyge commented 2 years ago

Tried Blazegraph. It crashes on one of the newer auto-generated SemTK queries that retrieves the ontology. Sigh. Opened an issue on the Blazegraph github.

Aborted mission to test Blazegraph.

Query that fails is in ontologyInfo.java "select distinct ?Property ?Domain ?Range ...."

cuddihyge commented 2 years ago

Fixed SemTK to work with Blazegraph and the LM query now returns in a few seconds. Note that time to load the ingestion package is longer. Also note that this is still a terrible query.

But we do have (opensource!) Blazegraph as a backup when we see performance problems. There are many additional performance optimizations on Blazegraph that I have not tried.

cuddihyge commented 2 years ago

Will open two other small-ish issues to maintain SemTK and RACK cli compatibility with Blazegraph

cuddihyge commented 2 years ago

We can't use Blazegraph without downloading and compiling with the Log4j issue fixed. This seems like a bad idea.

However Blazegraph's fast execution of the query does prove that Apache Jena Fuseki is doing something very very sub-optimal.

cuddihyge commented 2 years ago

Blazegraph demo is working on my laptop. Includes CONSTRUCT.

(Blazegraph does NOT contain log4j v2. However, it has gone 3 years without any updates, including Dependabot security pushes.)

cuddihyge commented 2 years ago

@Ptival has figured out that the Jena query optimizer does not work with multiple FROM clauses.

Although I could adjust the SPARQL generator to use FROM NAMED and GRAPH, this would change the meaning of queries across multiple graphs such that it is equivalent to running the entire query once for each graph (and disallowing matches that required some triples from each graph).
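To make the difference in meaning concrete, here is a minimal sketch in plain Python (not Jena, and not SemTK code). The graph names, triples, and helper functions are all hypothetical. It evaluates a two-clause join once against the merged graphs, as plain FROM does, and once separately inside each graph, as a single GRAPH ?g { ... } block over FROM NAMED graphs would.

```python
# Hypothetical toy data: two named graphs, each holding one triple.
GRAPHS = {
    "http://example/g1": {("req1", "satisfiedBy", "claim1")},
    "http://example/g2": {("claim1", "partOf", "model1")},
}

# Two triple patterns joined on ?c; terms starting with "?" are variables.
PATTERNS = [("?r", "satisfiedBy", "?c"), ("?c", "partOf", "?m")]


def match(pattern, triple, binding):
    """Extend binding so that pattern equals triple, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def evaluate(patterns, triples):
    """Return all bindings that satisfy every pattern against one triple set."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match(pattern, triple, binding)) is not None
        ]
    return bindings


def query_union(patterns, graphs):
    """Plain FROM: the listed graphs are merged into one default graph."""
    merged = set().union(*graphs.values())
    return evaluate(patterns, merged)


def query_per_graph(patterns, graphs):
    """One GRAPH ?g block: the whole pattern must match inside a single graph."""
    return [b for triples in graphs.values() for b in evaluate(patterns, triples)]
```

In this toy data the join needs one triple from each graph, so `query_union` finds one match and `query_per_graph` finds none, which is the lost-match case described above.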

This seems like a big road block for multiple graphs and Jena. Leaving us with:

  1. Can we figure out how to get Jena to optimize queries that need triples from different graphs
  2. Do we change the entire multiple graph strategy
  3. Do we switch to a triplestore that doesn't have this problem (e.g. Blazegraph has other problems, but not this one)

Hopefully we're wrong and we can accomplish (1). @ptival seems to be on a roll. Can you convince yourself and us whether there is any way to do (1) and optimize a query that requires triples across multiple graphs (NOT the entire query repeated once per graph)? I would imagine that wrapping every single clause in GRAPH ?g { <clause> } might work, but that is very awkward, and it may well engage the optimizer to no effect.
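For illustration, the per-clause wrapping could be generated roughly like this. This is a sketch only: `wrap_clauses_in_graphs` and the example predicates are hypothetical, not SemTK's generator. It assumes each clause gets its own graph variable (?g0, ?g1, ...), since reusing a single ?g would join on it and force all clauses back into the same graph. Whether Jena's optimizer does anything useful with the result is untested here.

```python
def wrap_clauses_in_graphs(select_vars, clauses, graphs):
    """Build a SELECT query where every triple pattern sits in its own GRAPH block."""
    from_named = [f"FROM NAMED <{g}>" for g in graphs]
    # A distinct variable per clause lets each clause match in a different graph.
    blocks = [f"  GRAPH ?g{i} {{ {clause} }}" for i, clause in enumerate(clauses)]
    lines = [
        f"SELECT DISTINCT {' '.join(select_vars)}",
        *from_named,
        "WHERE {",
        *blocks,
        "}",
    ]
    return "\n".join(lines)


# Hypothetical clauses; the graph names are two of those listed in this issue.
query = wrap_clauses_in_graphs(
    ["?r", "?m"],
    ["?r <http://example/satisfiedBy> ?c .", "?c <http://example/partOf> ?m ."],
    ["http://rack001/model", "http://rack001/nist-800-53"],
)
print(query)
```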

Ptival commented 2 years ago

To be more precise, the optimizer gets thwarted by the presence of any FROM clause (even just one, unfortunately).

I will think about whether there is a solution for (1) that does not involve doing lots of graph-scoping in the query. I agree with you that wrapping every single clause ought to work, but that sounds unpleasant.

cuddihyge commented 2 years ago

We created a longer-term issue, #737: can/should SemTK try optimizing using predicate counts?
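One naive reading of "optimizing using predicate counts" is to order triple patterns so the rarest predicate comes first. The sketch below assumes that reading. The function name, the predicates, and the counts are all made up for illustration, and it ignores join connectivity, which a real optimizer would have to consider.

```python
def order_by_predicate_count(patterns, predicate_counts):
    """Sort (subject, predicate, object) patterns so the rarest predicate comes first."""

    def cost(pattern):
        _, predicate, _ = pattern
        # Predicates with no known count (e.g. variables) are treated as most expensive.
        return predicate_counts.get(predicate, float("inf"))

    return sorted(patterns, key=cost)


# Hypothetical counts, as might be gathered with one COUNT query per predicate.
counts = {"rdf:type": 500000, "ex:satisfiedBy": 1200, "ex:partOf": 40}
patterns = [
    ("?r", "rdf:type", "ex:Requirement"),
    ("?r", "ex:satisfiedBy", "?c"),
    ("?c", "ex:partOf", "?m"),
]
ordered = order_by_predicate_count(patterns, counts)
```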

cuddihyge commented 2 years ago

I now have a local version of SemTK (which I will check in eventually since it is stable and low risk) that accepts the graph "urn:x-arq:DefaultGraph" as the default graph in any connection string. The sparql generator will not generate a FROM or USING clause if a connection only references the default graph. This will allow us to run some tests.
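The generator rule described here could look roughly like the following. This is a hypothetical Python sketch, not the actual SemTK code, and the function name is invented. The comment does not say what happens when the default graph is mixed with named graphs, so the sketch simply emits a FROM clause for every listed graph in that case.

```python
DEFAULT_GRAPH = "urn:x-arq:DefaultGraph"


def from_clauses(graphs):
    """Return the FROM clauses for a connection, or "" if only the default graph is used."""
    if all(g == DEFAULT_GRAPH for g in graphs):
        # No FROM at all, so the query runs against the store's default graph.
        return ""
    # Assumption: any other combination gets one FROM per listed graph.
    return "\n".join(f"FROM <{g}>" for g in graphs)
```

With this rule, `from_clauses(["urn:x-arq:DefaultGraph"])` yields an empty string, while a connection listing `http://rack001/model` still gets its FROM clause.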

cuddihyge commented 2 years ago

I've also hacked a local copy of the rack cli (which I most definitely will NOT check in) that

This allows me to run any ingestion scripts without modification and it all goes to the default graph.

Next step: run some query tests and optimizations

cuddihyge commented 2 years ago

@Ptival I got it to work (!)

Loading everything into the default graph at @Ptival's instruction leads to a reduction from 45 sec to 8 sec. Running tdbstats and writing the output to RACK/Data-0001/stats.opt reduces it further, to under 1 sec.

cuddihyge commented 2 years ago

Summary of what I needed to do to get this working

# First load into default graph, then:
$ export JENA_HOME=~/apache-jena-4.5.0
$ export PATH=$PATH:$JENA_HOME/bin
$ cd $JENA_HOME/lib
$ java -cp "*" tdb2.tdbstats --loc ~/apache-jena-fuseki-4.5.0/run/databases/RACK > ~/apache-jena-fuseki-4.5.0/run/databases/RACK/Data-0001/temp.opt
$ cd ~/apache-jena-fuseki-4.5.0/run/databases/RACK/Data-0001
$ mv temp.opt stats.opt
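The same steps could be scripted, for example as below. This is a sketch under assumptions: the helper names are hypothetical, `java` is assumed to be on the PATH, and the `Data-0001` subdirectory is taken from the commands above. As in the manual steps, the output goes to temp.opt first and is only renamed to stats.opt after tdbstats succeeds.

```python
import os
import subprocess
from pathlib import Path


def tdbstats_command(jena_home, db_dir):
    """Build the java invocation; java itself expands the "*" classpath wildcard."""
    classpath = str(Path(jena_home) / "lib" / "*")
    return ["java", "-cp", classpath, "tdb2.tdbstats", "--loc", str(db_dir)]


def write_stats(jena_home, db_dir, data_subdir="Data-0001"):
    """Run tdbstats into temp.opt, then rename it to stats.opt on success."""
    target_dir = Path(db_dir) / data_subdir
    temp_file = target_dir / "temp.opt"
    with open(temp_file, "w") as out:
        # check=True raises if tdbstats fails, so stats.opt is never half-written.
        subprocess.run(tdbstats_command(jena_home, db_dir), stdout=out, check=True)
    os.replace(temp_file, target_dir / "stats.opt")
```

A call might look like `write_stats(Path.home() / "apache-jena-4.5.0", Path.home() / "apache-jena-fuseki-4.5.0/run/databases/RACK")`.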

Changes we'd need to make to get this all working: