Closed: cuddihyge closed this issue 2 years ago
@ptival I could use another pair of eyes on the optimizer, as I haven't been able to convince myself it does anything.
I have downloaded apache-jena-4.4.0 and gotten the .bat files to work. (can't get the bash versions to resolve jars!)
I've done this:
cd whatever/run/databases/RACK
C:\Users\200001934\apache-jena-4.4.0\bat\tdbstats --loc=. --graph urn:x-arq:UnionGraph > temp.opt
move temp.opt to stats.opt
And I've run different versions of the query with different optimizations, based on the tdbstats quickstart docs.
But I can't convince myself anything slows down or speeds up the query. Are you willing to take an independent shot at this?
I do not believe SemTK should do a lot of optimization in its query generator since triplestores are supposed to do this and they all do it differently.
I have found that:
Also see #717, as I've asked @tuxji to help understand why the query is slow (about 60 seconds) on my machine but never finishes for Abha and Kit in their Docker containers. I'll mention @glguy because I'll bet you're curious about this whole issue.
Tried Blazegraph. It crashes on one of the newer auto-generated SemTK queries that retrieves the ontology. Sigh. Opened an issue on the Blazegraph GitHub.
Aborted mission to test Blazegraph.
Query that fails is in ontologyInfo.java "select distinct ?Property ?Domain ?Range ...."
Fixed SemTK to work with Blazegraph and the LM query now returns in a few seconds. Note that time to load the ingestion package is longer. Also note that this is still a terrible query.
But we do have (open-source!) Blazegraph as a backup when we see performance problems. There are many additional performance optimizations on Blazegraph that I have not tried.
Will open two other small-ish issues to maintain SemTK and RACK CLI compatibility with Blazegraph.
We can't use Blazegraph without downloading and compiling with the Log4j issue fixed. This seems like a bad idea.
However Blazegraph's fast execution of the query does prove that Apache Jena Fuseki is doing something very very sub-optimal.
Blazegraph demo is working on my laptop. Includes CONSTRUCT.
(Blazegraph does NOT contain log4j v2. However, it has gone 3 years without any updates, including Dependabot security pushes.)
@Ptival has figured out that the Jena query optimizer does not work with multiple FROM clauses.
Although I could adjust the SPARQL generator to use FROM NAMED and GRAPH, this would change the meaning of queries across multiple graphs such that it is equivalent to running the entire query once for each graph (and disallowing matches that required some triples from each graph).
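To make that difference concrete, here is a toy Python model of the two dataset semantics. It is only a sketch: the graph names, triples, and the two-clause pattern are hypothetical, and nothing here calls Jena or SemTK.

```python
# Toy model of SPARQL dataset semantics; not SemTK or Jena code.
# Graph names and triples are hypothetical.
DATASET = {
    "urn:graph:A": {("req1", "satisfiedBy", "claim1")},
    "urn:graph:B": {("claim1", "supportedBy", "evidence1")},
}


def match_chain(triples):
    """Match `?r satisfiedBy ?c . ?c supportedBy ?e` against one set of triples."""
    return sorted(
        (r, c, e)
        for (r, p1, c) in triples if p1 == "satisfiedBy"
        for (c2, p2, e) in triples if p2 == "supportedBy" and c2 == c
    )


def query_merged(dataset):
    """Multiple FROM clauses: the default graph is the merge of the listed graphs."""
    merged = set().union(*dataset.values())
    return match_chain(merged)


def query_per_graph(dataset):
    """FROM NAMED plus one GRAPH ?g { whole pattern }: evaluated inside each graph."""
    results = []
    for name in sorted(dataset):
        results.extend(match_chain(dataset[name]))
    return results


print(query_merged(DATASET))     # the cross-graph match is found
print(query_per_graph(DATASET))  # no single graph holds both triples
```

The merged version finds the requirement/claim/evidence chain because it can take one triple from each graph; the per-graph version finds nothing, which is the change in meaning described above.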
This seems like a big roadblock for multiple graphs and Jena, leaving us with:
Hopefully we're wrong and we can accomplish (1). @ptival seems to be on a roll. Can you convince yourself and us whether there is any way to do (1) and optimize a query that requires triples across multiple graphs (NOT the entire query repeated once per graph)? I would imagine that wrapping every single clause in GRAPH ?g { <clause> } might work, but that is very awkward and would probably engage the optimizer to no effect.
To be more precise, the optimizer gets thwarted by the presence of any FROM clause (even just one, unfortunately).
I will think about whether there is a solution for (1) that does not involve doing lots of graph-scoping in the query. I agree with you that wrapping every single clause ought to work, but that sounds unpleasant.
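For reference, the "wrap every clause" idea could be sketched as below. This is an illustration, not SemTK's generator: `wrap_clauses_in_graph` and the `ex:` patterns are hypothetical names. One assumption worth stating: each clause gets its own graph variable (?g0, ?g1, ...), because reusing a single ?g would join on it and force every triple to come from the same graph. Restricting those variables to specific graphs is left out, and whether Jena's optimizer actually does better on this form is untested here.

```python
def wrap_clauses_in_graph(clauses, var_prefix="g"):
    """Wrap each triple pattern in its own GRAPH block.

    Every block gets a distinct graph variable so that different clauses
    may match triples stored in different named graphs.
    """
    lines = []
    for i, clause in enumerate(clauses):
        # Drop surrounding whitespace and a trailing '.' so the block reads cleanly.
        pattern = clause.strip().rstrip(".").strip()
        lines.append(f"  GRAPH ?{var_prefix}{i} {{ {pattern} }}")
    return "WHERE {\n" + "\n".join(lines) + "\n}"


# Hypothetical two-clause pattern, for illustration only.
print(wrap_clauses_in_graph([
    "?req ex:satisfiedBy ?claim .",
    "?claim ex:supportedBy ?evidence .",
]))
```

Even this small example shows the unpleasant part: the graph scoping roughly doubles the size of the WHERE block.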
We created a longer-term issue, #737: can/should SemTK try optimizing using predicate counts?
I now have a local version of SemTK (which I will check in eventually since it is stable and low risk) that accepts the graph "urn:x-arq:DefaultGraph" as the default graph in any connection string. The sparql generator will not generate a FROM or USING clause if a connection only references the default graph. This will allow us to run some tests.
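The rule is roughly the following, sketched in Python rather than SemTK's Java. The function names and the example graph URI are hypothetical; only "urn:x-arq:DefaultGraph" comes from the description above. What happens when a connection mixes the default graph with other graphs is not described here, so the sketch simply emits FROM for every listed graph in that case.

```python
DEFAULT_GRAPH = "urn:x-arq:DefaultGraph"


def from_clauses(graphs):
    """Return FROM lines, or "" when the connection only uses the default graph."""
    if all(g == DEFAULT_GRAPH for g in graphs):
        return ""
    # Assumption: any other combination keeps one FROM per listed graph.
    return "\n".join(f"FROM <{g}>" for g in graphs)


def build_select(variables, graphs, where_body):
    """Assemble a SELECT query, skipping the FROM section when it is empty."""
    parts = ["SELECT DISTINCT " + " ".join(variables)]
    froms = from_clauses(graphs)
    if froms:
        parts.append(froms)
    parts.append("WHERE {\n  " + where_body + "\n}")
    return "\n".join(parts)


# Default graph only: no FROM clause is generated.
print(build_select(["?s"], [DEFAULT_GRAPH], "?s ?p ?o ."))
# Hypothetical named graph: a FROM clause is generated as before.
print(build_select(["?s"], ["http://example.org/data"], "?s ?p ?o ."))
```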
I've also hacked a local copy of the rack cli (which I most definitely will NOT check in) that
This allows me to run any ingestion scripts without modification and it all goes to the default graph.
Next step: run some query tests and optimizations
@Ptival I got it to work (!)
Loading everything into the default graph, at @Ptival's instruction, leads to a reduction from 45 sec to 8 sec. Running tdbstats and writing the output to RACK/Data-0001/stats.opt reduces it further to under 1 sec.
Summary of what I needed to do to get this working:
# First load into default graph, then:
$ export JENA_HOME=~/apache-jena-4.5.0
$ export PATH=$PATH:$JENA_HOME/bin
$ cd $JENA_HOME/lib
$ java -cp "*" tdb2.tdbstats --loc ~/apache-jena-fuseki-4.5.0/run/databases/RACK > ~/apache-jena-fuseki-4.5.0/run/databases/RACK/Data-0001/temp.opt
$ cd ~/apache-jena-fuseki-4.5.0/run/databases/RACK/Data-0001
$ mv temp.opt stats.opt
Changes we'd need to make to get this all working:
RE: query_certgate_TopLevel-Claims-from-AcertReqModels in v10.2 against all three data graphs
Separate from the fact that the query is bad, it is slow. 1) Can we further investigate the Jena query optimizer? 2) Are there any changes to SPARQL generation