Open Aklakan opened 3 weeks ago
This looks amazing, but it is also at the heart of the SPARQL engine. Could there be cases where joins are slower now than before? Do we need regression benchmarks for join performance? I simply don't feel confident enough to review this PR.
Do we need regression benchmarks for join performance?
My intention was to use this PR as a basis for regression testing of #2405, where the benchmark queries contain VALUE blocks that join on more than one variable. However, without this PR, the (non-indexed) linear joins are so slow that a workaround is needed to enforce hash joins. For example, hash joins are enforced by wrapping the right-hand-side with a LIMIT:
// The following extra test cases for TestClassify.java are GREEN _without_ this PR:
@Test public void testClassify_Join_values_plain() {
TestClassify.classifyJ(
"""
{
VALUES ?z { 0 }
VALUES ?z { 1 }
}
""",
true); // true means linear join
}
@Test public void testClassify_Join_values_wrapped() {
TestClassify.classifyJ(
"""
{
VALUES ?z { 0 }
{ SELECT * { VALUES ?z { 1 } } LIMIT 1000000000 }
}
""",
false); // false means no linear join (i.e.executed as hash join)
}
With this PR, the workaround is no longer needed.
Could there be cases where joins are slower now than before?
That's the central question, and it would be good to get feedback on this :)
So far I could not think of any good examples. The PR introduces a slight overhead for checking the operand types but I'd say this is outweighed by the overall performance improvement for queries that inline data using VALUE blocks.
As a further improvement for this PR, instead of adding the logic to detect table-operands to JoinClassifier, it could also be moved to TransformJoinStrategy - In that case, only if the original JoinClassifier returned "linear join" then an extra check for table operands would be made that could change the verdict to "hash join".
GitHub issue resolved #2535
Pull request Description: Update to JoinClassifier such that joins between tables are not linearized. The downside is, that this PR adds extra passes over the algebra tree to JoinClassifier in order to check for the operand types.
By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.
See the Apache Jena "Contributing" guide.