GH-2535: Always use hash joins when joining VALUES blocks

Aklakan commented 3 weeks ago

GitHub issue resolved #2535

Pull request Description: Update to JoinClassifier such that joins between tables are not linearized. The downside is, that this PR adds extra passes over the algebra tree to JoinClassifier in order to check for the operand types.

[x] Tests are included.
[x] Commits have been squashed to remove intermediate development commit messages.
[x] Key commit messages start with the issue number (GH-xxxx)

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.

See the Apache Jena "Contributing" guide.

arne-bdt commented 1 week ago

This looks amazing, but it is also at the heart of the SPARQL engine. Could there be cases where joins are slower now than before? Do we need regression benchmarks for join performance? I simply don't feel confident enough to review this PR.

Aklakan commented 1 week ago

Do we need regression benchmarks for join performance?

My intention was to use this PR as a basis for regression testing of #2405, where the benchmark queries contain VALUE blocks that join on more than one variable. However, without this PR, the (non-indexed) linear joins are so slow that a workaround is needed to enforce hash joins. For example, hash joins are enforced by wrapping the right-hand-side with a LIMIT:

    // The following extra test cases for TestClassify.java are GREEN _without_ this PR:

    @Test public void testClassify_Join_values_plain() {
        TestClassify.classifyJ(
        """
        {
          VALUES ?z { 0 }
          VALUES ?z { 1 }
        }
        """,
        true); // true means linear join
    }

    @Test public void testClassify_Join_values_wrapped() {
        TestClassify.classifyJ(
        """
        {
          VALUES ?z { 0 }
          { SELECT * { VALUES ?z { 1 } } LIMIT 1000000000 }
        }
        """,
        false); // false means no linear join (i.e.executed as hash join)
    }

With this PR, the workaround is no longer needed.

Could there be cases where joins are slower now than before?

That's the central question, and it would be good to get feedback on this :)

So far I could not think of any good examples. The PR introduces a slight overhead for checking the operand types but I'd say this is outweighed by the overall performance improvement for queries that inline data using VALUE blocks.

As a further improvement for this PR, instead of adding the logic to detect table-operands to JoinClassifier, it could also be moved to TransformJoinStrategy - In that case, only if the original JoinClassifier returned "linear join" then an extra check for table operands would be made that could change the verdict to "hash join".

apache / jena

GH-2535: Always use hash joins when joining VALUES blocks #2536