apache / jena

Apache Jena
https://jena.apache.org/
Apache License 2.0
1.08k stars 643 forks source link

Always use hash joins when joining VALUES blocks #2535

Open Aklakan opened 3 weeks ago

Aklakan commented 3 weeks ago

Version

5.1.0-SNAPSHOT

Feature

This is one more follow up towards support for multi variable join keys as triggered by the mail thread around https://www.mail-archive.com/users@jena.apache.org/msg20755.html

Related issues: #2412 and #2404

The mail mentions example sparql queries, such as the one below, that use value blocks to exemplify the issue with multi-variable joins.

# test.rq
select (count(*) as ?C) where {
  {
    select ?X ?Y (struuid() as ?UUID) where {
      values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
      values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
      bind ( ?X_i + 10 * ?X_j as ?X)
      values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
      values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
      bind ( ?Y_i + 10 * ?Y_j as ?Y) 
    } 
  } {
    select ?X ?Y where {
      {
        select ?X ?Y (rand() as ?RAND) where {
          values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
          values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
          bind ( ?X_i + 10 * ?X_j as ?X)
          values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
          values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
          bind ( ?Y_i + 10 * ?Y_j as ?Y) 
        } 
      } filter (?RAND < 0.95) 
    } 
  } 
}

Jena's JoinClassifier so far linearizes joins between values blocks, such as the given SPARQL example, which gets very slow for larger values blocks. An extra flag is necessary to force hash joins:

arq --explain --time --set arq:optIndexJoinStrategy=false --query test.rq

The goal of this issue is to modify JoinClassifier such that joins between table-based operands are automatically processed as hash joins rather than linear joins.

Are you interested in contributing a solution yourself?

Yes

Aklakan commented 3 weeks ago

Another use case is the SPARQL 4x4 Sudoku solver where the candidate field values are modeled in inline VALUES blocks. With hash joins, the example completes instantly. With linear joins the execution attempts to perform 4^16 = 4294967296 iterations.

PREFIX eg: <http://www.example.org/>

SELECT
  ?w11 ?w12 ?w13 ?w14
  ?w21 ?w22 ?w23 ?w24
  ?w31 ?w32 ?w33 ?w34
  ?w41 ?w42 ?w43 ?w44
WHERE {
  VALUES ?w11 { 1 2 3 4 } VALUES ?w12 { 1 2 3 4 } VALUES ?w13 { 1 2 3 4 } VALUES ?w14 { 1 2 3 4 }
  VALUES ?w21 { 1 2 3 4 } VALUES ?w22 { 1 2 3 4 } VALUES ?w23 { 1 2 3 4 } VALUES ?w24 { 1 2 3 4 }
  VALUES ?w31 { 1 2 3 4 } VALUES ?w32 { 1 2 3 4 } VALUES ?w33 { 1 2 3 4 } VALUES ?w34 { 1 2 3 4 }
  VALUES ?w41 { 1 2 3 4 } VALUES ?w42 { 1 2 3 4 } VALUES ?w43 { 1 2 3 4 } VALUES ?w44 { 1 2 3 4 }
FILTER(
  (?w11 != ?w12) && (?w11 != ?w13) && (?w11 != ?w14) && (?w12 != ?w13) && (?w12 != ?w14) && (?w13 != ?w14) &&
  (?w21 != ?w22) && (?w21 != ?w23) && (?w21 != ?w24) && (?w22 != ?w23) && (?w22 != ?w24) && (?w23 != ?w24) &&
  (?w31 != ?w32) && (?w31 != ?w33) && (?w31 != ?w34) && (?w32 != ?w33) && (?w32 != ?w34) && (?w33 != ?w34) &&
  (?w41 != ?w42) && (?w41 != ?w43) && (?w41 != ?w44) && (?w42 != ?w43) && (?w42 != ?w44) && (?w43 != ?w44) &&

  (?w11 != ?w21) && (?w11 != ?w31) && (?w11 != ?w41) && (?w21 != ?w31) && (?w21 != ?w41) && (?w31 != ?w41) &&
  (?w12 != ?w22) && (?w12 != ?w32) && (?w12 != ?w42) && (?w22 != ?w32) && (?w22 != ?w42) && (?w32 != ?w42) &&
  (?w13 != ?w23) && (?w13 != ?w33) && (?w13 != ?w43) && (?w32 != ?w33) && (?w23 != ?w43) && (?w33 != ?w43) &&
  (?w14 != ?w24) && (?w14 != ?w43) && (?w14 != ?w44) && (?w42 != ?w34) && (?w24 != ?w44) && (?w43 != ?w44) &&

  (?w11 != ?w22) && (?w12 != ?w21 ) &&
  (?w13 != ?w24) && (?w23 != ?w14 ) &&
  (?w31 != ?w42) && (?w32 != ?w41 ) &&
  (?w33 != ?w44) && (?w34 != ?w43 ) &&

  # Initial field
  (?w11 = 1 ) &&
  (?w22 = 2 ) &&
  (?w33 = 4 ) &&
  (?w44 = 3 )
)
}
rvesse commented 3 weeks ago

Another use case is the SPARQL 4x4 Sudoku solver where the candidate field values are modeled in inline VALUES blocks.

So I've got this hammer...

Aklakan commented 3 weeks ago

It's just an example that has been occasionally used in Semantic Web lectures and demos. So "academic exercise" is a better wording over "use case".

I recalled that some years ago I noted that this query does not work Jena when written like this; I wondered whether the proposed PR would make it work and the answer is yes - so I considered it worth mentioning.

Examples for Sudoku and SPARQL: [1] https://www.semantic-web-grundlagen.de/w/images/a/a6/GSW-Uebung4Loesung.pdf (page 9/10) [2] https://community.sap.com/t5/additional-blogs-by-members/semantic-web-technologies-part-1-sparql/ba-p/12903851 [3] https://stackoverflow.com/questions/44789385/solving-3x3-sudoku-with-sparql-and-owl