blazegraph / database

Blazegraph High Performance Graph Database
GNU General Public License v2.0
898 stars 173 forks source link

Efficient Sparql query containing "filter not exists" #198

Closed Craphtex closed 3 years ago

Craphtex commented 3 years ago

I got some issues making efficient querying of data containing "filter not exists", and hope someone have an idea of how to make it run without errors. My description and data below contains some blockchain jargon, if that makes it hard to understand I'll translate that all to carrots and potatoes if desired.

Software OS: Ubuntu 18.04.5 LTS BlazeGraph: 2.1.5 (via docker openkbs/blazegraph-docker)

Hardware CPU: AMD Ryzen 7 1800X (8 core @ 3.6GHz) RAM: 4x16GiB (DDR4 @ 2133 MHz) GPU: TITAN Xp (12GiB GDDR5X)

Data I've loaded about 120'000'000 tuples of blockchain data using TTL files in the following format

# Blocks
c:B3 a ct:Block ;
  ct:number "1"^^xsd:integer ;
  ct:epoch "0"^^xsd:integer ;
  ct:timestamp "2017-09-23 21:44:51"^^xsd:dateTime .
# Transaction
c:T14506 a ct:Transaction ;
  ct:fee "0"^^xsd:integer ;
  ct:deposit "0"^^xsd:integer ;
  ct:block c:B3316 .
# Outputs
c:O25816 a ct:Output ;
  ct:amount "93148937194773"^^xsd:integer ;
  ct:producedBy c:T22726 .
# Inputs
c:O1953708 a ct:Output ;
  ct:amount "2247872000"^^xsd:integer ;
  ct:consumedBy c:T889104 .

Efficient query This is very direct query and resolves very very quickly. It fetches some meta data given an output id (here c:O54321).

SELECT ?amount ?time ?epoch
WHERE {
  c:O54321 a ct:Output ;
          ct:amount ?amount ;
          ct:producedBy/ct:block ?block .
  ?block a ct:Block ;
         ct:timestamp ?time ;
         ct:epoch ?epoch .
}

Inefficient query (without filter) This query is doing the most restrictive thing on the last line, and it resolves consistently in about a minute. It fetches all outputs created in a given epoch (here 1).

SELECT ?out ?amount ?time
WHERE {
  ?out a ct:Output ;
         ct:amount ?amount ;
         ct:producedBy/ct:block ?block .
  ?block a ct:Block ;
         ct:timestamp ?time ;
         ct:epoch 1 .
}

Inefficient query (with filter) This query though contains a "filter not exists" early in the query, and it is too slow to execute. It fetches all outputs with some meta data for all outputs not consumed by a transaction (this is called a UTxO set in blockchain land).

SELECT ?out ?amount ?time ?epoch
WHERE {
  ?out a ct:Output .
  FILTER NOT EXISTS { ?out ct:consumedBy ?_ }
  ?out :amount ?amount ;
       ct:producedBy/ct:block ?block .
  ?block a ct:Block ;
         ct:timestamp ?time ;
         ct:epoch ?epoch .
}
LIMIT 5

The issue When I ran the "Inefficient query (with filter)" above I got varying results from time to time. Sometimes it resolved in a minute, other times it resolved after 15, or crashing with a "java.lang.OutOfMemoryError: Java heap space" error (after about 30 minutes).

This I tried to solve by giving BlzaGraph more ram, which I did by changing JVM_OPTS in conf/blacegraph from -Xmx4g to -Xmx32g. Like this:

...
JVM_OPTS="-Djava.awt.headless=true -server -Xmx32g -XX:MaxDirectMemorySize=3000m -XX:+UseG1GC"
...

Now however now I instead get an error (still when running the "Inefficient query (with filter)", the "Efficient query" and "Inefficient query (without filter)" still work fine) stating

ERROR: 

(nothing more and nothing less)

I'm a bit stuck here and on the brink of giving up on BlazeGraph in favor of trying something else, so any nudge in the right directing would be appreciated. Anything from a better way of writing the query, to restructuring my data.

Thanks in advance! :pray:

thompsonbry commented 3 years ago

One place to look is the EXPLAIN of the plans. Blazegraph supports both the explain of a fully executed plan and the live view of the explain of a currently running plan. The wiki has information on how to interpret and use the explain. Among other things, this should help you understand where the memory burden is appearing.

Bryan

On Tue, Mar 9, 2021 at 23:47 Christian Lindgren @.***> wrote:

I got some issues making efficient querying of data containing "filter not exists", and hope someone have an idea of how to make it run without errors. My description and data below contains some blockchain jargon, if that makes it hard to understand I'll translate that all to carrots and potatoes if desired.

Software OS: Ubuntu 18.04.5 LTS BlazeGraph: 2.1.5 (via docker openkbs/blazegraph-docker)

Hardware CPU: AMD Ryzen 7 1800X (8 core @ 3.6GHz) RAM: 4x16GiB (DDR4 @ 2133 MHz) GPU: TITAN Xp (12GiB GDDR5X)

Data I've loaded about 120'000'000 tuples of blockchain data using TTL files in the following format

Blocks

c:B3 a ct:Block ;

ct:number "1"^^xsd:integer ;

ct:epoch "0"^^xsd:integer ;

ct:timestamp "2017-09-23 21:44:51"^^xsd:dateTime .

Transaction

c:T14506 a ct:Transaction ;

ct:fee "0"^^xsd:integer ;

ct:deposit "0"^^xsd:integer ;

ct:block c:B3316 .

Outputs

c:O25816 a ct:Output ;

ct:amount "93148937194773"^^xsd:integer ;

ct:producedBy c:T22726 .

Inputs

c:O1953708 a ct:Output ;

ct:amount "2247872000"^^xsd:integer ;

ct:consumedBy c:T889104 .

Efficient query This is very direct query and resolves very very quickly. It fetches some meta data given an output id (here c:O54321).

SELECT ?amount ?time ?epoch WHERE {

c:O54321 a ct:Output ;

      ct:amount ?amount ;

      ct:producedBy/ct:block ?block .

?block a ct:Block ;

     ct:timestamp ?time ;

     ct:epoch ?epoch .

}

Inefficient query (without filter) This query is doing the most restrictive thing on the last line, and it resolves consistently in about a minute. It fetches all outputs created in a given epoch (here 1).

SELECT ?out ?amount ?time WHERE {

?out a ct:Output ;

     ct:amount ?amount ;

     ct:producedBy/ct:block ?block .

?block a ct:Block ;

     ct:timestamp ?time ;

     ct:epoch 1 .

}

Inefficient query (with filter) This query though contains a "filter not exists" early in the query, and it is too slow to execute. It fetches all outputs with some meta data for all outputs not consumed by a transaction (this is called a UTxO set in blockchain land).

SELECT ?out ?amount ?time ?epoch WHERE {

?out a ct:Output .

FILTER NOT EXISTS { ?out ct:consumedBy ?_ }

?out :amount ?amount ;

   ct:producedBy/ct:block ?block .

?block a ct:Block ;

     ct:timestamp ?time ;

     ct:epoch ?epoch .

} LIMIT 5

The issue When I ran the "Inefficient query (with filter)" above I got varying results from time to time. Sometimes it resolved in a minute, other times it resolved after 15, or crashing with a "java.lang.OutOfMemoryError: Java heap space" error (after about 30 minutes).

This I tried to solve by giving BlzaGraph more ram, which I did by changing JVM_OPTS in conf/blacegraph from -Xmx4g to -Xmx32g. Like this:

...

JVM_OPTS="-Djava.awt.headless=true -server -Xmx32g -XX:MaxDirectMemorySize=3000m -XX:+UseG1GC"

...

Now however now I instead get an error (still when running the "Inefficient query (with filter)", the "Efficient query" and "Inefficient query (without filter)" still work fine) stating

ERROR:

(nothing more and nothing less)

I'm a bit stuck here and on the brink of giving up on BlazeGraph in favor of trying something else, so any nudge in the right directing would be appreciated. Anything from a better way of writing the query, to restructuring my data.

Thanks in advance! 🙏

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/blazegraph/database/issues/198, or unsubscribe https://github.com/notifications/unsubscribe-auth/AATW7YBUDP2ABUFDTML4JO3TC4PYFANCNFSM4Y5M6R3Q .

Craphtex commented 3 years ago

A bit belated thanks for the pointer to the EXPLAIN feature. After quite a bit of head scratching I've restructured the data which has sped things up.

A word of caution for anyone out there (and especially my future self), a query like

...
  FILTER NOT EXISTS { ?out ct:consumedBy ?_ }
...
LIMIT 5

when there are no vertexes matching the NOT EXISTS constraint will take very long to execute.