
WDBench

In this repository you can find the data files and queries used in the benchmarking section for WDBench.

Table of contents

- Wikidata data
- Data loading
- Wikidata Queries
- Running the benchmark

Wikidata data

The data used in this benchmark is based on the Wikidata Truthy dump from 2021-06-23. We cleaned the data by removing all triples whose predicate is not a direct property (i.e., http://www.wikidata.org/prop/direct/P*). The data is available to download from Figshare.

The script used to generate this data from the original dump is in our source folder.

Data loading

Data loading for Apache Jena

1. Prerequisites

Apache Jena requires a Java JDK (we used OpenJDK 11; other versions might work as well).

The installation may be different depending on your Linux distribution. For Debian/Ubuntu based distributions:
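
For example, on Debian/Ubuntu the JDK can be installed via apt (package names vary between releases; openjdk-11-jdk is used here to match the version above):

sudo apt update
sudo apt install openjdk-11-jdk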

2. Download Apache Jena

You can download Apache Jena from their website. The file you need to download will look like apache-jena-4.X.Y.tar.gz; we used version 4.1.0, but this should also work for newer versions.
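
For example, version 4.1.0 can be fetched from the Apache archive (the URL below is for 4.1.0; adjust it for other versions):

wget https://archive.apache.org/dist/jena/binaries/apache-jena-4.1.0.tar.gz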

3. Extract and change into the project folder
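
Assuming the 4.1.0 tarball downloaded above:

tar -xzf apache-jena-4.1.0.tar.gz
cd apache-jena-4.1.0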

4. Execute the bulk import
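
A sketch of the invocation, assuming the data file downloaded from Figshare is named truthy_direct_properties.nt and the database is written to a tdb folder (both paths are placeholders):

bin/tdbloader2 --loc tdb truthy_direct_properties.nt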

5. Import for leapfrog version

This step is necessary only if you want to use the Leapfrog Jena implementation; otherwise you can skip it.

Edit the text file bin/tdbloader2index and search for the lines:

generate_index "$K1 $K2 $K3" "$DATA_TRIPLES" SPO

generate_index "$K2 $K3 $K1" "$DATA_TRIPLES" POS

generate_index "$K3 $K1 $K2" "$DATA_TRIPLES" OSP

After those lines add:

generate_index "$K1 $K3 $K2" "$DATA_TRIPLES" SOP

generate_index "$K2 $K1 $K3" "$DATA_TRIPLES" PSO

generate_index "$K3 $K2 $K1" "$DATA_TRIPLES" OPS

Now you can execute the bulk import in the same way as before:
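
For example, writing the leapfrog-enabled database to a separate folder (paths are placeholders):

bin/tdbloader2 --loc tdb_leapfrog truthy_direct_properties.nt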

In order to run the benchmark for Leapfrog Jena you also need to use a custom fuseki-server.jar.

Data loading for Virtuoso

1. Edit the .nt file

Virtuoso has a problem with geo-datatypes, so we generated a new .nt file in which these values are modified to prevent them from being parsed as geo-datatypes.
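
A sketch of one way to do this, assuming the offending values are the literals typed as geosparql wktLiteral (file names are placeholders):

# strip the datatype annotation so the values are parsed as plain strings
sed 's|\^\^<http://www.opengis.net/ont/geosparql#wktLiteral>||g' truthy_direct_properties.nt > truthy_direct_properties_no_geo.nt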

2. Download Virtuoso

You can download Virtuoso from their GitHub. We used Virtuoso Open Source Edition, version 7.2.6.

3. Create configuration file
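
The key part for loading a dataset of this size is the memory configuration in virtuoso.ini. As a sketch, the values below follow the ratios suggested in the Virtuoso documentation for a machine with around 32 GB of RAM (tune them to your hardware):

[Parameters]
NumberOfBuffers = 2720000
MaxDirtyBuffers = 2000000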

4. Load the data
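
A sketch using Virtuoso's bulk loader from isql (port, credentials, path and graph IRI are placeholders):

isql 1111 dba dba exec="ld_dir('/path/to/data', 'truthy_direct_properties_no_geo.nt', 'http://wikidata.org');"
isql 1111 dba dba exec="rdf_loader_run();"
isql 1111 dba dba exec="checkpoint;"

Here ld_dir registers the file to load, rdf_loader_run performs the actual load, and checkpoint makes the result durable.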

Data loading for Blazegraph

1. Prerequisites

You'll need some prerequisites installed; the installation may be different depending on your Linux distribution. For Debian/Ubuntu based distributions:
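
At a minimum you will need a JDK and Maven to build and run Blazegraph, e.g.:

sudo apt update
sudo apt install default-jdk maven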

2. Split .nt file into smaller files

Blazegraph can't load big files in a reasonable time, so we need to split the .nt file into smaller files (1 million triples each):
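
For example, with the standard split utility (file and folder names are placeholders; N-Triples files have one triple per line, so splitting by line count is safe):

mkdir -p nt_splits
split -l 1000000 truthy_direct_properties.nt nt_splits/wikidata_split_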

3. Clone the Git repository and build
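
A sketch, assuming the sources at github.com/blazegraph/database and a standard Maven build:

git clone https://github.com/blazegraph/database.git
cd database
mvn install -DskipTests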

4. Edit the default script

5. Load the split data
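
One way to do this (a sketch, assuming a Blazegraph server is already running on the default port 9999) is to POST each chunk to the SPARQL endpoint:

for f in nt_splits/*; do
  curl -s -X POST -H 'Content-Type: text/plain' --data-binary "@$f" http://localhost:9999/blazegraph/sparql
done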

Data loading for Neo4J

1. Download Neo4J
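
For example (the version shown is illustrative; any 4.x community tarball should behave similarly):

wget https://dist.neo4j.org/neo4j-community-4.3.2-unix.tar.gz
tar -xzf neo4j-community-4.3.2-unix.tar.gz
cd neo4j-community-4.3.2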

2. Edit configuration file

Edit the text file conf/neo4j.conf
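
At a minimum, the database created by the bulk import below must be the active one; assuming a Neo4j 4.x configuration, that means setting:

dbms.default_database=wikidata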

3. Convert .nt to .csv files

Use the script nt_to_neo4j.py to generate the .csv files entities.csv, literals.csv, and edges.csv.

4. Bulk import and index

Execute the data import (the command assumes that the .csv files are in the wikidata_csv folder):

bin/neo4j-admin import --database wikidata \
 --nodes=Entity=wikidata_csv/entities.csv \
 --nodes wikidata_csv/literals.csv \
 --relationships wikidata_csv/edges.csv \
 --delimiter "," --array-delimiter ";" --skip-bad-relationships true

Now we have to create the index for entities:
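
A sketch in Cypher, assuming each Entity node stores its Wikidata identifier in an id property (the property name depends on the CSV headers produced by nt_to_neo4j.py):

CREATE INDEX entity_id FOR (n:Entity) ON (n.id);

This can be run from bin/cypher-shell once the server is started.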

Wikidata Queries

In this benchmark we have 5 sets of queries:

- Single BGPs
- Multiple BGPs
- Optionals
- Property Paths
- Navigational graph patterns (C2RPQs)

We provide the SPARQL queries in our queries folder. We also provide the equivalent Cypher property paths (this set has fewer queries because some property paths cannot be expressed in Cypher).

Single BGPs, Multiple BGPs and Property Paths are based on real queries extracted from the Wikidata SPARQL query log. This log contains millions of queries, but many of them are trivial to evaluate. We thus decided to generate our benchmark from more challenging cases, i.e., a smaller log of queries that timed out on the Wikidata public endpoint. From these queries we extracted their BGPs and property paths, removing duplicates (modulo isomorphism on query variables). Then we filtered with the same criteria that we applied to the data, removing all queries having predicates that are not a direct property (http://www.wikidata.org/prop/direct/P*). Next, for property paths we removed queries in which both subject and object are variables, and for BGPs we removed queries having a triple in which subject, predicate and object are all variables. Finally, we distinguish BGP queries consisting of a single triple pattern (Single BGPs) from those containing more than one triple pattern (Multiple BGPs).

Running the benchmark

Here we provide a description of the scripts we used for the execution of the queries.

Our scripts will execute a list of queries for a certain engine, one at a time, and record the time and number of results for each query in a CSV file.

Every time you want to run a benchmark script you must first clear the system cache. To do this, run as root:
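
On Linux this is typically:

sync
echo 3 > /proc/sys/vm/drop_caches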

There are 2 scripts for executing the benchmarks: one for SPARQL engines and another for NEO4J. They are placed in the Execution folder.

Each script has a parameters section near the beginning of the file (e.g., database paths, output folder); make sure to edit the script to set them properly.

To execute the benchmark for a SPARQL engine you need to pass 4 parameters, as in the example below:

  1. The engine name (JENA, BLAZEGRAPH or VIRTUOSO).
  2. The limit for the queries.
  3. The absolute path to the query file.
  4. A name of your choice to use as a prefix for the output file.

E.g.
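
(The script and file names below are illustrative placeholders; use the SPARQL script provided in the Execution folder.)

python3 sparql_benchmark.py JENA 100000 /path/to/queries/multiple_bgps.txt jena_multiple_bgps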

To execute the benchmark for NEO4J you need to manually start the server after clearing the cache. Then you can run the script, passing 3 parameters (see the example below):

  1. The path (absolute or relative) to the query file.
  2. The limit for the queries.
  3. A name of your choice to use as a prefix for the output file.

E.g.
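
(Again, the script name is an illustrative placeholder; use the NEO4J script provided in the Execution folder.)

python3 neo4j_benchmark.py queries/paths_cypher.txt 100000 neo4j_paths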