LEMMING is an ExaMple MImickiNg graph Generator
The classes responsible for loading the RDF graphs are in the package org.aksw.simba.lemming.creation. The graphs are first read from file and then converted to coloured graphs by GraphCreator.java.
The mimic graph is initialized based on the target graph's metrics. All the generator types are located under org.aksw.simba.lemming.mimicgraph.generator.
In GraphOptimization.java, each iteration creates two candidate graphs from the generated graph: one by adding an edge and one by removing an edge. The error score is computed for both candidates, and the one with the lower error score is chosen for the next iteration. The optimization stops when either the maximum number of iterations has been reached or no improvement has been found on the graph during the past 5 000 iterations.
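The loop described above can be sketched roughly as follows. This is a toy illustration, not the code in GraphOptimization.java: the class name `OptimizationSketch`, the edge encoding, the random edge selection and the rule of keeping a candidate only when it improves on the current error are all assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.function.ToDoubleFunction;

/** Hypothetical sketch of an add/remove-edge optimization loop. */
public class OptimizationSketch {

    /**
     * Edges are encoded as from * numVertices + to (an assumption of this sketch).
     * The error function stands in for the metric-based error score.
     */
    public static Set<Long> optimize(Set<Long> start, int numVertices,
            ToDoubleFunction<Set<Long>> error, int maxIterations,
            int maxNoImprovement, long seed) {
        Random rnd = new Random(seed);
        Set<Long> current = new HashSet<>(start);
        double currentError = error.applyAsDouble(current);
        int sinceImprovement = 0;

        for (int i = 0; i < maxIterations && sinceImprovement < maxNoImprovement; i++) {
            // Candidate 1: the current graph plus one random edge.
            Set<Long> added = new HashSet<>(current);
            added.add((long) rnd.nextInt(numVertices) * numVertices + rnd.nextInt(numVertices));

            // Candidate 2: the current graph minus one random existing edge.
            Set<Long> removed = new HashSet<>(current);
            if (!removed.isEmpty()) {
                Long[] edges = removed.toArray(new Long[0]);
                removed.remove(edges[rnd.nextInt(edges.length)]);
            }

            // Score both candidates and pick the one with the lower error.
            double addError = error.applyAsDouble(added);
            double removeError = error.applyAsDouble(removed);
            Set<Long> best = addError <= removeError ? added : removed;
            double bestError = Math.min(addError, removeError);

            if (bestError < currentError) {
                current = best;
                currentError = bestError;
                sinceImprovement = 0;
            } else {
                sinceImprovement++; // counts towards the no-improvement stop criterion
            }
        }
        return current;
    }
}
```

With the settings described above, `maxNoImprovement` would be 5 000 and `maxIterations` would come from the -op parameter.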
The optimized graph is finalized as a real-world RDF graph in GraphLexicalization.java by rendering all the resources' IRIs.
Place the files present in https://hobbitdata.informatik.uni-leipzig.de/lemming/resources.zip and in the Input graphs/ folder of https://hobbitdata.informatik.uni-leipzig.de/lemming/Experiments_data.zip under lemming's directory.
First, the metrics need to be computed on all available graphs of the chosen dataset, which are found under Experiments_data/Input graphs. The pre-computation is started by passing the dataset as an argument:
mvn exec:java -Dexec.mainClass="org.aksw.simba.lemming.tools.PrecomputingValues" -Dexec.args="pg"
This will produce a file named value_store.val to be used during graph generation. It is recommended to move or rename the previous metrics store before re-running the store generation.
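Moving the previous store aside can be done by hand; as a minimal sketch, the helper below renames an existing value_store.val by appending a suffix. The class name `ValueStoreBackup` and the suffix scheme are hypothetical and not part of LEMMING.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that moves an existing metrics store out of the way. */
public class ValueStoreBackup {

    /**
     * Renames store to "<name>.<suffix>" in the same directory.
     * Returns the new path, or null if there was no store to move.
     * Fails with FileAlreadyExistsException instead of overwriting an older backup.
     */
    public static Path backup(Path store, String suffix) throws IOException {
        if (!Files.exists(store)) {
            return null;
        }
        Path target = store.resolveSibling(store.getFileName() + "." + suffix);
        return Files.move(store, target);
    }
}
```

For example, `ValueStoreBackup.backup(Paths.get("value_store.val"), "pg")` would keep the store computed for the pg dataset as value_store.val.pg.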
Parameter | Required | Default | Description |
---|---|---|---|
-ds | True | NA | Dataset {pg, swdf, lgeo, geology} |
-nv | True | NA | Desired number of vertices in the generated graph (number of vertices of the target graph) |
-t | False | R | Type of graph generator {R, RR, C, CD, D, DD} |
-l | False | Initialized_MimicGraph.ser | File path where to save the initialized mimic graph. If a graph already exists there, the mimic graph generation will be skipped and loaded from file instead. |
-s | False | System.currentTimeMillis() | Seed for results reproduction. |
-thrs | False | availableProcessors*4 | Number of threads |
-op | False | 50 000 | Number of optimization iterations |
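To make the defaults in the table concrete, here is a small sketch of how the options could be resolved from a flag/value argument list. It is not LEMMING's actual argument parser; the class name `GenerationOptions` and the map-based representation are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical resolver that applies the documented defaults to flag/value pairs. */
public class GenerationOptions {

    public static Map<String, String> parse(String[] args) {
        Map<String, String> opts = new HashMap<>();
        // Defaults taken from the parameter table.
        opts.put("-t", "R");
        opts.put("-l", "Initialized_MimicGraph.ser");
        opts.put("-s", Long.toString(System.currentTimeMillis()));
        opts.put("-thrs", Integer.toString(Runtime.getRuntime().availableProcessors() * 4));
        opts.put("-op", "50000");

        // Explicit arguments override the defaults.
        for (int i = 0; i + 1 < args.length; i += 2) {
            opts.put(args[i], args[i + 1]);
        }

        // -ds and -nv have no default and must be given.
        if (!opts.containsKey("-ds") || !opts.containsKey("-nv")) {
            throw new IllegalArgumentException("-ds and -nv are required");
        }
        return opts;
    }
}
```

Because the default seed is the current time, a run is only reproducible when -s is set explicitly.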
To run the graph generation, you can use the Maven exec plugin:
mvn exec:java -Dexec.mainClass="org.aksw.simba.lemming.tools.GraphGenerationTest" -Dexec.args="-ds pg -nv 792923 -t R -op 30000"
To run the graph generation for the baseline generator, use:
mvn exec:java -Dexec.mainClass="org.aksw.simba.lemming.tools.BuildBaselineGraph" -Dexec.args="-ds pg -nv 792923"
You should move the target graph before starting the graph generation. The target graph, also called the held-out graph, is usually the latest graph of the versioned dataset.
The metrics pre-computation step also gives you the number of vertices of the target graph, which serves as an input to the graph generation. The table below lists the currently accepted datasets and the number of vertices of their target graphs.
Dataset | No. vertices | Folder | Description | Target graph |
---|---|---|---|---|
pg | 792 923 | PersonGraph/ | Person Graph (subset of DBpedia) | 2016-10 |
swdf | 45 420 | SemanticWebDogFood/ | Semantic Web Dog Food | 2015 |
lgeo | 591 649 | LinkedGeoGraphs/ | Linked Geo Data | 2015 |
geology | 1 281 | GeologyGraphs/ | International Chronostratigraphic Chart | 2018-1 |
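As an illustration of how the table feeds into the generation command, the sketch below maps each dataset key to its target vertex count and builds the corresponding -Dexec.args string. The class name `DatasetArgs` is hypothetical; the values are copied from the table above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical lookup from dataset key to the -Dexec.args value for graph generation. */
public class DatasetArgs {

    private static final Map<String, Integer> TARGET_VERTICES = new LinkedHashMap<>();
    static {
        TARGET_VERTICES.put("pg", 792923);
        TARGET_VERTICES.put("swdf", 45420);
        TARGET_VERTICES.put("lgeo", 591649);
        TARGET_VERTICES.put("geology", 1281);
    }

    /** Builds the argument string for the given dataset and generator type. */
    public static String execArgs(String dataset, String generatorType) {
        Integer vertices = TARGET_VERTICES.get(dataset);
        if (vertices == null) {
            throw new IllegalArgumentException("Unknown dataset: " + dataset);
        }
        return "-ds " + dataset + " -nv " + vertices + " -t " + generatorType;
    }
}
```

The result of `DatasetArgs.execArgs("swdf", "RR")` can be pasted into the -Dexec.args option of the mvn exec:java call.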
You can use our script to generate the graphs for all generator types by specifying the dataset: ./run_dataset.sh pg. Before starting or switching datasets, make sure you have the right value_store.val file.
The values of the metrics and constant expressions can be found in LemmingEx.result.
The triple store benchmark was run with IGUANA on the Virtuoso, Apache Jena Fuseki, GraphDB and Blazegraph triple stores. You can find the queries used for each dataset under Experiments_data/IGUANA experiments/queries. The benchmarking should be run for each of the generated graphs and for the target graph. Please note that the target graph in this step should be the pre-processed one (after type inference and materialization).
IGUANA produces an N-Triples file with the metrics of interest: Query Mixes Per Hour (QMPH), No. Queries Per Hour (NoQPH) and Queries Per Second (QPS).
We also have scripts that manage the lifecycle of the triple stores, upload the graphs to the triple store and start IGUANA. The scripts may need changes depending on where the triple stores' binary files are installed. To use them, you need to specify the folder where the graphs are located:
./exec_all.sh /home/lemming/generated_graphs/
Internally, Lemming is using the Grph library.
For testing, we are using the email-Eu-core network published by Stanford University. It has been transformed into a simple RDF file.
The Lemming logo has been created by TortugaAttack.
@inproceedings{roeder2021lemming,
author = {R{\"o}der, Michael and Nguyen, Pham Thuy Sy and Conrads, Felix and da Silva, Ana Alexandra Morim and Ngomo, Axel-Cyrille Ngonga},
booktitle = {Proceedings of the 15th IEEE International Conference on Semantic Computing (ICSC)},
doi = {10.1109/ICSC50631.2021.00015},
pages = {62-69},
publisher = {IEEE Computer Society},
title = {LEMMING -- Example-based Mimicking of Knowledge Graphs},
url = {https://doi.org/10.1109/ICSC50631.2021.00015},
year = 2021
}