dbblumenthal / gedlib

An easily extensible C++ library for (suboptimally) computing the graph edit distance between attributed graphs.
GNU Lesser General Public License v3.0
57 stars 15 forks source link

GEDLIB (1.0)

1. About GEDLIB

GEDLIB is a C++ library for (suboptimally) computing edit distances between graphs using various state-of-the-art methods. GEDLIB allows you to build your graphs via the C++ API or load them from GXL files. Several benchmark datasets are distributed with GEDLIB. For these datasets, GEDLIB provides predefined edit cost functions. You can easily extend GEDLIB by implementing new edit cost functions or new methods for computing the graph edit distance. An extensive Doxygen documentation is availabe here.

2. License and Citing

The source code of GEDLIB is distributed under the GNU Lesser General Public License. If you want to use GEDLIB in a publication, please refer to the following papers:

3. Installation under Unix

GEDLIB uses the following external libraries:

After having installed CMake, Doxygen, OpenMP (under Mac OS), and, possibly, Gurobi, execute the script install.py for installing GEDLIB and the external libraries distributed with GEDLIB:

python install.py [--help] [-h] [--doc] [--tests <arg>] [--gurobi <GUROBI_ROOT>] [--debug] [--clean] [--lib gxl|<indentifier>,<UserNodeID>,<UserNodeLabel>,<UserEdgeLabel>]

If you execute install.py without any arguments, only the external libraries distributed with GEDLIB are installed.

4. Building an Application that Uses GEDLIB

4.1 Building an Application that Uses GEDLIB as Header-Only Library

For building an application that uses GEDLIB as a header-only library, it suffices to execute the installation script install.py without any options. Subsequently, carry out the following steps:

4.2 Building an Application that Uses GEDLIB as Shared Library

4.2.1 Graphs Given as GXL Files

If you want to build an application that uses GEDLIB as a shared for graphs given as GXL files, make sure that you have installed GEDLIB with the option --lib gxl. Subsequently, carry out the folowing steps:

4.2.2 Graphs with User-Defined Node ID, Node Label, and Edge Label Types

If you want to build an application that uses GEDLIB as a shared library for graphs with custom node ID, node label, and edge label types, make sure that you have installed GEDLIB with the option --lib <indentifier>,<UserNodeID>,<UserNodeLabel>,<UserEdgeLabel>. Subsequently, carry out the folowing steps:

5. Using GEDLIB

5.1 General Usage

In your source file, include the header src/env/ged_env.hpp and create your environment object:

#include "src/env/ged_env.hpp"
ged::GEDEnv<UserNodeID, UserNodeLabel, UserEdgeLabel> env;

All functionality of GEDLIB is accessible via the environment class ged::GEDEnv, whose template parameters UserNodeID, UserNodeLabel, and UserEdgeLabel must be set to the types of the node IDs, the node labels, and the edge labels of your graphs. Use the member functions ged::GEDEnv::add_graph(), ged::GEDEnv::add_node(), and ged::GEDEnv::add_edge(), and ged::GEDEnv::set_edit_costs() for adding graphs and edit costs to the environment. Once you're done with this, call ged::GEDEnv::init() to initialize the environment.

When you have initialized the environment, you are ready to run the GED methods. To select the method, call ged::GEDEnv::set_method(). You can then initialize the selected method via ged::GEDEnv::init_method(), and run it via ged::GEDEnv::run_method(). Note that most method can be run also without initialization, but doing so typically has a negative impact on both runtime and accuracy. Use the member functions ged::GEDEnv::get_lower_bound(), ged::GEDEnv::get_upper_bound(), ged::GEDEnv::get_node_map(), ged::GEDEnv::get_runtime(), ged::GEDEnv::get_init_time(), ged::GEDEnv::get_graph_class(), and ged::GEDEnv::get_graph_name() for accessing the results of your GED computations. Use ged::GEDEnv::quasimetric_costs() to check if your edit costs are quasimetric on the graphs contained in your environment.

5.2 Additional Functionality for Graphs Given as GXL Files

If you use GEDLIB with the template parameter UserNodeID set to ged::GXLNodeID a.k.a. std::string and the template parameters UserNodeLabel and UserEdgeLabel set to ged::GXLLabel a.k.a. std::map<std::string,std::string>, GEDLIB offers additional functionality for loading graphs given in the GXL file format. For those graphs, you do not have to use the member functions ged::GEDEnv::add_graph(), ged::GEDEnv::add_node(), and ged::GEDEnv::add_edge() for adding graphs to your environment. Instead, you can simply load all of them at once by one call to ged::GEDEnv::load_gxl_graphs().

5.3 Using GEDLIB as a Shared Library

5.3.1 Graphs Given as GXL Files

If you want to use GEDLIB as a shared library for graphs given as GXL files, make sure that you have installed GEDLIB with the option --lib gxl. In your source file, you have to define GXL_GEDLIB_SHARED before including src/env/ged_env.hpp:

  #define GXL_GEDLIB_SHARED  
  #include "src/env/ged_env.hpp"  
  ged::GEDEnv<ged::GXLNodeID, ged::GXLLabel, ged::GXLLabel> env  
  // ...

5.3.2 Graphs with User-Defined Node ID, Node Label, and Edge Label Types

If you want to use GEDLIB as a shared library for graphs with custom node ID, node label, and edge label types, make sure that you have installed GEDLIB with the option --lib <indentifier>,<UserNodeID>,<UserNodeLabel>,<UserEdgeLabel>. In your source file, you have to define <IDENTIFIER>_GEDLIB_SHARED before including src/env/ged_env.hpp, where <IDENTIFIER> is the upper case transformation of <identifier>. For example, if you have installed GEDLIB by executing $ python install.py --lib mytypes,int,double,double, you have to do the following:

  #define MYTYPES_GEDLIB_SHARED  
  #include "src/env/ged_env.hpp"  
  ged::GEDEnv<int, double, double> env  
  // ...

5.4 Examples

For an extensively commented example of how to use GEDLIB, have a look at median/tests/median_letter_demo.cpp. For more examples, see the .cpp files contained in the sub-directories of the tests/ directory.

6. Reproducability Packages

GEDLIB has been used for several research papers. For reproducing the experiments reported in these papers, follow the instructions below.

D. B. Blumenthal, S. Bougleux, J. Gamper, and L. Brun. “Ring based approximation of graph edit distance”, S+SSPR 2018, vol. 11004 of LNCS, pp. 293-303, https://doi.org/10.1007/978-3-319-97785-0_28

In order to reproduce the experiments reported in this paper, install GEDLIB with the option --tests sspr2018. After installation, open a shell and execute the following commands:

$ cd <GEDLIB_ROOT>/tests/sspr2018/bin
$ ./learn_ring_params
$ ./learn_subgraph_depths
$ ./learn_walks_depth
$ ./test_lsape_based_methods

After having executed these commands, the results of the experiments are contained in the folder tests/sspr2018/output/.

D. B. Blumenthal, N. Boria, J. Gamper, S. Bougleux, and L. Brun. “Comparing heuristics for graph edit distance computation”, VLDB J. 29(1), pp. 419-458, 2020, https://doi.org/10.1007/s00778-019-00544-1

In order to reproduce the experiments reported in this paper, install GEDLIB with the options --tests vldbj2020 and --gurobi <GUROBI_ROOT>. After installation, open a shell and execute the following commands:

$ cd <GEDLIB_ROOT>/tests/vldbj2020/bin
$ ./vldbj_train_subgraph
$ ./vldbj_train_walks
$ ./vldbj_train_ring
$ ./vldbj_train_ml
$ ./vldbj_test_lsape_based_methods
$ ./vldbj_test_lp_based_methods
$ ./vldbj_test_ls_based_methods
$ ./vldbj_test_misc_methods
$ ./vldbj_test_best_methods

After having executed these commands, the results of the experiments are contained in the folder tests/vldbj2020/results/. For creating TikZ figures and tables that visualize the results, run the script process_results.py.

7. Datasets

GEDLIB comes with several datassets which contain graphs given in the GXL file format. They are contained in the following subdirectories of the directory data/datasets/:

For each dataset, the directory data/collections/ contains an XML file which lists the contained graphs' GXL files along with their classes. These files match the document type definition data/collections/GraphCollection.dtd and can hence be used as input for ged::GEDEnv::load_gxl_graphs(). The Python script data/collections/sample.py can be used to generate samples of the datasets.

8. Directory Structure

After executing install.py, the directoy <GEDLIB_ROOT> has the following internal structure:

.
├── CMakeLists.txt --------------- used for building GEDLIB
├── doxyfile.in ------------------ used for creating the documentation
├── LICENSE.md ------------------- a copy of GNU LGPL
├── README.md -------------------- this file
├── install.py ------------------- installs GEDLIB
├── _data  
|   ├── _datasets ---------------- contains several datasets
|   |   └── ... 
|   └── _collections ------------- contains XML files that list the graphs in the datasets
|       ├── GraphCollection.dtd -- document type definition for graph collections 
|       ├── sample.py ------------ generates a sample from a collection file
|       └── ... 
├── _ext ------------------------- contains external libraries
|   └── ...
├── _docs ------------------------ contains documentation if GEDLIB has been installed with option --doc
|   └── ...
├── _lib ------------------------- contains shared libraries if GEDLIB has been installed with option --lib
|   └── ...
├── _src ------------------------- contains the sources of GEDLIB  
|   ├── CMakeLists.txt ----------- used for building GEDLIB
|   ├── _edit_costs -------------- contains edit costs for several datasets
|   |   └── ... 
|   ├── _env --------------------- contains the architecture of GEDLIB
|   |   ├── ged_env.hpp ---------- include this header in your application
|   |   └── ... 
|   ├── _methods ----------------- contains various methods for computing GED 
|   |   └── ... 
|   └── _util -------------------- contains utility classes and functions used by the GED methods
|       └── ... 
├── _median ---------------------- contains median graph computation, clustering, and indexing
|   ├── CMakeLists.txt ----------- used for building the executables
|   ├── _bin --------------------- contains the executables if GEDLIB has been built with option --median
|   |   └── ...
|   ├── _collections ------------- contains graph collections used by the median graph computation
|   |   └── ...
|   ├── _output ------------------ contains the median graphs once the executables have been run
|   |   └── ...
|   ├── _src --------------------- contains the median graphs once the executables have been run
|   |   └── ...
|   └── _tests ------------------- contains tests
|       └── ...
└── _tests ----------------------- contains various tests
    └── ...