eXascaleInfolab / PyCABeM

Python Benchmarking Framework for the Clustering Algorithms Evaluation: networks generation and shuffling; failover execution and resource consumption tracing (peak RAM RSS, CPU, ...); evaluation of Modularity, conductance, NMI and F1 Score for overlapping communities

:heavy_exclamation_mark: ATTENTION: the new version of PyCABeM, based on PyExPool v.3+ with a WebUI, has been rebranded to Clubmark (demonstrated and published at ICDM18) :heavy_exclamation_mark:


PyCABeM (formerly HiCBeM) - Python Benchmarking Framework for the Clustering Algorithms Evaluation

\brief Uses extrinsic (NMIs) and intrinsic (Q) measures for the evaluation of clustering quality, considering overlaps (membership of nodes in multiple clusters)
\author: (c) Artem Lutov artem@exascale.info
\organizations: eXascale Infolab, Lumais, ScienceWise
\keywords: overlapping clustering benchmarking, community detection benchmarking, algorithms benchmarking framework.

Content

Functionality
Dependencies
Prerequisites
Usage
Usage Examples
Benchmark Structure
Extension
Related Projects

Functionality

Generic Benchmarking Framework

It is possible to have multiple input directories with similarly named files inside, which represent different instances / snapshots of the datasets. In that case, the output results are provided per snapshot plus an aggregated weighted average over all snapshots. This is useful to avoid occasional bias toward a specific instance and to analyze evolving networks.
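
For instance, such a weighted average over the snapshots could be computed as follows (a minimal illustrative sketch with a hypothetical aggregate_snapshots helper, not the framework's actual aggregation code):

def aggregate_snapshots(snapshot_results):
    """Aggregate a per-snapshot measure into a weighted average.

    snapshot_results  - list of (value, weight) pairs, e.g. the measure value
        of each snapshot weighted by its number of successfully evaluated shuffles
    return  - weighted average over all snapshots, or None if nothing was evaluated
    """
    wsum = sum(weight for _, weight in snapshot_results)
    if not wsum:
        return None
    return sum(value * weight for value, weight in snapshot_results) / wsum

# Example: three snapshots evaluated on 5, 5 and 4 shuffles respectively
print(aggregate_snapshots([(0.41, 5), (0.39, 5), (0.44, 4)]))  # ~0.411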

If a measured application crashes, the crash is logged and has no impact on the execution of the remaining applications.
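
A minimal sketch of this crash isolation, assuming Python 3's subprocess timeout support (run_app is a hypothetical helper; the actual framework schedules jobs via its execution pool):

import subprocess

def run_app(cmd, logname, timeout=None):
    """Run a single benchmarked application, logging its output and any failure.

    cmd      - command line as a list of arguments
    logname  - base name of the .log / .err files for this application
    timeout  - timeout in seconds or None for no timeout
    return   - True on success, False if the application crashed or timed out
    """
    with open(logname + '.log', 'w') as fout, open(logname + '.err', 'w') as ferr:
        try:
            subprocess.check_call(cmd, stdout=fout, stderr=ferr, timeout=timeout)
            return True
        except subprocess.TimeoutExpired:
            ferr.write('ERROR, timeout of {} sec expired\n'.format(timeout))
        except (subprocess.CalledProcessError, OSError) as err:
            ferr.write('ERROR, execution failed: {}\n'.format(err))
    return False  # the crash is logged and the remaining applications keep running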

Benchmark of the Hierarchical Overlapping Clustering Algorithms

The benchmark is implemented as a customization of the Generic Benchmarking Framework to evaluate Hierarchical Overlapping Clustering Algorithms:

Features \ Algs HiReCS SCP Louvain Oslom2 GANXiS
Hierarchical + + +
Multi-scale + + + + +
Deterministic + +
With Overlaps + + + +
Parameter-Free + +

All results and traces are stored in the corresponding files even in case of internal (crash) or external termination of the benchmarking applications or of the whole framework.

Note: valuable extensions of the employed external applications are uploaded into ./contrib/

Basically, the framework executes a set of applications on the specified datasets in interactive or daemon mode, logging resource consumption, output and exceptions, providing workflow management (termination by timeout, resistance to exceptions, etc.) and aggregating the results.
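
For instance, the execution time, CPU times and peak RSS of a child process (the values stored in the .rcp traces described in the Benchmark Structure section) can be gathered with standard Unix facilities (an illustrative sketch, not the framework's actual tracing code):

import os
import subprocess
import time

def trace_run(cmd):
    """Run cmd and return (exec_time_sec, cpu_usr_sec, cpu_kern_sec, rss_peak_mb).

    Unix-only: relies on os.wait4() to fetch the resource usage of the child process.
    """
    tstart = time.time()
    proc = subprocess.Popen(cmd)
    _, _, rusage = os.wait4(proc.pid, 0)  # block until completion, fetch rusage
    exectime = time.time() - tstart
    # ru_maxrss is reported in KiB on Linux
    return exectime, rusage.ru_utime, rusage.ru_stime, rusage.ru_maxrss / 1024.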

Dependencies

Prerequisites

Be sure that the operating system allows working with a large number of open files:

To set fs.file-max permanently, add the following line to /etc/sysctl.conf:

fs.file-max = 1048576

and then reload it with # sysctl -p.
To set the ulimit permanently, add the following lines to /etc/security/limits.conf:

*               hard    nofile          524288
*               soft    nofile          4096  

Then execute ulimit -n 65536 to set this value for the current session.
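
The soft limit can also be verified and raised from Python before launching the benchmark (a small sketch using the standard resource module; not part of the benchmark itself):

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 65536
if soft < target:
    # The soft limit can be raised only up to the configured hard limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(target, hard), hard))
print(resource.getrlimit(resource.RLIMIT_NOFILE))  # (soft, hard) after the update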

Fundamental

Note: it is recommended to run the benchmark itself under PyPy. The measured algorithms can be run either using the same Python interpreter or under a dedicated interpreter / script / executable.

Libraries

External tools that are used as executables

Usage

Note: execution of the benchmark has been verified only on Linux Ubuntu 14.04 x64, but it should work on any platform provided that the corresponding external executables (algorithms, NMI evaluation apps, etc.) are available for that platform.

To see the possible input parameters, run the benchmark without arguments ($ ./benchmark.py):

$ ./benchmark.py 
Usage: ./benchmark.py [-g[f][=[<number>][.<shuffles_number>][=<outpdir>]] [-c[f][r]] [-a="app1 app2 ..."] [-r] [-e[n][s][e][m]] [-d[g]{a,s}=<datasets_dir>] [-f[g]{a,s}=<dataset>] [-t[{s,m,h}]=<timeout>]
Parameters:
  -g[f][=[<number>][.<shuffles_number>][=<outpdir>]]  - generate <number> (5 by default) >= 0 synthetic datasets in the <outpdir> ("syntnets/" by default), shuffling each <shuffles_number> (0 by default) >= 0 times. If <number> is omitted or set to 0 then ONLY shuffling of the specified datasets should be performed including the <outpdir>/networks//*.
    Xf  - force the generation even when the data already exists (existing datasets are moved to backup)
  NOTE:
    - shuffled datasets have the following naming format: <base_name>[^<instance_index>][(seppars)<param1>...][.<shuffle_index>].<net_extension>
    - use "-g0" to execute existing synthetic networks not changing them
  -c[X]  - convert existing networks into the .hig, .lig, etc. formats
    Xf  - force the conversion even when the data already exists
    Xr  - resolve (remove) duplicated links on conversion. Note: it is recommended to use this option
  NOTE: files with .nsa are looked for in the specified dirs to be converted
  -a="app1 app2 ..."  - apps (clustering algorithms) to run/benchmark among the implemented. Available: scp louvain_igraph randcommuns hirecs oslom2 ganxis. Impacts {r, e} options. Optional, all apps are executed by default.
  NOTE: output results are stored in the "algorithms/<algname>outp/" directory
  -r  - run the benchmarking apps on the prepared data
  -e[X]  - evaluate quality of the results. Default: apply all measurements
    Xn  - evaluate results accuracy using NMI measure for overlapping communities
    Xs  - evaluate results accuracy using NMI_s measure for overlapping communities
    Xe  - evaluate results accuracy using extrinsic measures (both NMIs) for overlapping communities (same as Xns)
    Xm  - evaluate results quality by modularity
  -d[X]=<datasets_dir>  - directory of the datasets.
  -f[X]=<dataset>  - dataset (network, graph) file name.
    Xg  - generate a directory named after the network file (without extension) for each input network (*.nsa) when shuffling is performed (to avoid flooding the base directory with network shuffles). Previously existing shuffles are backed up
    Xa  - the dataset is specified by asymmetric links (in/outbound weights of the link might differ), arcs
    Xs  - the dataset is specified by symmetric links, edges. Default option
    NOTE:
    - dataset file names must not contain "." (besides the extension), because it is used as an indicator of the shuffled datasets
    - paths can contain wildcards: *, ?, +
    - multiple directories and files can be specified via multiple -d/f options (one per item)
    - datasets should have the following format: <node_src> <node_dest> [<weight>]
    - {a,s} is considered only if the network file has no corresponding metadata (formats like SNAP, ncol, nsa, ...)
    - ambiguity of links weight resolution in case of duplicates (or edges specified in both directions) is up to the clustering algorithm
  -t[X]=<float_number>  - specifies timeout for each benchmarking application per single evaluation on each network in sec, min or hours. Default: 0 sec  - no timeout
    Xs  - time in seconds. Default option
    Xm  - time in minutes
    Xh  - time in hours

Usage Examples

Synthetic networks generation, clustering algorithms execution and evaluation

$ pypy ./benchmark.py -g=3.2=syntnets_i3_s4 -cr -a="scp oslom2" -r -emn -tm=90

Run the benchmark under PyPy.
Generate synthetic networks producing 3 instances of each network with 2 shuffles (random reordering of network nodes) of each instance, yielding 3*2=6 synthetic networks of each type (for each set of network generation parameters). Generated networks are stored in the ./syntnets_i3_s4/ directory.
Convert all networks into the .hig format, resolving duplicated links. This conversion is required to evaluate the modularity measure.
Run scp and oslom2 clustering algorithms for each generated network and evaluate modularity and NMI measures for these algorithms.
Timeout is 90 min for each task of each network processing, where the tasks are: network generation, clustering and evaluation by each specified measure. A network here is each shuffle of each instance of each network type.

Shuffling existing network instances, clustering algorithm execution and evaluation

$ ./benchmark.py -g=.4 -d=syntnets_i3_s4 -a=oslom2 -es -th=1

Run the benchmark for the networks located in ./syntnets_i3_s4/ directory.
Produce 4 shuffles of the specified networks; previously existing shuffles are backed up.
Run the oslom2 clustering algorithm for the specified networks with their shuffles and evaluate the NMI_s measure.
Timeout is 1 hour for each task on each network.

Aggregation of the specified evaluation results

$ pypy benchmark.py -s=results/scp/mod/*.mod

Results aggregation is performed with automatic identification of the target clustering algorithm and evaluation measure from the specified path. It is performed automatically as the last step of the algorithm evaluation, but it can also be called manually for a modified scope.
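
A manual pass over such a scope could look like the following sketch (hypothetical, assuming each .mod file lists one Q value per line as shown in the Benchmark Structure section below; the built-in aggregation of benchmark.py is more elaborate):

import glob

def aggregate_mod(pattern):
    """Average the Q values over all .mod files matching the glob pattern.

    pattern  - glob pattern such as 'results/scp/mod/*.mod'
    return   - dict mapping the network name to its average Q over all levels and shuffles
    """
    avgq = {}
    for fname in glob.glob(pattern):
        with open(fname) as fmod:
            vals = [float(line.split()[0]) for line in fmod
                if line.strip() and not line.startswith('#')]
        if vals:
            netname = fname.rsplit('/', 1)[-1].rsplit('.', 1)[0]
            avgq[netname] = sum(vals) / len(vals)
    return avgq

print(aggregate_mod('results/scp/mod/*.mod'))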

Benchmark Structure

Example of the <entity>.rcp format:

# ExecTime(sec) CPU_time(sec)   CPU_usr(sec)    CPU_kern(sec)   RSS_RAM_peak(Mb)    TaskName
2.575555    2.574302    2.540420    0.033882    6.082   5K5
0.528582    0.528704    0.519277    0.009427    3.711   2K10
...
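
Such a trace can be loaded with a few lines of Python (a hypothetical parsing sketch, not part of the framework):

def load_rcp(fname):
    """Load an <entity>.rcp resource consumption trace.

    return  - list of dicts with the execution time, CPU times, peak RSS RAM and task name
    """
    fields = ('exectime', 'cputime', 'cpuusr', 'cpukern', 'rsspeak', 'task')
    rows = []
    with open(fname) as frcp:
        for line in frcp:
            line = line.strip()
            if not line or line.startswith('#'):  # skip the header and empty lines
                continue
            vals = line.split()
            rows.append(dict(zip(fields, [float(v) for v in vals[:-1]] + [vals[-1]])))
    return rows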

Example of the .res format:

# --- 2015-12-31 16:15:37.693514, output:  Q_avg
# <network> ganxis  louvain_igraph  ...
karate  0.130950    0.414481    0.233974    0.240929
jazz_u  0.330844    0.400587    0.392081    0.292395
...

Example of the .resx format:

# --- 2015-12-31 17:05:50.582245 ---
# <network>
#   <alg1_outp>
#   <alg2_outp>
#   ...
karate
    ganxis> Q: 0.130950 (0.084073 .. 0.217867), s: 0.163688, count: 5, fails: 0, d(shuf): 0.133794, s(shuf): 0.0566965, count(shuf): 5, fails(shuf): 0
    louvain_igraph> Q: 0.414481 (0.395217 .. 0.419790), s: 0.518101, count: 5, fails: 0, d(shuf): 0.024573, s(shuf): 0.0120524, count(shuf): 5, fails(shuf): 0
    ...
jazz_u
    ganxis> Q: 0.340728 (0.321374 .. 0.371617), s: 0.42591, count: 5, fails: 0, d(shuf): 0.050243, s(shuf): 0.0219596, count(shuf): 5, fails(shuf): 0
    louvain_igraph> Q: 0.400587 (0.399932 .. 0.400999), s: 0.534116, count: 4, fails: 0, d(shuf): 0.001067, s(shuf): 0.000595067, count(shuf): 4, fails(shuf): 0
    ...
...

Example of the <net_instance>.nmi[_s] format:

# NMI   level[/shuffle]
0.815814    0
0.870791    1
0.882737    0/1
...

Example of the <net_instance>.mod format:

# Q level[/shuffle]
0.333874    1
0.32539 0
0.313085    0/1
...
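
Both the .nmi[_s] and .mod files hold one value per hierarchy level (and per shuffle), so picking the best level is a short scan (a hypothetical sketch, not part of the framework):

def best_level(fname):
    """Return the (value, level) pair with the maximal measure value
    from a <net_instance>.mod or <net_instance>.nmi[_s] file."""
    best = None
    with open(fname) as fvals:
        for line in fvals:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            val, level = line.split(None, 1)
            if best is None or float(val) > best[0]:
                best = (float(val), level)
    return best

# For the .mod example above, best_level() yields (0.333874, '1')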

Extension

To add custom apps / algorithms to be benchmarked, just add a corresponding function for the "myalgorithm" app to benchapps.py:

def execMyalgorithm(execpool, netfile, asym, timeout, pathid='', selfexec=False):
    """Execute the algorithm (stub)

    execpool  - execution pool to perform execution of current task
    netfile  -  input network to be processed
    asym  - network links weights are asymmetric (in/outbound weights can be different)
    timeout  - execution timeout for this task
    pathid  - path id of the net to distinguish nets with the same name located in different dirs.
        Note: pathid is prepended with the separator symbol
    selfexec  - whether the current execution is an external call or an internal self-call

    return  - number of executions (jobs) made
    """

All the evaluations will be performed automatically; the algorithm just needs to follow the convention for the output of the execution results.
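
A minimal sketch of such a function is given below. It only illustrates the expected structure: the Job descriptor, its parameters and the execpool.execute() call are assumptions about a PyExPool-like interface rather than the exact API, and the output directory follows the "algorithms/<algname>outp/" convention mentioned above.

import os

def execMyalgorithm(execpool, netfile, asym, timeout, pathid='', selfexec=False):
    """Schedule "myalgorithm" on the specified network (illustrative sketch)"""
    # Task name: network file name without the extension, plus the optional path id
    taskname = os.path.splitext(os.path.basename(netfile))[0] + pathid
    outpdir = 'algorithms/myalgorithmoutp/'
    # Assumed PyExPool-like job descriptor and scheduling call; Job would be
    # imported from the benchmark's execution pool module (name omitted here)
    execpool.execute(Job(name=taskname,
        args=('./myalgorithm', '--directed' if asym else '--undirected', netfile, outpdir),
        timeout=timeout))
    return 1  # number of scheduled jobs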

Related Projects

If you are interested in this benchmark, please visit eXascale Infolab, where you can find other projects and research papers related to Big Data!