eXascaleInfolab / PyCABeM

Python Benchmarking Framework for the Clustering Algorithms Evaluation: networks generation and shuffling; failover execution and resource consumption tracing (peak RAM RSS, CPU, ...); evaluation of Modularity, conductance, NMI and F1 Score for overlapping communities

:heavy_exclamation_mark: ATTENTION: the new version of PyCABeM, based on PyExPool v.3+ with a WebUI, has been rebranded to Clubmark (demonstrated and published at ICDM18) :heavy_exclamation_mark:


PyCABeM (formerly HiCBeM) - Python Benchmarking Framework for the Clustering Algorithms Evaluation

\brief Uses extrinsic (NMIs) and intrinsic (Q) measures for the evaluation of clustering quality, considering overlaps (membership of nodes in multiple clusters)
\author: (c) Artem Lutov artem@exascale.info
\organizations: eXascale Infolab, Lumais, ScienceWise
\keywords: overlapping clustering benchmarking, community detection benchmarking, algorithms benchmarking framework.

Content

Functionality
Dependencies
Prerequisites
Usage
Usage Examples
Benchmark Structure
Extension
Related Projects

Functionality

Generic Benchmarking Framework

It is possible to have multiple input directories with similarly named files inside, which represent different instances / snapshots of the datasets. In that case, the output results are provided per snapshot plus an aggregated weighted average over all snapshots. This is useful to avoid occasional bias toward a specific instance and to analyze evolving networks.
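
For instance, such a weighted average over the snapshots could be computed as follows (a minimal illustrative sketch with a hypothetical aggregate_snapshots helper, not the framework's actual aggregation code):

def aggregate_snapshots(snapshot_results):
    """Aggregate a per-snapshot measure into a weighted average.

    snapshot_results  - list of (value, weight) pairs, e.g. the measure value
        of each snapshot weighted by its number of successfully evaluated shuffles
    return  - weighted average over all snapshots, or None if nothing was evaluated
    """
    wsum = sum(weight for _, weight in snapshot_results)
    if not wsum:
        return None
    return sum(value * weight for value, weight in snapshot_results) / wsum

# Example: three snapshots evaluated on 5, 5 and 4 shuffles respectively
print(aggregate_snapshots([(0.41, 5), (0.39, 5), (0.44, 4)]))  # ~0.411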

If a measured application crashes, the crash is logged and has no impact on the execution of the remaining applications.
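
A minimal sketch of this crash isolation, assuming Python 3's subprocess timeout support (run_app is a hypothetical helper; the actual framework schedules jobs via its execution pool):

import subprocess

def run_app(cmd, logname, timeout=None):
    """Run a single benchmarked application, logging its output and any failure.

    cmd      - command line as a list of arguments
    logname  - base name of the .log / .err files for this application
    timeout  - timeout in seconds or None for no timeout
    return   - True on success, False if the application crashed or timed out
    """
    with open(logname + '.log', 'w') as fout, open(logname + '.err', 'w') as ferr:
        try:
            subprocess.check_call(cmd, stdout=fout, stderr=ferr, timeout=timeout)
            return True
        except subprocess.TimeoutExpired:
            ferr.write('ERROR, timeout of {} sec expired\n'.format(timeout))
        except (subprocess.CalledProcessError, OSError) as err:
            ferr.write('ERROR, execution failed: {}\n'.format(err))
    return False  # the crash is logged and the remaining applications keep running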

Benchmark of the Hierarchical Overlapping Clustering Algorithms

The benchmark is implemented as a customization of the Generic Benchmarking Framework to evaluate Hierarchical Overlapping Clustering Algorithms:

Features \ Algs HiReCS SCP Louvain Oslom2 GANXiS
Hierarchical + + +
Multi-scale + + + + +
Deterministic + +
With Overlaps + + + +
Parameter-Free + +

All results and traces are stored in the corresponding files even in case of internal (crash) or external termination of the benchmarking applications or of the whole framework.

Note: valuable extensions of the employed external applications are uploaded into ./contrib/

Basically, the framework executes a set of applications on the specified datasets in interactive or daemon mode, logging resource consumption, output and exceptions, providing workflow management (termination by timeout, resistance to exceptions, etc.) and aggregating the results.
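
For instance, the execution time, CPU times and peak RSS of a child process (the values stored in the .rcp traces described in the Benchmark Structure section) can be gathered with standard Unix facilities (an illustrative sketch, not the framework's actual tracing code):

import os
import subprocess
import time

def trace_run(cmd):
    """Run cmd and return (exec_time_sec, cpu_usr_sec, cpu_kern_sec, rss_peak_mb).

    Unix-only: relies on os.wait4() to fetch the resource usage of the child process.
    """
    tstart = time.time()
    proc = subprocess.Popen(cmd)
    _, _, rusage = os.wait4(proc.pid, 0)  # block until completion, fetch rusage
    exectime = time.time() - tstart
    # ru_maxrss is reported in KiB on Linux
    return exectime, rusage.ru_utime, rusage.ru_stime, rusage.ru_maxrss / 1024.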

Dependencies

Prerequisites

Be sure that the operating system allows working with a large number of open files:

To set fs.file-max permanently, add the following line to /etc/sysctl.conf:

fs.file-max = 1048576

and then reload it with # sysctl -p.
To set the ulimit permanently, add the following lines to /etc/security/limits.conf:

*               hard    nofile          524288
*               soft    nofile          4096  

Then execute ulimit -n 65536 to set this value for the current session.
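
The soft limit can also be verified and raised from Python before launching the benchmark (a small sketch using the standard resource module; not part of the benchmark itself):

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 65536
if soft < target:
    # The soft limit can be raised only up to the configured hard limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(target, hard), hard))
print(resource.getrlimit(resource.RLIMIT_NOFILE))  # (soft, hard) after the update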

Fundamental

Note: it is recommended to run the benchmark itself under PyPy. The measured algorithms can be run either using the same Python interpreter or under a dedicated interpreter / script / executable.

Libraries

External tools that are used as executables

Usage

Note: execution of the benchmark has been verified only on Linux Ubuntu 14.04 x64, but it should work on any platform provided that the corresponding external executables (algorithms, NMI evaluation apps, etc.) are available for that platform.

To see the possible input parameters, run the benchmark without arguments ($ ./benchmark.py):

$ ./benchmark.py 
Usage: ./benchmark.py [-g[f][=[<number>][.<shuffles_number>][=<outpdir>]] [-c[f][r]] [-a="app1 app2 ..."] [-r] [-e[n][s][e][m]] [-d[g]{a,s}=<datasets_dir>] [-f[g]{a,s}=<dataset>] [-t[{s,m,h}]=<timeout>]
Parameters:
  -g[f][=[<number>][.<shuffles_number>][=<outpdir>]]  - generate <number> (5 by default) >= 0 synthetic datasets in the <outpdir> ("syntnets/" by default), shuffling each <shuffles_number> (0 by default) >= 0 times. If <number> is omitted or set to 0 then ONLY shuffling of the specified datasets should be performed including the <outpdir>/networks//*.
    Xf  - force the generation even when the data already exists (existing datasets are moved to backup)
  NOTE:
    - shuffled datasets have the following naming format: <base_name>[^<instance_index>][(seppars)<param1>...][.<shuffle_index>].<net_extension>
    - use "-g0" to execute existing synthetic networks not changing them
  -c[X]  - convert existing networks into the .hig, .lig, etc. formats
    Xf  - force the conversion even when the data already exists
    Xr  - resolve (remove) duplicated links on conversion. Note: it is recommended to use this option
  NOTE: files with .nsa are looked for in the specified dirs to be converted
  -a="app1 app2 ..."  - apps (clustering algorithms) to run/benchmark among the implemented. Available: scp louvain_igraph randcommuns hirecs oslom2 ganxis. Impacts {r, e} options. Optional, all apps are executed by default.
  NOTE: output results are stored in the "algorithms/<algname>outp/" directory
  -r  - run the benchmarking apps on the prepared data
  -e[X]  - evaluate quality of the results. Default: apply all measurements
    Xn  - evaluate results accuracy using NMI measure for overlapping communities
    Xs  - evaluate results accuracy using NMI_s measure for overlapping communities
    Xe  - evaluate results accuracy using extrinsic measures (both NMIs) for overlapping communities (same as Xns)
    Xm  - evaluate results quality by modularity
  -d[X]=<datasets_dir>  - directory of the datasets.
  -f[X]=<dataset>  - dataset (network, graph) file name.
    Xg  - generate a directory named after the network file (without extension) for each input network (*.nsa) when shuffling is performed (to avoid flooding the base directory with network shuffles). Previously existing shuffles are backed up
    Xa  - the dataset is specified by asymmetric links (in/outbound weights of the link might differ), arcs
    Xs  - the dataset is specified by symmetric links, edges. Default option
    NOTE:
    - dataset file names must not contain "." (besides the extension), because it is used as an indicator of the shuffled datasets
    - paths can contain wildcards: *, ?, +
    - multiple directories and files can be specified via multiple -d/f options (one per item)
    - datasets should have the following format: <node_src> <node_dest> [<weight>]
    - {a,s} is considered only if the network file has no corresponding metadata (formats like SNAP, ncol, nsa, ...)
    - ambiguity of links weight resolution in case of duplicates (or edges specified in both directions) is up to the clustering algorithm
  -t[X]=<float_number>  - specifies timeout for each benchmarking application per single evaluation on each network in sec, min or hours. Default: 0 sec  - no timeout
    Xs  - time in seconds. Default option
    Xm  - time in minutes
    Xh  - time in hours

Usage Examples

Synthetic networks generation, clustering algorithms execution and evaluation

$ pypy ./benchmark.py -g=3.2=syntnets_i3_s4 -cr -a="scp oslom2" -r -emn -tm=90

Run the benchmark under PyPy.
Generate synthetic networks producing 3 instances of each network with 2 shuffles (random reordering of network nodes) of each instance, yielding 3*2=6 synthetic networks of each type (for each set of network generation parameters). Generated networks are stored in the ./syntnets_i3_s4/ directory.
Convert all networks into the .hig format, resolving duplicated links. This conversion is required to evaluate the modularity measure.
Run scp and oslom2 clustering algorithms for each generated network and evaluate modularity and NMI measures for these algorithms.
Timeout is 90 min for each task of each network processing, where the tasks are: network generation, clustering and evaluation by each specified measure. A network here is each shuffle of each instance of each network type.

Shuffling existing network instances, clustering algorithm execution and evaluation

$ ./benchmark.py -g=.4 -d=syntnets_i3_s4 -a=oslom2 -es -th=1

Run the benchmark for the networks located in ./syntnets_i3_s4/ directory.
Produce 4 shuffles of the specified networks; previously existing shuffles are backed up.
Run the oslom2 clustering algorithm for the specified networks with their shuffles and evaluate the NMI_s measure.
Timeout is 1 hour for each task on each network.

Aggregation of the specified evaluation results

$ pypy benchmark.py -s=results/scp/mod/*.mod

Results aggregation is performed with automatic identification of the target clustering algorithm and evaluation measure from the specified path. It is performed automatically as the last step of the algorithm evaluation, but it can also be called manually for a modified scope.
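
A manual pass over such a scope could look like the following sketch (hypothetical, assuming each .mod file lists one Q value per line as shown in the Benchmark Structure section below; the built-in aggregation of benchmark.py is more elaborate):

import glob

def aggregate_mod(pattern):
    """Average the Q values over all .mod files matching the glob pattern.

    pattern  - glob pattern such as 'results/scp/mod/*.mod'
    return   - dict mapping the network name to its average Q over all levels and shuffles
    """
    avgq = {}
    for fname in glob.glob(pattern):
        with open(fname) as fmod:
            vals = [float(line.split()[0]) for line in fmod
                if line.strip() and not line.startswith('#')]
        if vals:
            netname = fname.rsplit('/', 1)[-1].rsplit('.', 1)[0]
            avgq[netname] = sum(vals) / len(vals)
    return avgq

print(aggregate_mod('results/scp/mod/*.mod'))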

Benchmark Structure

Example of the <entity>.rcp format:

# ExecTime(sec) CPU_time(sec)   CPU_usr(sec)    CPU_kern(sec)   RSS_RAM_peak(Mb)    TaskName
2.575555    2.574302    2.540420    0.033882    6.082   5K5
0.528582    0.528704    0.519277    0.009427    3.711   2K10
...
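
Such a trace can be loaded with a few lines of Python (a hypothetical parsing sketch, not part of the framework):

def load_rcp(fname):
    """Load an <entity>.rcp resource consumption trace.

    return  - list of dicts with the execution time, CPU times, peak RSS RAM and task name
    """
    fields = ('exectime', 'cputime', 'cpuusr', 'cpukern', 'rsspeak', 'task')
    rows = []
    with open(fname) as frcp:
        for line in frcp:
            line = line.strip()
            if not line or line.startswith('#'):  # skip the header and empty lines
                continue
            vals = line.split()
            rows.append(dict(zip(fields, [float(v) for v in vals[:-1]] + [vals[-1]])))
    return rows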

Example of the .res format:

# --- 2015-12-31 16:15:37.693514, output:  Q_avg
# <network> ganxis  louvain_igraph  ...
karate  0.130950    0.414481    0.233974    0.240929
jazz_u  0.330844    0.400587    0.392081    0.292395
...

Example of the .resx format:

# --- 2015-12-31 17:05:50.582245 ---
# <network>
#   <alg1_outp>
#   <alg2_outp>
#   ...
karate
    ganxis> Q: 0.130950 (0.084073 .. 0.217867), s: 0.163688, count: 5, fails: 0, d(shuf): 0.133794, s(shuf): 0.0566965, count(shuf): 5, fails(shuf): 0
    louvain_igraph> Q: 0.414481 (0.395217 .. 0.419790), s: 0.518101, count: 5, fails: 0, d(shuf): 0.024573, s(shuf): 0.0120524, count(shuf): 5, fails(shuf): 0
    ...
jazz_u
    ganxis> Q: 0.340728 (0.321374 .. 0.371617), s: 0.42591, count: 5, fails: 0, d(shuf): 0.050243, s(shuf): 0.0219596, count(shuf): 5, fails(shuf): 0
    louvain_igraph> Q: 0.400587 (0.399932 .. 0.400999), s: 0.534116, count: 4, fails: 0, d(shuf): 0.001067, s(shuf): 0.000595067, count(shuf): 4, fails(shuf): 0
    ...
...

Example of the <net_instance>.nmi[_s] format:

# NMI   level[/shuffle]
0.815814    0
0.870791    1
0.882737    0/1
...

Example of the <net_instance>.mod format:

# Q level[/shuffle]
0.333874    1
0.32539 0
0.313085    0/1
...
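
Both the .nmi[_s] and .mod files hold one value per hierarchy level (and per shuffle), so picking the best level is a short scan (a hypothetical sketch, not part of the framework):

def best_level(fname):
    """Return the (value, level) pair with the maximal measure value
    from a <net_instance>.mod or <net_instance>.nmi[_s] file."""
    best = None
    with open(fname) as fvals:
        for line in fvals:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            val, level = line.split(None, 1)
            if best is None or float(val) > best[0]:
                best = (float(val), level)
    return best

# For the .mod example above, best_level() yields (0.333874, '1')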

Extension

To add custom apps / algorithms to be benchmarked, just add a corresponding function for the "myalgorithm" app to benchapps.py:

def execMyalgorithm(execpool, netfile, asym, timeout, pathid='', selfexec=False):
    """Execute the algorithm (stub)

    execpool  - execution pool to perform execution of current task
    netfile  -  input network to be processed
    asym  - network links weights are asymmetric (in/outbound weights can be different)
    timeout  - execution timeout for this task
    pathid  - path id of the net to distinguish nets with the same name located in different dirs.
        Note: pathid is prepended with the separator symbol
    selfexec  - whether the current execution is an external call or an internal self-call

    return  - number of executions (jobs) made
    """

All the evaluations will be performed automatically; the algorithm just needs to follow the convention for the output of the execution results.
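
A minimal sketch of such a function is given below. It only illustrates the expected structure: the Job descriptor, its parameters and the execpool.execute() call are assumptions about a PyExPool-like interface rather than the exact API, and the output directory follows the "algorithms/<algname>outp/" convention mentioned above.

import os

def execMyalgorithm(execpool, netfile, asym, timeout, pathid='', selfexec=False):
    """Schedule "myalgorithm" on the specified network (illustrative sketch)"""
    # Task name: network file name without the extension, plus the optional path id
    taskname = os.path.splitext(os.path.basename(netfile))[0] + pathid
    outpdir = 'algorithms/myalgorithmoutp/'
    # Assumed PyExPool-like job descriptor and scheduling call; Job would be
    # imported from the benchmark's execution pool module (name omitted here)
    execpool.execute(Job(name=taskname,
        args=('./myalgorithm', '--directed' if asym else '--undirected', netfile, outpdir),
        timeout=timeout))
    return 1  # number of scheduled jobs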

Related Projects

If you are interested in this benchmark, please visit eXascale Infolab, where you can find other projects and research papers related to Big Data!