This page is a companion for the paper
> Amir Ghasemian, Homa Hosseinmardi, and Aaron Clauset
>
Evaluating Overfit and Underfit in Models of Network Community Structure, IEEE TKDE 32(9), 1722-1735 (2019).
on evaluating the overfitting and underfitting of community detection methods on real networks.
CommunityFitNet provides both (i) a reference set of networks and (ii) partitions of those networks for a large set of state-of-the-art community detection algorithms (Table 1 of the paper).
The purpose of this package is to facilitate between-algorithm comparisons on a large and realistic corpus of network data sets drawn from a variety of domains and of a variety of sizes. The qualitative behavior of new community detection algorithms can be assessed by comparing their partitions to those in the reference set. To compare a new algorithm with those in our evaluation set of algorithms, a researcher can run the new algorithm on the proposed benchmark, and identify which reference algorithm has the most similar behavior, e.g., in the average number of communities found (Fig. 2b of the paper). We believe the availability of this benchmark and the results of running so many state-of-the-art algorithms on it should facilitate further advances in developing community detection algorithms.
Fig. 2b of the paper
General algorithms like MDL, Bayesian methods and regularized-likelihood algorithms tend to perform very well under different settings and can be used as reference methods for comparing with new methods. Additionally, popular methods like Infomap and modularity tend to over-fit in practice and are thus not generally reliable, at least under link prediction (Fig. 4 of the paper). However, when these more specialized methods are paired with their preferred inputs, they tend to perform much better (Fig. 6 of the paper). Generally community detection algorithms can be categorized into two general settings of probabilistic and heuristic methods. This dichotomy can be seen in the hierarchical clustering of 16 state-of-the-art community detection algorithms (Fig. 3 of the paper).
Fig. 4 of the paper
Fig. 6 of the paper
Fig. 3 of the paper
To appear, IEEE Trans. Knowledge and Data Engineering (TKDE) (2019),
Evaluating and Comparing Overfit in Models of Network Community Structure
Amir Ghasemian, Homa Hosseinmardi, and Aaron Clauset
( arXiv version )
We found an indexing issue in some of the bipartite networks (all “Norwegian_Board_of_Directors_net2mode…” (111 networks — network ids = 254–364) and one network called “Aishihik_Lake_host-parasite_web_Aishihik_Lake_host-parasite_web” (1 network - network id = 0) have this indexing issue). Therefore, we provide two versions of the data. One for replication purposes only, and one for reusing the data by other researchers.
This package contains the corpus of 572 real-world networks (the corrected networks are replaced with the networks in the paper) from many scientific domains drawn from the Index of Complex Networks (ICON). This corpus spans a variety of sizes and structures, with 22% social, 21% economic, 34% biological, 12% technological, 4% information, and 7% transportation graphs (Fig. 1 of the paper). In addition to the information about each network, we provide the partitions achieved by our set of chosen algorithms in our paper (except than for corrected networks) for further study and comparisons by other researchers in the field.
Note: This data is for replication purposes only. If you want to use this data for your project the corrected version is provided above.
This package contains the corpus of 572 real-world networks (original networks have been used in our paper) from many scientific domains drawn from the Index of Complex Networks (ICON). This corpus spans a variety of sizes and structures, with 22% social, 21% economic, 34% biological, 12% technological, 4% information, and 7% transportation graphs (Fig. 1 of the paper). In addition to the information about each network, we provide the partitions achieved by our set of chosen algorithms in our paper for further study and comparisons by other researchers in the field.
To load the data:
import pickle
# load the data
infile = open('./CommunityFitNet.pickle','rb')
df = pickle.load(infile)
# read edge lists for all networks
df_edgelists = df['edges_id'] # column 'edges_id' in dataframe df includes the edge list
# for each network
# extract the edge list for the first network
edges_orig = df_edgelists.iloc[0] # a numpy array of edge list for original graph
Fig. 1 of the paper
Table 1 of the paper
In the following table the time Complexity and the link of the code for community detection methods have been used in the paper are provided. In the time complexity column, $N$ is the number of nodes, $E$ is the number of edges, $k$ is the number of communities, $k_{\max}$ is the maximum number of clusters considered in model selection, $T$ is the number of iterations for convergence, and $\tau$ is the mixing time of the Markov chain in corresponding algorithm. The maximum number of clusters considered in model selection, $k_{\max}$, is $O(\sqrt{N})$. Therefore, the time complexity of LRT-WB can be simplified as $O(N^{3/2} k^2)$.
Note: Here, the time complexities are provided for the algorithms and the time complexity of the codes are possibly different than the ones in the table.
Table of time complexity and the link of the codes have been used for community detection algorithms
[1] P.-Y. Chen. Analysis and actions on graph data. 2016.
[8] Y. R. Wang, P. J. Bickel, et al. Likelihood-based model selection for stochastic block models. Ann. Stat., 45(2):500–528, 2017. ### How to cite this work:
If you use this data in your research, please cite it as follows:
@article{ghasemian2019evaluating, title = {Evaluating overfit and underfit in models of network community structure}, author = {Ghasemian, Amir and Hosseinmardi, Homa and Clauset, Aaron}, journal = {IEEE Transactions on Knowledge and Data Engineering}, volume = {32}, number = {9}, pages = {1722--1735}, year = {2019}, publisher = {IEEE}, href = {https://journals.aps.org/prx/abstract/10.1103/PhysRevX.6.031005}, }