GiulioRossetti / cdlib

Community Discovery Library
http://cdlib.readthedocs.io
BSD 2-Clause "Simplified" License
368 stars 71 forks source link

Guidance for selection of algorithms listed in overlapping communities #138

Closed DeepaMahm closed 3 years ago

DeepaMahm commented 3 years ago

Hello Everyone,

I'm trying to partition a networkx graph (with around 700 nodes) into ~5 to 7 subgraphs with overlapping nodes (overlap size >=4 nodes). I'd like to try the Overlapping Communities algorithms listed here. Unfortunately, I don't have prior experience in this field and I'm not sure which algorithm will be apt for this case. I'd like to request for recommendations on some of the algorithms that I could try for partitioning the original graph into subgraphs with overlapping nodes.

Yquetzal commented 3 years ago

Hi, With around 700 nodes most algorithms should be able to work. However, it is difficult to recommend a method in particular, since to the best of my knowledge there is no method which is considered a standard in the literature for overlapping communities. One way to go would be to run algorithms with default parameters and to see if there is one whose result fit your expectation. But I'm not the best expert, probably Giulio can give better advice.

DeepaMahm commented 3 years ago

Hi @Yquetzal Thanks a lot for the response. I tried the angel algorithm in CDLIB to partition the original graph into subgraphs with 4 overlapping nodes. i.e for any cluster/community C1 there must be a cluster C2 such that |C1∩C2|≥4.

import networkx as nx
from cdlib import algorithms

if __name__ == '__main__':

    g = nx.karate_club_graph()

    coms = algorithms.angel(g, threshold=4, min_community_size=10)
    print(coms.method_name)
    print(coms.method_parameters)  # Clustering parameters)
    print(coms.communities)
    print(coms.overlap)
    print(coms.node_coverage)

Output:

ANGEL
{'threshold': 4, 'min_community_size': 10}
[[14, 15, 18, 20, 22, 23, 27, 29, 30, 31, 32, 8], [1, 12, 13, 17, 19, 2, 21, 3, 7, 8], [14, 15, 18, 2, 20, 22, 30, 31, 33, 8]]
True
0.6470588235294118

From the communities returned, 1 and 3 have an overlap of 4 nodes. However, 2 and 3 / 1 and 3 don't have an overlap size of 4 nodes. Also, it is not clear to me how the overlap threshold (4 overlaps) has to be specified in algorithms. angel(g, threshold=4, min_community_size=10). I tried setting threshold=4 to define an overlap size of 4 nodes. However, from the documentation available for angel I understand the threshold must lie between [0, 1]

:param threshold: merging threshold in [0,1].

I am not sure how to translate the 4 overlaps to the value that has to be set between the bounds [0, 1]. I tried something like threshold=4/len(G.nodes(). This didn't return 4 overlaps though. Suggestions will be really helpful.

Thank you very much

GiulioRossetti commented 3 years ago

Hi, @Yquetzal is right, there is not a standard method for overlapping CD I'm afraid. You have to test them and identify the one that better suits your needs (your net seems of a reasonable size for all the algorithms we have in CDlib).

Consider that only a handful of the overlapping CD available allows specifying the number of desired communities (only CONGO/CONGA if I remember correctly).

I tried the angel algorithm in CDLIB to partition the original graph into subgraphs with 4 overlapping nodes. i.e for any cluster/community C1 there must be a cluster C2 such that |C1∩C2|≥4.

That's not how Angel (and similarly, Demon) thresholding parameter works.

Angel's threshold - that has to lie in [0,1] to be meaningful - specifies the minimum similarity required for local-communities to be merged (this simplifying as much as possible - please refer to the paper for a more precise definition).

This is due to how Angel works: as a first step it identifies local-communities from each ego-network, then it merges them using the similarity threshold.

Hope this helps.

Giulio

Yquetzal commented 3 years ago

An old but well-known algorithm is kclique: you choose a single parameter which is the size of cliques, usually 3, 4 or 5 depending on the density of your network. The results depend a lot on your graph, sometimes it works, sometimes it fails completely, but at least it is simple to choose the parameters... But you're right that, when we have time, we should write a tuto on how to choose algorithms, that's in the TODO list...

DeepaMahm commented 3 years ago

Hi @Yquetzal and @GiulioRossetti I tried congo on my test network with 388 nodes.

{"communities": [[13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 237, 238, 239, 240, 241, 242, 243, 244, 245, 276, 277, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 295, 297, 330, 331, 332, 333, 334, 335, 336, 337, 343, 355, 356, 357, 362, 363, 382, 383, 384, 385, 386, 387], [86, 87, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 132, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 280, 296, 298, 299, 300, 301, 302, 303, 304, 305, 306, 311, 312, 313, 314, 346, 347, 351, 352, 353, 354, 358, 366, 367, 368, 369, 370, 371, 372, 373, 374, 380, 381], [131, 133, 134, 135, 136, 137, 138, 139, 140, 141, 179, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 193, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 263, 264, 265, 266, 267, 268, 269, 270, 271, 279, 281, 308, 309, 310, 323, 341, 342, 345, 348, 349, 350, 364, 377, 378], [3, 5, 6, 7, 8, 10, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 180, 192, 194, 195, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 272, 273, 274, 282, 318, 319, 320, 321, 322, 324, 325, 326, 327, 359, 360, 375, 376, 379], [1, 2, 4, 9, 11, 12, 14, 88, 160, 161, 162, 163, 164, 165, 229, 230, 231, 232, 233, 234, 235, 236, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 275, 278, 294, 307, 315, 316, 317, 328, 329, 338, 339, 340, 344, 361, 365, 388]], "algorithm": "Congo", "params": {"number_communities": 5, "height": 20}, "overlap": true, "coverage": 1.0}

In the result returned, overlap = true. But, I don't find any overlapping vertices i.e. the clusters appear to be disjoint sets. For example, I tried

print(set(d["communities"][0]) & set(d["communities"][4]))
print(set(d["communities"][0]) & set(d["communities"][3]))
print(set(d["communities"][0]) & set(d["communities"][2]))
print(set(d["communities"][0]) & set(d["communities"][1]))

returns null.

May I know if I have missed something?

GiulioRossetti commented 3 years ago

Hi, the fact that an algorithm is designed to allow the identification of overlapping clusters does not strictly imply that that it will decompose every graph in overlapping clusters.

It is likely that, for your network/parameter settings, congo finds only disjoint communities.

Yquetzal commented 3 years ago

Note that for some applications, if you're not satisfied with the result of overlapping communities, you could first detect non-overlapping communities (thus a full partition of the network), and then do a post process to add overlap, for instance for each node that has a stronger connection (metric of your choice) with other communities than the one it was attributed too, add this node to those communities. Overlapping approaches tend to have a particular approach, more conservative, more restrictive than non-overlapping ones: they consider as communities only groups that are really cohesive intrinsically, while non-overlapping methods tend to consider a community what is cohesive relatively to the rest of the graph. That is why overlapping communities might not cover all nodes and its normal, and they might provide results without overlap and it's normal, if communities are well separated (according to the algorithm).