Contracting nodes into clusters for huge graphs

hcars commented 3 years ago

Truly, huge graphs (1,000,000s of nodes) may not produce useful visualizations with all nodes being plotted. In these cases, we may want an option to contract nodes into their clusters based on available clustering algorithms.

hcars commented 3 years ago

The following clusterings are available:

Clustering

igraph includes several approaches to unsupervised graph clustering and community detection:

Graph.components() (aka Graph.clusters()): the connected components
Graph.cohesive_blocks()
Graph.community_edge_betweenness()
Graph.community_fastgreedy()
Graph.community_infomap()
Graph.community_label_propagation()
Graph.community_leading_eigenvector()
Graph.community_leading_eigenvector_naive()
Graph.community_leiden()
Graph.community_multilevel() (a version of Louvain)
Graph.community_optimal_modularity() (exact solution, < 100 vertices)
Graph.community_spinglass()
Graph.community_walktrap()

@cckuhlman @ssravi-1984 I am thinking that we should compare some subset of these clusterings on a graph if plotting is invoked with clustering enabled. What do you think of how to choose said subset? Also, I understand that we should be using modularity score in our decision on which clustering to accept, but I am not sure how. Do you have any experience with this?

cckuhlman commented 3 years ago

Clustering is a good idea.

I have been away from clustering for a while, but about 5 years back or so, Louvain method was the defacto standard code to run community detection. It is based on modularity, developed, or at least popularized by Mark Newman.
But the reason it is used so often is that it is fast. Very fast. Very fast on large graphs. That is, it scales very well.

Of the ones you provide, the above suggests trying : Graph.community_multilevel() (a version of Louvain).

Also, there are C++ implementations of Louvain on the internet, for free download. It is easy to build and run. You might compare the igraph method with the pure Louvain. Both execution time and quality of visualizations that result from the communities.

You might try them on the undirected networks from the metadata extraction paper (Table 3), here: https://github.com/NSSAC/cines-meta-data-paper

@ssravi-1984

hcars commented 3 years ago

@cckuhlman I have been using the community_multilevel, and it seems to work quite well. Apparently, Leiden is a modification of Louvain that avoids poorly connected communities. I tried them both out on a large Amazon recommendation network, and I suspect that community_multilevel allowing poorly connected communities (in the worst-case) allows it to reduce the graph into a more readable manner than Leiden.

hcars commented 3 years ago

We are using community_multilevel, currently. I would like an option to dynamically decide between different algorithms based on their modularity score.

Users could choose a set of clustering algorithms to attempt, and we could choose the clustering with the best modularity score. Thoughts?

Any ideas on how this should proceed? @cckuhlman @ssravi-1984 @Lucaslhm

cckuhlman commented 3 years ago

This is a good idea Henry. Thanks. But in any code that has this type of approach (i.e., code makes decision), make sure there is an "override" where we can specify precisely what method we want in any given invocation (that is, take the decision making away from the code and into the hands of the analyst).

Am thinking of this use case; there are probably others: method X gives us the greatest modularity score, but method Y gives us a "better" looking resulting plot of a graph (where communities are nodes; and edges are edges between communities). "Better" may be fewer nodes, so it is easier to understand the graph, or more easily interpretable edges. That is, visualizations and understanding them with the eyeball is important.

hcars commented 3 years ago

But in any code that has this type of approach (i.e., code makes decision), make sure there is an "override" where we can specify precisely what method we want in any given invocation (that is, take the decision making away from the code and into the hands of the analyst).

I was thinking the "override" could be to only request one clustering be computed. Does this sound ok?

cckuhlman commented 3 years ago

Yes. Thanks.

hcars commented 3 years ago

This feature is implemented.

NSSAC / graph-plot

Contracting nodes into clusters for huge graphs #4