Sotera / spark-distributed-louvain-modularity

Spark / graphX implementation of the distributed louvain modularity algorithm
Apache License 2.0
311 stars 153 forks source link

Might be a bug? #1

Open HongHuangNeu opened 9 years ago

HongHuangNeu commented 9 years ago

In line 233 of This file, when calculating k_i_in_L, the variable "internalWeight" is added if the community that the node is in and the community the node is testing is the same community. If I understand it correctly, variable "internalWeight" is the self-loop edge. If that is the case the weight of self loop should not be included in k_i_in because both ends of this edge are the same node, and therefore are always in the same community. @eric-kimbrel

eric-kimbrel commented 9 years ago

My understanding of the original Louvain algorithm is that internal edge weights (self loops) are included when calculating the edge weights within a community (k_i_in). This is especially important in later iterations when many nodes have been compressed to a single node in the graph with a large number of internal edges.

If the initial input graph had many self loops that the user believes to be meaningless and not significant to community structure then I think the best thing to do would be to pre-process the graph to remove those initial self loops.

So i believe the current behavior is correct, however the best way to see would be to run some tests and compare the quality of communities that result.

Ranumao commented 9 years ago

At each level in the Louvain process, the sum of intra-communities edges and inter-communities edges remain the same. At the beginning, self-loops of the initial data are the only existing intra-community edges. During the next iterations, these intra-communities edges remain and avoid communities growing endless, helping communities to stay balanced.