CWTSLeiden / networkanalysis

Java package that provides data structures and algorithms for network analysis.
MIT License

Question about clustering results #25

Open gcasamat opened 11 months ago

gcasamat commented 11 months ago

Hi,

Thanks for this fantastic package! I have an issue related to the output of the clustering algorithm. I ran the following code in Python:

import os

# Run the clustering tool; adjacent string literals are joined into one command line.
os.system("java -cp /Applications/networkanalysis/networkanalysis-1.3.0.jar nl.cwts.networkanalysis.run.RunNetworkClustering"
          " -n AssociationStrength -r 1 -m 20 --sorted-edge-list"
          " -o clusters.txt"
          " data_net.txt")

The output message announces 829 clusters, and 760 clusters after removing clusters consisting of fewer than 20 nodes. However, when I open the file clusters.txt:

Many thanks in advance for your explanation.

vtraag commented 11 months ago

Could you please provide a minimal reproducible example? Then we might be able to debug any problem. Without being able to replicate the problem, we also cannot solve it.

gcasamat commented 10 months ago

You can find below some input and output txt files for replicating the issue.

The command I execute is:

java -cp /Applications/networkanalysis/networkanalysis-1.3.0.jar nl.cwts.networkanalysis.run.RunNetworkClustering \
                      -n AssociationStrength -r 50 -m 50 --sorted-edge-list \
                      -o net_clusters_res50.txt data_net.txt

The output message from networkanalysis is:

Quality function: CPM
Normalization method: AssociationStrength
Resolution parameter: 50.0
Minimum cluster size: 50
Number of random starts: 1
Number of iterations: 10
Randomness parameter: 0.01
Random number generator seed: random
Running algorithm took 0s.
Quality function equals 0.9256850092525903.
Clustering consists of 1354 clusters.
Removing clusters consisting of fewer than 50 nodes.
Final clustering consists of 1353 clusters.

However, I count 1018 clusters in the file net_clusters_res50.txt, and many of them contain fewer than 50 nodes.

Thanks for your help.

data_net.txt net_clusters_res50.txt
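For reference, a count like the one above can be reproduced with a short Python sketch. It assumes that the clustering file has one line per node with the cluster number in the last tab-separated field, which may not match the actual output format. The helper name `summarize_clusters` and the toy data are hypothetical.

```python
from collections import Counter


def summarize_clusters(lines, min_size):
    """Return (number of distinct clusters, number of clusters below min_size).

    Assumption (not confirmed in this thread): each line describes one node
    and the cluster number is the last tab-separated field.
    """
    sizes = Counter()
    for line in lines:
        label = line.strip().split("\t")[-1]
        if not label.lstrip("-").isdigit():
            continue  # skip blank lines or a possible header row
        sizes[int(label)] += 1
    small = sum(1 for n in sizes.values() if n < min_size)
    return len(sizes), small


# Toy data standing in for the real file; the labels 0, 2 and 5 are not consecutive.
example = ["0\t0", "1\t0", "2\t2", "3\t5", "4\t5", "5\t5"]
n_clusters, n_small = summarize_clusters(example, min_size=3)
print(n_clusters, n_small)
```

With the real file, the same function could be called as `summarize_clusters(open("net_clusters_res50.txt"), 50)`. Counting distinct labels rather than reading off the largest label matters here, because the labels need not be consecutive.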

vtraag commented 10 months ago

There are two separate issues here:

  1. Communities are not consecutively numbered.
  2. Clusters may have fewer nodes than indicated by the threshold.

The first item should be solved; I've opened PR #27 for this.
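As a workaround on the user side, non-consecutive labels can be renumbered after the fact. The following is a minimal Python sketch of such a relabeling, not necessarily what the PR does. The function name `relabel_consecutively` is hypothetical, and labels are numbered in order of first appearance, which is an arbitrary choice.

```python
def relabel_consecutively(labels):
    """Map arbitrary cluster labels to 0..k-1 in order of first appearance."""
    mapping = {}
    result = []
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)  # next unused consecutive id
        result.append(mapping[label])
    return result


print(relabel_consecutively([0, 0, 3, 7, 3, 12]))  # [0, 0, 1, 2, 1, 3]
```

After relabeling, the largest label plus one equals the number of clusters, so the file and the reported count are easier to compare.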

The second item cannot be solved in this case: your network contains several connected components (1006, to be precise), and the algorithm will never create clusters larger than the individual components. Any component with fewer than 50 nodes therefore produces a cluster below the threshold. This will not be changed.
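To see how many components fall below a given threshold, the component sizes can be computed directly from the edge list. The following Python sketch uses a simple union-find; it is an illustration only and not code from the package. The function name `component_sizes` and the toy edge list are hypothetical, and nodes that appear in no edge are not counted.

```python
from collections import Counter


def component_sizes(edges):
    """Return connected-component sizes (largest first) of an undirected edge list."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv  # merge the two components

    counts = Counter(find(x) for x in list(parent))
    return sorted(counts.values(), reverse=True)


# Toy network with three components: {0, 1, 2}, {3, 4} and {5, 6, 7}.
edges = [(0, 1), (1, 2), (3, 4), (5, 6), (6, 7), (7, 5)]
sizes = component_sizes(edges)
min_size = 3
too_small = sum(1 for s in sizes if s < min_size)
print(sizes, too_small)  # [3, 3, 2] 1
```

Every component counted in `too_small` can only yield clusters smaller than `min_size`, whatever the resolution parameter.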

It might be possible to check the connected components and provide a warning if any of them are smaller than the minimum desired community size. However, this check takes extra time, so there should at least be an option to turn it off. What do you think, @neesjanvaneck?