CWTSLeiden / networkanalysis

Java package that provides data structures and algorithms for network analysis.
MIT License

Clustering with normalization methods #7

Status: Closed (maximiliano02 closed 4 years ago)

maximiliano02 commented 4 years ago

Hello Mr. Traag,

Thank you very much for sharing your implementation of the Leiden algorithm.

I'm using this package to cluster COVID-19 co-occurrence data.

First, I used the jar archive to cluster my documents. I observed that I obtain the same results as VOSViewer only with non-normalized edges (parameter: -w). I have also developed a unit test for that. However, with normalized data as input, I get as many clusters as there are documents, and the clustering quality is poor. The input data and the VOSViewer output are accessible below.

In version 1.0, does the jar only cluster data with non-normalized edges?

After that, I analyzed the code. I saw that raw (non-normalized) edges are expected as input. Also, some existing methods are not used (e.g. createNormalizedNetworkUsingAssociationStrength).

I plan to develop this functionality by following the steps below:

  1. Add a new parameter after -w with one of these values, as in VOSViewer: {No normalization, Association strength, Fractionalization, Lin/log modularity}
  2. Modify the RunNetworkClustering class, in particular the readEdgeList method
  3. Use the already prepared methods (e.g. createNormalizedNetworkUsingAssociationStrength) in the Network class

Can you please give me some advice on the implementation I plan to do?

Thank you very much.

Technical details:

  RunNetworkClustering.jar version 1.0.0
  java version "1.8.0_161"
  Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
  Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
  Windows 10 Professional

Link to data: https://drive.switch.ch/index.php/s/FxICtNgJO8B7s74

vtraag commented 4 years ago

Hi @maximiliano02, thanks for offering such an improvement! Indeed, most of the methodology to create such normalizations is already present in the codebase; we simply have not yet exposed it. So, at the moment, only non-normalized weights are used. Feel free to create a PR to contribute your improvement.

Regarding some of the technical details:

  1. Please note that -w is already used to indicate that the network is weighted. We suggest using -n [NORMALIZATION] to indicate the weight normalization method, where NORMALIZATION can be None (which should be the default), AssociationStrength or Fractionalization.
  2. Please do not modify the readEdgeList method, but rather, call the methods createNormalizedNetworkUsingAssociationStrength or createNormalizedNetworkUsingFractionalization once the network is loaded here: https://github.com/CWTSLeiden/networkanalysis/blob/8bb4689a7fa846221e81ac1653092943469a1208/src/cwts/networkanalysis/run/RunNetworkClustering.java#L300
  3. See previous point.
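
Points 1 and 2 could fit together roughly as in the sketch below. It is only an illustration under stated assumptions, not the code that ended up in the repository. The class NormalizationSketch, the enum Normalization, the interface NormalizableNetwork and the helper methods are hypothetical names introduced here; only the -n option, its three values, and the two createNormalizedNetworkUsing... method names come from this thread. The interface stands in for the real Network class so that the example compiles on its own, and the case-insensitive matching of the option value is an extra assumption.

```java
import java.util.Locale;

/** Hypothetical sketch of a "-n [NORMALIZATION]" option; not the repository's code. */
final class NormalizationSketch {

    /** The three values suggested for NORMALIZATION in this thread. */
    enum Normalization { NONE, ASSOCIATION_STRENGTH, FRACTIONALIZATION }

    /**
     * Stand-in for the Network class (assumption): it only declares the two
     * normalization methods mentioned in this thread.
     */
    interface NormalizableNetwork<T> {
        T createNormalizedNetworkUsingAssociationStrength();
        T createNormalizedNetworkUsingFractionalization();
    }

    /** Maps the option value to an enum constant, ignoring case (assumption). */
    static Normalization parseNormalization(String value) {
        switch (value.toLowerCase(Locale.ROOT)) {
            case "none":
                return Normalization.NONE;
            case "associationstrength":
                return Normalization.ASSOCIATION_STRENGTH;
            case "fractionalization":
                return Normalization.FRACTIONALIZATION;
            default:
                throw new IllegalArgumentException("Unknown normalization: " + value);
        }
    }

    /** Scans the arguments for -n; without -n the default is NONE. */
    static Normalization readNormalizationOption(String[] args) {
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-n")) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("Missing value after -n");
                }
                return parseNormalization(args[i + 1]);
            }
        }
        return Normalization.NONE;
    }

    /** Applied once the network is loaded, so readEdgeList stays untouched. */
    static <T extends NormalizableNetwork<T>> T normalize(T network, Normalization method) {
        switch (method) {
            case ASSOCIATION_STRENGTH:
                return network.createNormalizedNetworkUsingAssociationStrength();
            case FRACTIONALIZATION:
                return network.createNormalizedNetworkUsingFractionalization();
            default:
                return network; // NONE: keep the raw edge weights
        }
    }
}
```

With this shape, the main method would call readNormalizationOption(args) next to the existing -w handling and pass the loaded network through normalize(...) before clustering, which keeps the normalization step separate from file parsing.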

If you are not yet familiar with contributing a PR, the process is, in essence, relatively simple.

  1. Create a fork
  2. Make a new branch, starting from this master branch.
  3. Push your changes to the new branch in your own fork.
  4. Create a Pull Request.

For more information about this, please refer to this explanation from GitHub.

neesjanvaneck commented 4 years ago

Thanks for your suggestion! The normalization feature is now available in the master branch. See #9.