GiulioRossetti / cdlib

Community Discovery Library
http://cdlib.readthedocs.io
BSD 2-Clause "Simplified" License
376 stars 71 forks source link

CPMVertexPartition.Bipartite implementation #139

Closed akbaritabar closed 3 years ago

akbaritabar commented 3 years ago

Is your feature request related to a problem? Please describe. First, thanks for cdlib, it is amazing. I am using it for bipartite network analysis and community detection and then evaluation of results of different algorithms. There is currently only bimlpa implemented for it. I have a suggestion.

Describe the solution you'd like Since you have already implemented CPMVertexPartition, I am guessing implementation of its bipartite (https://leidenalg.readthedocs.io/en/stable/multiplex.html?highlight=bipartite#bipartite) version should not be much complicated (I certainly hope so, but my background is in social sciences, so I cannot say for sure).

Describe alternatives you've considered I have provided below a sample of code I have written to use the CPMVertexPartition.Bipartite, but I use it directly in Leidenalg package by https://github.com/vtraag . Do you think there is hope to have CPMVertexPartition.Bipartite in cdlib?

Additional context This is an example of the code I run directly in leidenalg library:


def find_bipartite_partitions(graph, gamma, seed=0):
    """find_bipartite_partitions(graph, seed, gamma)

    Takes an igraph graph and returns the bipartite partitions detected with the provided gamma which is the internal density of the communities passed to CPM method in leiden algorithm package by Vincent Traag (https://leidenalg.readthedocs.io/en/latest/multiplex.html#bipartite). Results are returned as a pandas dataframe which includes vertices names, cluster membership and other node attributes in graph. It also return number of clusters detecetd and members in each of them to decide and change gamma if needed.

    Parameters
    ----------
    @param graph: igraph graph that needs to be bipartite (e.g., org/paper or author/paper) and named as we use vertex names and attributes in the detaframe returned.
    @param seed: the random seed to be used in CPM method to keep results/partitions replicable.
    @param gamma: CPM method (see leidenalg for more details: https://leidenalg.readthedocs.io/en/latest/multiplex.html#bipartite) takes a gamma which is the internal density of the communities detected and uses it as a resolution parameter.

    Examples
    --------
    >>> number_partitions, partitions_freq_members, results_df, p_01 = net_tables.find_bipartite_partitions(graph=gg, seed=0, gamma=1.625e-07)

    """

    import igraph as ig
    import leidenalg as la
    import pandas as pd

    optimiser = la.Optimiser()
    # set seed to get the same communities
    la.Optimiser.set_rng_seed(self=optimiser, value=seed)

    p_01, p_0, p_1 = la.CPMVertexPartition.Bipartite(graph, resolution_parameter_01=gamma)

    diff = optimiser.optimise_partition_multiplex([p_01, p_0, p_1], layer_weights=[1, -1, -1])

    # to see number of communities detected
    number_partitions = len(set(p_01.membership))
    # pd series of clusters with number of members in each
    partitions_freq_members = pd.Series(p_01.membership).value_counts()

    # add node attributes and cluster membership and build pandas table of results
    graph.vs['cluster'] = p_01.membership
    # for all other attributes
    all_attributes = dict()
    for node_attr in graph.vs.attributes():
        attr_list = [v[node_attr] for v in graph.vs]
        all_attributes[node_attr] = attr_list
    results_df = pd.DataFrame(all_attributes)

    return (number_partitions, partitions_freq_members, results_df, p_01)
GiulioRossetti commented 3 years ago

Hi, It is quite easy to integrate actually: I'll work on it next week.

Thank's for the issue. If you have links to other algorithms you would like to see in cdlib let us know!

akbaritabar commented 3 years ago

Thank you for your reply @GiulioRossetti, I am happy to see your open approach to this. Please have a look at this https://github.com/vtraag/leidenalg/blob/15b775897299eeea3235fe70fefa62e6c8c52f0e/src/functions.py#L100 link as well, I haven't checked in a while, it seems @vtraag has added a function for multiplex networks. Thus the implementation in cdlib might be even easier than what I said above (but I am still not sure).

vtraag commented 3 years ago

The find_partition_multiplex does something different than detecting communities on a bipartite graph, so I wouldn't recommend going there. It is indeed the easiest to use the CPMVertexPartition.Bipartite "constructor", and then use the optimise_partition_multiplex function as @akbaritabar suggested above. The underlying way this works is explained a bit more in-depth in the documentation. If you have any questions about this, feel free to ping me.

GiulioRossetti commented 3 years ago

Thanks to you both for the clarification. As promised, CPM Bipartite is also accessible through CDlib now (only on the GitHub master branch for now). Here's the doc.

By the way, @vtraag thanks for your excellent leidenalg library!

akbaritabar commented 3 years ago

Dear Giulio, Thank you for the reply and implementation.

I installed the Github master now and used it to replicate the CD I had done on multiple bipartite networks with Leidenalg library (code I posted above) and performed CPM_bipartite using CDlib using the same resolution parameters and seeds and following the example you kindly shared in documentation.

Results are consistent and similar, thanks. I am going to close this issue.

If I may give a brief suggestion please for the example you have in the documentation:


>>> from cdlib import algorithms
>>> import networkx as nx
>>> G = nx.algorithms.bipartite.generators.random_graph(100, 20, 0.1)
>>> coms = algorithms.CPM_Bipartite(G, 1)

It would be better if you increase the probability of ties from 0.1 to something like 0.5 in the example given in documentation, because the former sometimes causes the bipartite network to have disconnected components one needs to first extract the giant component and then perform CD on it otherwise it gives an error. Increasing that probability for demonstration purposes could resolve this.