GiulioRossetti / cdlib

Community Discovery Library
http://cdlib.readthedocs.io
BSD 2-Clause "Simplified" License

Question: Comparing a community detection algorithm result to partial ground truth data #173

BradKML closed this issue 3 years ago

BradKML commented 3 years ago

This might sound a bit specific, but: given a social graph X and a ground-truth partition Y, where all nodes of Y are in X but not all nodes of X are in Y, how does one do the evaluation?

Yquetzal commented 3 years ago

It's a good question, but I don't think there is a standard answer. You mean that the ground truth is incomplete, so the partition you found has more nodes than the ground truth? One option is to use an overlapping evaluation measure (https://cdlib.readthedocs.io/en/latest/reference/eval/cdlib.evaluation.overlapping_normalized_mutual_information_LFK.html); those measures accept different node sets, but the differences will be counted as "errors". Otherwise, you can compute a partial score by keeping only the set of nodes common to both partitions. That is a technique I used for some dynamic partitions in which not all communities are present at every point in time in the ground truth.
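For concreteness, here is a minimal sketch of the first option on the karate club graph; the partial ground truth below is made up purely for illustration:

import networkx as nx
from cdlib import NodeClustering, algorithms, evaluation

g = nx.karate_club_graph()
detected = algorithms.louvain(g)

# Hypothetical partial ground truth covering only 7 of the 34 nodes.
partial_truth = NodeClustering([[0, 1, 2, 3], [8, 32, 33]], g,
                               method_name="ground_truth", method_parameters={})

# The overlapping measure accepts different node sets, but nodes missing
# from one side are counted as disagreement and lower the score.
res = evaluation.overlapping_normalized_mutual_information_LFK(detected, partial_truth)
print(res.score)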

GiulioRossetti commented 3 years ago

I would suggest using something like NF1, which assumes neither complete partitions nor non-overlapping ones.
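A matching sketch, reusing the hypothetical detected and partial_truth objects from the example above:

from cdlib import evaluation

# NF1 matches each detected community against its best ground-truth
# counterpart, so neither clustering needs to be complete or crisp.
res = evaluation.nf1(detected, partial_truth)
print(res.score)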

BradKML commented 3 years ago

@Yquetzal thanks for the input. I want something similar to NMI, but if LFK treats the unknown nodes as errors, that is no good. If there were a way to evaluate only the known nodes, that would be very useful.

Also, @GiulioRossetti thanks for presenting an alternative to F1 scores.

Yquetzal commented 3 years ago

I don't think we will include this in the library, since it is a rather specific case. But here is some example code that creates a "filtered" clustering based on a list of nodes you want to keep:


from cdlib import NodeClustering, algorithms

unfiltered_coms = algorithms.louvain(g)

nodes_to_keep = {8, 14, 0, 1, 2}
filtered_communities = []
for com in unfiltered_coms.communities:
    # keep only the nodes we are interested in
    filtered_com = list(set(com).intersection(nodes_to_keep))
    if len(filtered_com) > 0:
        filtered_communities.append(filtered_com)

filtered_clustering = NodeClustering(filtered_communities, unfiltered_coms.graph,
                                     method_name=unfiltered_coms.method_name + "_filtered",
                                     method_parameters=unfiltered_coms.method_parameters)
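As a follow-up usage sketch (not part of the original comment): assuming a hypothetical ground_truth, a crisp NodeClustering whose communities cover exactly the nodes in nodes_to_keep, the filtered result could then be scored with standard NMI:

from cdlib import evaluation

# Hypothetical: `ground_truth` covers exactly the nodes in nodes_to_keep.
res = evaluation.normalized_mutual_information(ground_truth, filtered_clustering)
print(res.score)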
BradKML commented 3 years ago

Thanks for this code tidbit!

BradKML commented 3 years ago

@Yquetzal even after I sanitized the data so that the two node sets match, it still returns "ValueError: Both partitions should cover the same node set", and when LFK is used it always outputs 0 for some reason.

from copy import deepcopy
from cdlib import NodeClustering

def normalization(a, b):
    c = deepcopy(b)

    nodes_to_keep = set(a.graph.nodes)
    c_new = []
    for com in c.communities:
        filtered_com = list(set(com).intersection(nodes_to_keep))
        if len(filtered_com) > 0:
            c_new.append(filtered_com)

    return NodeClustering(c_new, c.graph,
                          method_name=c.method_name + "_filtered",
                          method_parameters=c.method_parameters)

def test_algorithm(main_graph, algorithm, comm, df1):
    unfiltered_coms = algorithm(main_graph)
    dic = dict(unfiltered_coms.to_node_community_map())

    filtered_clustering = normalization(comm, unfiltered_coms)
    comm = normalization(unfiltered_coms, comm)

    print(len(comm.graph.nodes), len(filtered_clustering.graph.nodes))
    print(len(set(comm.graph.nodes).symmetric_difference(filtered_clustering.graph.nodes)))

    # nmi = comm.overlapping_normalized_mutual_information_LFK(filtered_clustering)

    # nmi = comm.overlapping_normalized_mutual_information_LFK(unfiltered_coms)

    nmi = comm.normalized_mutual_information(filtered_clustering)

    updatedataFrame(df1, dic, algorithm.__name__)
    return [unfiltered_coms, dic, nmi]
Yquetzal commented 3 years ago

I'm not certain what you're trying to do; maybe you can share an example network on which you have the problem. In my opinion, nodes_to_keep should be initialized with the nodes in the clustering, not the nodes of the graph, so I modified your normalization function accordingly:

def normalization(a, b):
    """
    a is a reference clustering
    b is the clustering to normalize
    """
    c = deepcopy(b)

    nodes_to_keep = {n for com in a.communities for n in com}
    c_new = []
    for com in c.communities:
        filtered_com = list(set(com).intersection(nodes_to_keep))
        if len(filtered_com) > 0:
            c_new.append(filtered_com)

    return NodeClustering(c_new, c.graph,
                          method_name=c.method_name + "_filtered",
                          method_parameters=c.method_parameters)

Then I removed the elements I did not need from the other function:

def test_algorithm(main_graph, algorithm, comm, df1=None):
    unfiltered_coms = algorithm(main_graph)
    print(unfiltered_coms.communities)
    dic = dict(unfiltered_coms.to_node_community_map())

    filtered_clustering = normalization(comm, unfiltered_coms)
    print(filtered_clustering.communities)

    #comm = normalization(unfiltered_coms, comm)

    #print(len(comm.graph.nodes), len(filtered_clustering.graph.nodes))
    #print(len(set(comm.graph.nodes).symmetric_difference(filtered_clustering.graph.nodes)))

    # nmi = comm.overlapping_normalized_mutual_information_LFK(filtered_clustering)

    # nmi = comm.overlapping_normalized_mutual_information_LFK(unfiltered_coms)
    print(comm.communities)
    nmi = comm.normalized_mutual_information(filtered_clustering)
    print(nmi)

    #updatedataFrame(df1, dic, algorithm.__name__)
    #return [unfiltered_coms, dic, nmi]

and finally I test with:

g = nx.karate_club_graph()

referenceClustering = NodeClustering([[0, 1, 2], [8, 14, 15, 18]], g,
        method_name="reference",
        method_parameters={})
test_algorithm(g,algorithms.louvain,referenceClustering)

and it seems to work, but maybe you can confirm where the problem is.

BradKML commented 3 years ago

data.zip — the graph is in a CSV file and the ground truth is in a JSON file.

BradKML commented 3 years ago

For the graph creation:

import networkx as nx

def directed(nodesDictionary):
    G = nx.DiGraph()
    for user in nodesDictionary:
        G.add_node(str(user))
        for node in nodesDictionary[user]:
            if str(node) not in G:
                G.add_node(str(node))
            G.add_edge(str(node), str(user))

    G.remove_nodes_from(list(nx.isolates(G)))  # needed to prevent malfunctions in some CD algorithms
    G = G.subgraph(list(nodesDictionary.keys()))  # keep only the known users, removing redundant nodes
    return G
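A quick usage sketch with a made-up adjacency dictionary (the real input comes from the attached CSV/JSON files):

toy = {"alice": ["bob", "carol"], "bob": ["alice"], "carol": ["alice"]}
g = directed(toy)
print(g.nodes(), g.edges())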
Yquetzal commented 3 years ago

In your case, it was the opposite of what I was expecting: you have more nodes in your ground truth than in your graph. So I changed it to

filtered_clustering = normalization(comm, unfiltered_coms)
comm = normalization(filtered_clustering, comm)

to handle both cases. Then I still get an error, but for a different reason: your ground truth is overlapping, so you end up passing an overlapping clustering as input to the NMI, and it raises an error. In particular, after filtering, you still have node 609096352 in two communities in the ground truth.
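A small sketch to spot such overlaps (assuming comm is the ground-truth NodeClustering from the snippets above):

from collections import Counter

# Count how many communities each node belongs to; anything above 1
# makes the clustering overlapping and breaks the crisp NMI call.
memberships = Counter(n for com in comm.communities for n in com)
print([n for n, count in memberships.items() if count > 1])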

BradKML commented 3 years ago

Testing with bigger margins

def test_algorithm(main_graph, algorithm, comm, df1):
    unfiltered_coms = algorithm(main_graph)

    dic = dict(unfiltered_coms.to_node_community_map())

    filtered_clustering = normalization(comm, unfiltered_coms)
    new_comm = normalization(filtered_clustering, comm)

    #comm = normalization(unfiltered_coms, comm)

    #print(len(comm.graph.nodes), len(filtered_clustering.graph.nodes))
    #print(len(set(comm.graph.nodes).symmetric_difference(filtered_clustering.graph.nodes)))

    # nmi = comm.overlapping_normalized_mutual_information_LFK(filtered_clustering)

    # nmi = comm.overlapping_normalized_mutual_information_LFK(unfiltered_coms)

    nmi = new_comm.normalized_mutual_information(filtered_clustering)

    updatedataFrame(df1, dic, algorithm.__name__)
    return [unfiltered_coms, dic, nmi]
c:\Users\Brad\Documents\GitHub\twitter\nx_util.py in test_algorithm(main_graph, algorithm, comm, df1)
     80     # nmi = comm.overlapping_normalized_mutual_information_LFK(unfiltered_coms)
     81 
---> 82     nmi = new_comm.normalized_mutual_information(filtered_clustering)
     83 
     84     updatedataFrame(df1, dic, algorithm.__name__)

C:\tools\Anaconda3\lib\site-packages\cdlib\classes\node_clustering.py in normalized_mutual_information(self, clustering)
    558         """
    559 
--> 560         return evaluation.normalized_mutual_information(self, clustering)
    561 
    562     def overlapping_normalized_mutual_information_LFK(self, clustering):

C:\tools\Anaconda3\lib\site-packages\cdlib\evaluation\comparison.py in normalized_mutual_information(first_partition, second_partition)
     66 
     67     from sklearn.metrics import normalized_mutual_info_score
---> 68     return MatchingResult(score=normalized_mutual_info_score(first_partition_c, second_partition_c))
     69 
     70 

C:\tools\Anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

C:\tools\Anaconda3\lib\site-packages\sklearn\metrics\cluster\_supervised.py in normalized_mutual_info_score(labels_true, labels_pred, average_method)
    854 
    855     """
--> 856     labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
    857     classes = np.unique(labels_true)
    858     clusters = np.unique(labels_pred)

C:\tools\Anaconda3\lib\site-packages\sklearn\metrics\cluster\_supervised.py in check_clusterings(labels_true, labels_pred)
     59         raise ValueError(
     60             "labels_pred must be 1D: shape is %r" % (labels_pred.shape,))
---> 61     check_consistent_length(labels_true, labels_pred)
     62 
     63     return labels_true, labels_pred

C:\tools\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
    253     uniques = np.unique(lengths)
    254     if len(uniques) > 1:
--> 255         raise ValueError("Found input variables with inconsistent numbers of"
    256                          " samples: %r" % [int(l) for l in lengths])
    257 

ValueError: Found input variables with inconsistent numbers of samples: [727, 726]

ground.zip

BradKML commented 3 years ago

@Yquetzal oof, that explains it: in the data, the speaker of parliament is also a party member. Thanks.
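For reference, one possible workaround (an assumption on my side, not what the thread settled on) is to force a crisp partition by keeping only each node's first community membership:

def make_crisp(communities):
    # Keep each node only in the first community it appears in; which
    # membership to prefer (speaker vs. party) is a modelling choice.
    seen = set()
    crisp = []
    for com in communities:
        kept = [n for n in com if n not in seen]
        seen.update(kept)
        if kept:
            crisp.append(kept)
    return crisp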

I hope this can get even more streamlined in the near future, though.

Yquetzal commented 3 years ago

Well, if you want to contribute, you are very welcome :) The problem on our side is mostly finding the time to do things :)