This might sound a bit specific, but given a social graph X and a ground truth partition Y: if all nodes of Y are in X, but not all nodes of X are in Y, how does one do the evaluation?
It's a good question, but I don't think there is a standard answer. Do you mean that the ground truth is incomplete, so the partition you found covers more nodes than the ground truth? One option is to use an overlapping evaluation measure (https://cdlib.readthedocs.io/en/latest/reference/eval/cdlib.evaluation.overlapping_normalized_mutual_information_LFK.html): it accepts different sets of nodes, but the differences will be counted as "errors". Otherwise, you can compute a partial score by keeping only the set of nodes common to both partitions. That is a technique I used for some dynamic partitions in which not all communities are present in the ground truth at every time step.
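For illustration, a minimal sketch of that overlapping measure on two clusterings with different node sets (the graph and ground truth here are placeholders, not from this thread):

```python
import networkx as nx
from cdlib import NodeClustering, algorithms, evaluation

g = nx.karate_club_graph()
detected = algorithms.louvain(g)  # covers all 34 nodes
# hypothetical ground truth covering only a subset of the graph
gt = NodeClustering([[0, 1, 2], [8, 14, 15, 18]], g,
                    method_name="ground_truth", method_parameters={})

# LFK overlapping NMI accepts different node sets,
# but the uncovered nodes count against the score
res = evaluation.overlapping_normalized_mutual_information_LFK(gt, detected)
print(res.score)
```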
I would suggest using something like NF1, which assumes neither complete partitions nor non-overlapping ones.
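A similar sketch for that option, reusing the `gt` and `detected` clusterings from above (assuming cdlib's `evaluation.nf1`):

```python
# NF1 is designed for partial node coverage and overlapping communities
res = evaluation.nf1(gt, detected)
print(res.score)
```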
@Yquetzal thanks for the input. I want something similar to NMI, but if LFK treats the unknown nodes as errors, that is no good. A variant where only the known nodes are evaluated would be very useful.
Also @GiulioRossetti, thanks for presenting an alternative to F1 scores.
I don't think we will include this in the library, since it is quite a specific use case. But here is an example of code that creates a "filtered" clustering based on a list of nodes that you want to keep:
```python
from cdlib import algorithms, NodeClustering

unfiltered_coms = algorithms.louvain(g)
nodes_to_keep = {8, 14, 0, 1, 2}

filtered_communities = []
for com in unfiltered_coms.communities:
    filtered_com = list(set(com).intersection(nodes_to_keep))
    if len(filtered_com) > 0:
        filtered_communities.append(filtered_com)

filtered_clustering = NodeClustering(filtered_communities, unfiltered_coms.graph,
                                     method_name=unfiltered_coms.method_name + "_filtered",
                                     method_parameters=unfiltered_coms.method_parameters)
```
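Once both clusterings cover the same node set, the crisp measures should accept them; for example (a hypothetical usage, assuming `ground_truth` is a NodeClustering restricted to the same filtered nodes):

```python
# hypothetical: ground_truth covers exactly the filtered node set
res = ground_truth.normalized_mutual_information(filtered_clustering)
print(res.score)
```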
Thanks for this code tidbit!
@Yquetzal even after I sanitized the data so that the two node sets match, it still returns "ValueError: Both partitions should cover the same node set", and when LFK is used it always outputs 0 for some reason. Here is my code:
```python
from copy import deepcopy
from cdlib import NodeClustering

def normalization(a, b):
    c = deepcopy(b)
    nodes_to_keep = set(a.graph.nodes)
    c_new = []
    for com in c.communities:
        filtered_com = list(set(com).intersection(nodes_to_keep))
        if len(filtered_com) > 0:
            c_new.append(filtered_com)
    return NodeClustering(c_new, c.graph,
                          method_name=c.method_name + "_filtered",
                          method_parameters=c.method_parameters)

def test_algorithm(main_graph, algorithm, comm, df1):
    unfiltered_coms = algorithm(main_graph)
    dic = dict(unfiltered_coms.to_node_community_map())
    filtered_clustering = normalization(comm, unfiltered_coms)
    comm = normalization(unfiltered_coms, comm)
    print(len(comm.graph.nodes), len(filtered_clustering.graph.nodes))
    print(len(set(comm.graph.nodes).symmetric_difference(filtered_clustering.graph.nodes)))
    # nmi = comm.overlapping_normalized_mutual_information_LFK(filtered_clustering)
    # nmi = comm.overlapping_normalized_mutual_information_LFK(unfiltered_coms)
    nmi = comm.normalized_mutual_information(filtered_clustering)
    updatedataFrame(df1, dic, algorithm.__name__)  # user-defined helper
    return [unfiltered_coms, dic, nmi]
```
I'm not certain what you're trying to do; maybe you can share an example network on which you have the problem. In my opinion, nodes_to_keep should be initialized with the nodes in the reference clustering, not the nodes of the graph, so I modified your normalization function accordingly:
```python
def normalization(a, b):
    """
    a is a reference clustering
    b is the clustering to normalize
    """
    c = deepcopy(b)
    nodes_to_keep = {n for com in a.communities for n in com}
    c_new = []
    for com in c.communities:
        filtered_com = list(set(com).intersection(nodes_to_keep))
        if len(filtered_com) > 0:
            c_new.append(filtered_com)
    return NodeClustering(c_new, c.graph,
                          method_name=c.method_name + "_filtered",
                          method_parameters=c.method_parameters)
```
Then I removed the elements I did not need from the other function:
```python
def test_algorithm(main_graph, algorithm, comm, df1=None):
    unfiltered_coms = algorithm(main_graph)
    print(unfiltered_coms.communities)
    dic = dict(unfiltered_coms.to_node_community_map())
    filtered_clustering = normalization(comm, unfiltered_coms)
    print(filtered_clustering.communities)
    # comm = normalization(unfiltered_coms, comm)
    # print(len(comm.graph.nodes), len(filtered_clustering.graph.nodes))
    # print(len(set(comm.graph.nodes).symmetric_difference(filtered_clustering.graph.nodes)))
    # nmi = comm.overlapping_normalized_mutual_information_LFK(filtered_clustering)
    # nmi = comm.overlapping_normalized_mutual_information_LFK(unfiltered_coms)
    print(comm.communities)
    nmi = comm.normalized_mutual_information(filtered_clustering)
    print(nmi)
    # updatedataFrame(df1, dic, algorithm.__name__)
    # return [unfiltered_coms, dic, nmi]
```
and finally I test with:
```python
import networkx as nx
from cdlib import algorithms, NodeClustering

g = nx.karate_club_graph()
referenceClustering = NodeClustering([[0, 1, 2], [8, 14, 15, 18]], g,
                                     method_name="reference",
                                     method_parameters={})
test_algorithm(g, algorithms.louvain, referenceClustering)
```
and it seems to work, but maybe you can confirm where the problem is.
For the graph creation:
```python
def directed(nodesDictionary):
    G = nx.DiGraph()
    for user in nodesDictionary:
        G.add_node(str(user))
        for node in nodesDictionary[user]:
            if str(node) not in G:  # compare against the str form actually stored
                G.add_node(str(node))
            G.add_edge(str(node), str(user))
    G.remove_nodes_from(list(nx.isolates(G)))  # needed to prevent malfunctions in CD algorithms
    G = G.subgraph([str(u) for u in nodesDictionary])  # needed to remove redundancy
    return G
```
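For example, with a hypothetical input dictionary mapping each user to their followers (illustrative data only, not from the thread):

```python
nodesDictionary = {
    "alice": ["bob", "carol"],  # edges bob->alice and carol->alice
    "bob": ["alice"],
    "carol": [],
}
G = directed(nodesDictionary)
print(list(G.nodes()), list(G.edges()))
```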
In your case, it was the opposite of what I was expecting: you have more nodes in your ground truth than in your graph. So I changed to
```python
filtered_clustering = normalization(comm, unfiltered_coms)
comm = normalization(filtered_clustering, comm)
```
to handle both cases. Then I still get an error, but for a different reason: your ground truth is overlapping. You end up giving an overlapping clustering as input to the NMI, which raises an error. In particular, after filtering, node 609096352 still appears in two communities of the ground truth.
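A quick way to spot (and, if acceptable, remove) such duplicate memberships, assuming `comm` is the ground-truth NodeClustering:

```python
from collections import Counter

# count how many communities each node belongs to
membership = Counter(n for com in comm.communities for n in com)
print([n for n, k in membership.items() if k > 1])  # e.g. [609096352]

# one possible fix: keep only the first community of each node
seen, crisp = set(), []
for com in comm.communities:
    kept = [n for n in com if n not in seen]
    seen.update(kept)
    if kept:
        crisp.append(kept)
```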
Testing with bigger margins
```python
def test_algorithm(main_graph, algorithm, comm, df1):
    unfiltered_coms = algorithm(main_graph)
    dic = dict(unfiltered_coms.to_node_community_map())
    filtered_clustering = normalization(comm, unfiltered_coms)
    new_comm = normalization(filtered_clustering, comm)
    # comm = normalization(unfiltered_coms, comm)
    # print(len(comm.graph.nodes), len(filtered_clustering.graph.nodes))
    # print(len(set(comm.graph.nodes).symmetric_difference(filtered_clustering.graph.nodes)))
    # nmi = comm.overlapping_normalized_mutual_information_LFK(filtered_clustering)
    # nmi = comm.overlapping_normalized_mutual_information_LFK(unfiltered_coms)
    nmi = new_comm.normalized_mutual_information(filtered_clustering)
    updatedataFrame(df1, dic, algorithm.__name__)  # user-defined helper
    return [unfiltered_coms, dic, nmi]
```
```
c:\Users\Brad\Documents\GitHub\twitter\nx_util.py in test_algorithm(main_graph, algorithm, comm, df1)
     80 # nmi = comm.overlapping_normalized_mutual_information_LFK(unfiltered_coms)
     81
---> 82 nmi = new_comm.normalized_mutual_information(filtered_clustering)
     83
     84 updatedataFrame(df1, dic, algorithm.__name__)

C:\tools\Anaconda3\lib\site-packages\cdlib\classes\node_clustering.py in normalized_mutual_information(self, clustering)
    558 """
    559
--> 560 return evaluation.normalized_mutual_information(self, clustering)
    561
    562 def overlapping_normalized_mutual_information_LFK(self, clustering):

C:\tools\Anaconda3\lib\site-packages\cdlib\evaluation\comparison.py in normalized_mutual_information(first_partition, second_partition)
     66
     67 from sklearn.metrics import normalized_mutual_info_score
---> 68 return MatchingResult(score=normalized_mutual_info_score(first_partition_c, second_partition_c))
     69
     70

C:\tools\Anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70 FutureWarning)
     71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
     73 return inner_f
     74

C:\tools\Anaconda3\lib\site-packages\sklearn\metrics\cluster\_supervised.py in normalized_mutual_info_score(labels_true, labels_pred, average_method)
    854
    855 """
--> 856 labels_true, labels_pred = check_clusterings(labels_true, labels_pred)
    857 classes = np.unique(labels_true)
    858 clusters = np.unique(labels_pred)

C:\tools\Anaconda3\lib\site-packages\sklearn\metrics\cluster\_supervised.py in check_clusterings(labels_true, labels_pred)
     59 raise ValueError(
     60 "labels_pred must be 1D: shape is %r" % (labels_pred.shape,))
---> 61 check_consistent_length(labels_true, labels_pred)
     62
     63 return labels_true, labels_pred

C:\tools\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
    253 uniques = np.unique(lengths)
    254 if len(uniques) > 1:
--> 255 raise ValueError("Found input variables with inconsistent numbers of"
    256 " samples: %r" % [int(l) for l in lengths])
    257

ValueError: Found input variables with inconsistent numbers of samples: [727, 726]
```
@Yquetzal oof, that explains it: in the data, the speaker of parliament is also a party member. Thanks!
I hope this can get even more streamlined in the near future, though.
Well, if you want to contribute, you are very welcome :) The problem on our side is mostly finding the time to do things :)