GiulioRossetti / cdlib

Community Discovery Library
http://cdlib.readthedocs.io
BSD 2-Clause "Simplified" License
379 stars 73 forks source link

convert_graph_formats() breaks for bipartite networkx graphs, where the nodes are neither strings nor ints #241

Open shashank025 opened 6 months ago

shashank025 commented 6 months ago

Describe the bug

When the input graph for the method cdlib.utils.convert_graph_formats() is a networkx graph where the node type is neither a string not an int (but is hashable as networkx expects), if the graph is bipartite, this method fails when converting to igraph format with the following error when attempting to populate the gi.vs["type"] value:

Traceback (most recent call last):
  File "/Users/shashankr/projects/cdlib/test.py", line 29, in <module>
    convert_graph_formats(G, ig.Graph)
  File "/Users/shashankr/projects/cdlib/cdlib/utils.py", line 197, in convert_graph_formats
    return __from_nx_to_igraph(graph, directed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shashankr/projects/cdlib/cdlib/utils.py", line 141, in __from_nx_to_igraph
    if type(name) == int else a_r[int(name.replace("\\", ""))]
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '<__main__.Node object at 0x100e5b560>'

To Reproduce Steps to reproduce the behavior:

Here's a script (test.py) to repro the issue:

import networkx as nx
import igraph as ig

from cdlib import algorithms
from cdlib.utils import convert_graph_formats

class Node:
  def __init__(self, id, type):
    self.id = id
    self.type = type

  def __hash__(self):
    return hash(self.id)

  def __eq__(self, other):
    return self.id == other.id

john = Node('John Travolta', 'actor')
nick = Node('Nick Cage', 'actor')
face_off = Node('Face Off', 'movie')

G = nx.Graph()
G.add_node(john)
G.add_node(nick)
G.add_edge(john, face_off, label='ACTED_IN')
G.add_edge(nick, face_off, label='ACTED_IN')

convert_graph_formats(G, ig.Graph)

Simply run with:

python test.py

Expected behavior

I'd expect that call to convert_graph_formats() to work.

Screenshots NA

Additional context

I also have a fix that I'm workin on! I'll produce a PR shortly.

shashank025 commented 6 months ago

The root cause is that the logic inside the bipartite case branch was treating nodes that are just hashable (but neither strings nor ints) as strings. The fix is to properly handle such nodes by maintaining a mapping from strings to corresponding nodes:

diff --git a/cdlib/utils.py b/cdlib/utils.py
index 17a6e03..9ec917b 100644
--- a/cdlib/utils.py
+++ b/cdlib/utils.py
@@ -132,8 +132,9 @@ def __from_nx_to_igraph(g: object, directed: bool = None) -> object:
             gi.add_edges([("\\" + str(u), "\\" + str(v)) for (u, v) in g.edges()])

     if bipartite.is_bipartite(g) and not skip_bipartite:
+        convert = {str(x):x for x in g.nodes()}
         gi.vs["type"] = [
-            a_r[name] if type(name) == int else a_r[int(name.replace("\\", ""))]
+            a_r[name] if type(name) == int else a_r[convert[name.replace("\\", "")]]
             for name in gi.vs["name"]
         ]
GiulioRossetti commented 6 months ago

Hi, I'll look into it - but I cannot guarantee you that support for generic node object descriptors will be integrated.

The main reason for the actual nodetype constraints is that cdlib accommodates several graph representations (networkx, graph_tool, igraph, ad-hoc ones...).

Allowing generic objects as nodes makes it complicated to handle some cross-representation transformations (and could affect downstream visualization/evaluation tasks requiring printable node labels - we cannot assume that all objects will came with well defined and unique string reprs).

Moreover, using complex objects as nodes usually generate a relevant memory footprint: if such information is not needed during computation (rarely the case in CD) it is indeed better to keep it separated from the graph topology (and maybe use it for post-process analysis).

shashank025 commented 6 months ago

Thank you!

I understand that cdlib needs to work with diverse graph representations, etc, and I agree that working with complex objects can have adverse memory impact, but real world applications using cdlib will often already deal with complex node objects. As a library, cdlib will be more "ergonomic" if it can consume such objects as is.

The good thing is this should be relatively easy for cdlib to do, by taking advantage of the fact that node objects are hashable, by simply invoking id(node) on the object, which results in integer node values. Applications can later "recover" the original node object via:

recover = { id(node): node for node in graph.nodes() }

I already see similar code in cdlib:

convert = { str(node): node for node in graph.nodes() }

which makes me suspect the approach I'm suggesting should be easy to apply.

Let me know what you think! I'm actually willing to help with this as well.