dkaslovsky / GraphRole

Automatic feature extraction and node role assignment for transfer learning on graphs (ReFeX & RolX)
MIT License

Can GraphRole be used on large networks? #6

Open rjurney opened 3 years ago

rjurney commented 3 years ago

We are interested in using this on a billion node network. How well does it scale to large graphs? We can partition our network if required, but we don't know if this is a multi-core implementation via networkx or if this is something not likely to scale beyond small networks.

dkaslovsky commented 3 years ago

Hi @rjurney, thanks for your interest in using GraphRole. This package hasn't been tested at the scale you mention, and part of the implementation uses Pandas, which might have problems at this scale.

One thing to note though is that GraphRole is not dependent on any particular graph library, so it can be integrated with any scalable graph library of your choice. All that needs to be done is to satisfy the required interface and make it discoverable. The steps are:

  1. Subclass the BaseGraphInterface class in graphrole/graph/interface/base.py and implement the required methods
  2. Update the INTERFACES dict in graphrole/graph/interface/__init__.py to make the new subclass discoverable
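
As a rough illustration of the two steps above, here is a minimal, self-contained sketch. It does not import GraphRole: `BaseGraphInterfaceSketch`, its three method names, `AdjacencyDictInterface`, and `INTERFACES_SKETCH` are all hypothetical stand-ins, so check the real abstract methods in the base module before adapting it. The backing graph is assumed to be a plain undirected adjacency dict with no self-loops.

```python
from abc import ABC, abstractmethod
from typing import Dict, Hashable, Iterable, List, Set, Type


class BaseGraphInterfaceSketch(ABC):
    """Hypothetical stand-in for GraphRole's BaseGraphInterface.

    The method names below are assumptions made for illustration; the
    real base class defines the authoritative set of required methods.
    """

    @abstractmethod
    def get_num_edges(self) -> int:
        ...

    @abstractmethod
    def get_nodes(self) -> Iterable[Hashable]:
        ...

    @abstractmethod
    def get_neighbors(self, node: Hashable) -> Iterable[Hashable]:
        ...


class AdjacencyDictInterface(BaseGraphInterfaceSketch):
    """Step 1: adapt a {node: set_of_neighbors} undirected graph."""

    def __init__(self, adjacency: Dict[Hashable, Set[Hashable]]) -> None:
        self._adj = adjacency

    def get_num_edges(self) -> int:
        # Each undirected edge is stored once per endpoint, so halve the total.
        return sum(len(nbrs) for nbrs in self._adj.values()) // 2

    def get_nodes(self) -> List[Hashable]:
        return list(self._adj)

    def get_neighbors(self, node: Hashable) -> List[Hashable]:
        # Sorted only to make the output deterministic.
        return sorted(self._adj[node])


# Step 2: register the subclass so it can be looked up by name.
# The key "adjacency_dict" is an arbitrary label chosen for this sketch.
INTERFACES_SKETCH: Dict[str, Type[BaseGraphInterfaceSketch]] = {
    "adjacency_dict": AdjacencyDictInterface,
}

graph = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
interface = INTERFACES_SKETCH["adjacency_dict"](graph)
print(interface.get_num_edges(), interface.get_neighbors("a"))
```

A backend for a distributed graph library would follow the same shape, with each method delegating to that library's own operations instead of a Python dict.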

See full instructions in the README for setting up tests if so desired.

I'd be very interested to know how it works out if you go down this route, please keep me posted!

rjurney commented 3 years ago

@dkaslovsky thanks, this is really helpful. What you've done here is really cool, and I am encouraging the Deep Discovery team to implement this using PySpark and GraphFrames. If we do, we will contribute it back, but setting up testing and related pieces may take some time. We'll open an intermediate PR to get things started. cc @ajs-dd

dkaslovsky commented 3 years ago

That's really exciting to hear. I've thought about adding a more scalable dataframe library in the past, so I'm really excited that you and your team might look into implementing it, and I'd be grateful for any contribution back to GraphRole. Please let me know if there's any help I can provide along the way!

dkaslovsky commented 3 years ago

Oh, one other thought I forgot to mention is that Dask might also be a good option to explore for distributed dataframe functionality.

rjurney commented 3 years ago

@dkaslovsky yeah, but we have a 1.5 billion node business graph, so we need it to work across multiple machines and to offer graph abstractions rather than just DataFrame ones. This is why GraphFrames is really nice: it runs on Spark and uses DataFrames, but it also has graph operations.

https://graphframes.github.io/graphframes/docs/_site/index.html

dkaslovsky commented 3 years ago

Ah, I see. A graphframes-based implementation sounds very appealing!

rjurney commented 1 year ago

@dkaslovsky in which PR? How?

dkaslovsky commented 1 year ago

@rjurney Apologies, reopening, this was in error.