Open rjurney opened 3 years ago
Hi @rjurney, thanks for your interest in using GraphRole. This package hasn't been tested at the scale you mention, and part of the implementation uses pandas, which might have problems at this scale.
One thing to note, though, is that GraphRole is not dependent on any particular graph library, so it can be integrated with any scalable graph library of your choice. All that needs to be done is to satisfy the required interface and make it discoverable. The steps are:
1. Subclass the BaseGraphInterface class in graphrole.graph.interface.base.py and implement the required methods
2. Add the new subclass to the INTERFACES dict in graphrole.graph.interface.__init__.py to make it discoverable
See full instructions in the README for setting up tests if so desired.
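The two steps above can be sketched roughly as follows. This is a self-contained illustration, not GraphRole's actual code: the `BaseGraphInterface` defined here is a local stand-in, the method names (`get_num_edges`, `get_nodes`, `get_neighbors`) and the registry key are assumptions made for the example, and `AdjacencyDictInterface` is a hypothetical backend. The real abstract methods are whatever graphrole.graph.interface.base.py declares.

```python
from abc import ABC, abstractmethod
from typing import Dict, Hashable, Iterable, Mapping, Set, Type


class BaseGraphInterface(ABC):
    """Local stand-in for the base class in graphrole.graph.interface.base.py.

    The method names are assumptions for illustration only.
    """

    @abstractmethod
    def get_num_edges(self) -> int:
        ...

    @abstractmethod
    def get_nodes(self) -> Iterable[Hashable]:
        ...

    @abstractmethod
    def get_neighbors(self, node: Hashable) -> Iterable[Hashable]:
        ...


class AdjacencyDictInterface(BaseGraphInterface):
    """Hypothetical backend: an undirected graph stored as a dict of sets.

    Assumes the input has no self-loops.
    """

    def __init__(self, adjacency: Mapping[Hashable, Iterable[Hashable]]) -> None:
        self._adj: Dict[Hashable, Set[Hashable]] = {
            node: set(nbrs) for node, nbrs in adjacency.items()
        }
        # Symmetrize so every edge is visible from both endpoints.
        for node, nbrs in adjacency.items():
            for nbr in nbrs:
                self._adj.setdefault(nbr, set()).add(node)

    def get_num_edges(self) -> int:
        # Each undirected edge appears in two neighbor sets.
        return sum(len(nbrs) for nbrs in self._adj.values()) // 2

    def get_nodes(self) -> Iterable[Hashable]:
        return self._adj.keys()

    def get_neighbors(self, node: Hashable) -> Iterable[Hashable]:
        return self._adj[node]


# Step 2: register the subclass so it can be looked up by name.
# The key "adjacency_dict" is an assumption for this sketch.
INTERFACES: Dict[str, Type[BaseGraphInterface]] = {
    "adjacency_dict": AdjacencyDictInterface,
}

graph = INTERFACES["adjacency_dict"]({"a": ["b", "c"], "b": ["c"]})
print(graph.get_num_edges(), sorted(graph.get_nodes()))
```

A GraphFrames- or Dask-backed subclass would follow the same shape, with each method delegating to the distributed library instead of an in-memory dict.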
I'd be very interested to know how it works out if you go down this route, please keep me posted!
@dkaslovsky thanks, this is really helpful. What you've done here is really cool, and I am encouraging the Deep Discovery team to implement this using PySpark and GraphFrames. If we do, we will contribute it back, but setting up testing and so on may take some time. We'll do an intermediate PR to get things started. cc @ajs-dd
That's really exciting to hear. I've thought about adding a more scalable dataframe library in the past, so I'm really excited that you and your team might look into implementing this, and I'd be grateful for any contribution back to GraphRole. Please let me know if there's any help I can provide along the way!
Oh, one other thought I forgot to mention is that Dask might also be a good option to explore for distributed dataframe functionality.
@dkaslovsky yeah, but we have a 1.5 billion node business graph so we need it to work across multiple machines and have graph rather than just DataFrame abstractions. This is why GraphFrames is really nice. It is on Spark and uses DataFrames but has graph operations.
https://graphframes.github.io/graphframes/docs/_site/index.html
Ah, I see. A graphframes-based implementation sounds very appealing!
@dkaslovsky in which PR? How?
@rjurney Apologies, reopening, this was in error.
We are interested in using this on a billion-node network. How well does it scale to large graphs? We can partition our network if required, but we don't know whether this is a multi-core implementation via networkx or something not likely to scale beyond small networks.