BlueBrain / BlueGraph

Python framework for graph analytics and co-occurrence analysis
https://bluegraph.readthedocs.io
Apache License 2.0

Question on BlueGraph compared to other Libraries #93

Closed BradKML closed 3 years ago

BradKML commented 3 years ago
  1. How is BlueGraph different from CDlib (with igraph and NetworkX)?
  2. How is BlueGraph different from KarateClub (regarding the number of node embedding algorithms)?
eugeniashurko commented 3 years ago

Thank you very much for this interesting question.

BlueGraph is a meta-library that aims to provide 'glue' between different graph processing and analytics libraries by implementing a unified API based on the property graph data model. BlueGraph focuses not only on community detection or graph representation learning but on a larger set of graph processing and analytics tasks, such as property encoding, centrality measures, node embedding, and similarity search.

Supported libraries are referred to as backends; among others, BlueGraph provides backends based on NetworkX, graph-tool, StellarGraph, and Neo4j.

In particular, BlueGraph allows working not only with in-memory graphs but also with persistent property graphs (backed by a Neo4j database).

The main focus of BlueGraph is interfacing that allows stacking analytics tasks and creating custom analytics pipelines independently of the backend chosen to perform these tasks. BlueGraph does not aim to implement graph analytics tasks but to delegate them to user-specified backends.

BlueGraph implements a custom dataframe-based representation of property graphs with optional semantic property typing. See the PGFrame interface and its pandas-based implementation PandasPGFrame (SparkPGFrame is in progress). Using BlueGraph as a framework, the user sets the desired backend explicitly (this choice may depend, for example, on scalability or persistence concerns).

CDlib and karateclub, on the other hand, re-use implementations of different community detection or graph representation learning techniques from various graph libraries (such as networkx or igraph), or implement such techniques themselves. In contrast to BlueGraph, they (CDlib in particular) handle the different backend-dependent graph data structures, or the choice of backend itself, implicitly.

Toy example of an analytics pipeline based on BlueGraph

Imagine we have a dataframe with occurrences of different terms in different scientific articles and their definitions from Wikipedia. For example:

| Term | Definition | Papers |
| --- | --- | --- |
| glucose | Glucose is a simple sugar with the molecular formula... | [paper1, paper2, paper4] |
| calcium | Calcium is a chemical element with the symbol Ca... | [paper2, paper3, paper5] |

Using BlueGraph we can perform the following sequence of tasks:

  1. Create a PandasPGFrame for a co-occurrence graph whose nodes correspond to terms, and edges to term co-occurrences in the same papers (quantified by some measure of their co-occurrence frequency). Add definitions of terms to the node properties.
  2. Use the scikit-learn-based property encoder (ScikitLearnPGEncoder) to convert the textual definitions of terms into vectors using a TF-IDF encoder.
  3. Use graph-tool to compute centrality measures (such as PageRank and betweenness centrality). Here, for example, the user chooses graph-tool due to its superior performance compared to networkx.
  4. Use networkx to perform community detection based on the Louvain algorithm. Here, networkx is chosen because graph-tool does not implement this algorithm.
  5. Add the centrality measures and the community number to the node properties.
  6. Use stellargraph's implementation of the GraphSAGE algorithm to compute node embeddings using previously encoded term definitions as node features.
  7. Add produced vectors to node properties.
  8. Train a simple classification model for predicting a node's community using the generated node embeddings as features.
  9. Create a faiss-based similarity index that allows us to perform node similarity queries based on their node embedding.
  10. Create an instance of EmbeddingPipeline wrapping the previously built (and trained) property encoder, embedder and similarity index.
  11. Save this pipeline to be used later for predicting embedding vectors of new, previously unseen, nodes (terms).
  12. Finally, create an instance of a Neo4j database from the created graph (effectively persisting the graph with all the node/edge properties in a database).

Note that none of these steps require the user to convert the created property graph to any backend-specific representation or to know the interface of the backends.
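For intuition, the first few steps can be sketched with plain pandas, scikit-learn, and networkx, i.e. the backend-level work that BlueGraph abstracts away (this is not BlueGraph's API; the data follows the toy table above, with a made-up third term added so the graph is less trivial):

```python
from itertools import combinations

import networkx as nx
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data from the table above, plus a made-up "sodium" row.
df = pd.DataFrame({
    "Term": ["glucose", "calcium", "sodium"],
    "Definition": [
        "Glucose is a simple sugar with the molecular formula C6H12O6",
        "Calcium is a chemical element with the symbol Ca",
        "Sodium is a chemical element with the symbol Na",
    ],
    "Papers": [
        {"paper1", "paper2", "paper4"},
        {"paper2", "paper3", "paper5"},
        {"paper1", "paper2"},
    ],
})

# Step 1: co-occurrence graph -- an edge between two terms,
# weighted by the number of papers they share.
G = nx.Graph()
G.add_nodes_from(df["Term"])
for (t1, p1), (t2, p2) in combinations(zip(df["Term"], df["Papers"]), 2):
    shared = len(p1 & p2)
    if shared:
        G.add_edge(t1, t2, frequency=shared)

# Step 2: encode definitions as TF-IDF vectors (node features).
features = TfidfVectorizer().fit_transform(df["Definition"])

# Step 3: centrality measures, attached back as node properties.
pagerank = nx.pagerank(G, weight="frequency")
nx.set_node_attributes(G, pagerank, "pagerank")

print(G["glucose"]["sodium"]["frequency"])  # two shared papers
```

In BlueGraph these operations go through a single PGFrame, so no step requires touching the networkx or scikit-learn objects directly.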

Possible interaction with CDlib and karateclub

In order to use the custom algorithms implemented by CDlib and karateclub (or the algorithms that these libraries interface to), one can, for example, implement the following:

  1. a CDlibCommunityDetector interface (as a part of bluegraph.backends.analyse.cdlib.communities) that would provide a connector to CDlib and give users access to its rich library of community detection algorithms;
  2. a KarateclubNode(Graph)Embedder (as a part of bluegraph.backends.embed.karateclub.embedders) that would provide a connector to karateclub and give users access to its rich library of graph representation learning algorithms.

In both cases, one can reuse an already implemented converter of PGFrame to networkx graphs.
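Such a converter amounts to something like the following (a hypothetical helper, not BlueGraph's actual converter code; it assumes plain node and edge dataframes in the spirit of PGFrame):

```python
import networkx as nx
import pandas as pd

def frames_to_networkx(nodes: pd.DataFrame, edges: pd.DataFrame) -> nx.Graph:
    """Convert node/edge property dataframes into a networkx graph.

    `nodes` is indexed by node id; `edges` by (source, target) pairs.
    All remaining columns become node/edge properties.
    (Hypothetical helper sketching the converter mentioned above.)
    """
    graph = nx.Graph()
    for node_id, props in nodes.iterrows():
        graph.add_node(node_id, **props.to_dict())
    for (source, target), props in edges.iterrows():
        graph.add_edge(source, target, **props.to_dict())
    return graph

nodes = pd.DataFrame(
    {"definition": ["a simple sugar", "a chemical element"]},
    index=["glucose", "calcium"],
)
edges = pd.DataFrame(
    {"frequency": [1]},
    index=pd.MultiIndex.from_tuples([("glucose", "calcium")]),
)
g = frames_to_networkx(nodes, edges)
print(g.nodes["glucose"]["definition"])  # a simple sugar
```

The resulting networkx graph can then be fed to CDlib or karateclub as in their own documentation.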

We are actively inviting the community to contribute to BlueGraph!

BradKML commented 3 years ago

Are there any other libraries that speed up centrality calculations? I was attempting to provide examples for https://github.com/jboynyc/textnets/issues/32, but when the graph has more than 1e5 nodes most indices grind to a halt. I am comparing this with existing R systems (CentiServer, CINNA, NetRankR).

BradKML commented 3 years ago

A bit of a dumb question: is there any way of wrapping networkx to be as fast as graph-tool? Alternatively, would Networkit provide similar speed-ups?

eugeniashurko commented 3 years ago

I don't think so (though this question is better directed to the networkx or graph-tool maintainers). From graph-tool's main page:

Graph-tool is an efficient Python module for manipulation and statistical analysis of graphs (a.k.a. networks). Contrary to most other Python modules with similar functionality, the core data structures and algorithms are implemented in C++, making extensive use of template metaprogramming, based heavily on the Boost Graph Library. This confers it a level of performance that is comparable (both in memory usage and computation time) to that of a pure C/C++ library.