Closed by BradKML 3 years ago
Thank you very much for this interesting question.
BlueGraph is a meta-library that aims to provide a 'glue' between different graph processing and analytics libraries by implementing a unified API based on the property graph data model. BlueGraph focuses not only on the community detection capabilities or graph representation learning but on a larger set of graph processing and analytics tasks, such as:
Supported libraries are referred to as backends. BlueGraph supports the following backends:
- `networkx`, `graph-tool`, `neo4j`
- `stellargraph`, `neo4j`, `gensim`
In particular, BlueGraph allows working not only with in-memory graphs but also persistent property graphs (based on Neo4j database).
The main focus of BlueGraph is interfacing that allows stacking analytics tasks and creating custom analytics pipelines independently of the backend chosen to perform these tasks. BlueGraph does not aim to implement graph analytics tasks but to delegate them to user-specified backends.
BlueGraph implements a custom dataframe-based representation of property graphs with optional semantic property typing. See the `PGFrame` interface and its implementation `PandasPGFrame` (`SparkPGFrame` is in progress). Using BlueGraph as a framework, the user sets the desired backend explicitly (this choice may depend, for example, on scalability or persistence concerns).
`CDlib` or `karateclub`, on the other hand, re-use implementations of different community detection or graph representation learning techniques from various graph libraries (such as `networkx` or `igraph`) or implement such techniques themselves. In contrast to BlueGraph, they (in particular, `CDlib`) handle the different backend-dependent graph data structures, and the choice of the backend itself, implicitly.
Imagine we have a dataframe with occurrences of different terms in different scientific articles and their definitions from Wikipedia. For example:
Term | Definition | Papers |
---|---|---|
glucose | Glucose is a simple sugar with the molecular formula... | [paper1, paper2, paper4] |
calcium | Calcium is a chemical element with the symbol Ca... | [paper2, paper3, paper5] |
Using BlueGraph, we can perform the following sequence of tasks:
1. Create a `PandasPGFrame` for a co-occurrence graph whose nodes correspond to terms and whose edges correspond to term co-occurrences in the same papers (quantified by some measure of their co-occurrence frequency). Add definitions of terms to the node properties.
2. Use a `scikit-learn`-based property encoder (`ScikitLearnPGEncoder`) to convert textual definitions of terms into vectors using a Tf-Idf encoder.
3. Use `graph-tool` to compute centrality measures (such as PageRank and betweenness centrality). Here, for example, the user chooses `graph-tool` because of its superior performance in comparison to `networkx`.
4. Use `networkx` to perform community detection based on the Louvain algorithm. Here, `networkx` is chosen because `graph-tool` does not implement this algorithm.
5. Use `stellargraph`'s implementation of the GraphSAGE algorithm to compute node embeddings using the previously encoded term definitions as node features.
6. Build a `faiss`-based similarity index that allows us to perform node similarity queries based on their node embeddings.
7. Create an `EmbeddingPipeline` wrapping the previously built (and trained) property encoder, embedder and similarity index.

Note that none of these steps requires the user to convert the created property graph to any backend-specific representation or to know the interface of the backends.
`CDlib` and `karateclub`
In order to use the custom algorithms implemented by `CDlib` and `karateclub` (or the algorithms that these libraries interface to), one can, for example, implement the following:
- a `CDlibCommunityDetector` interface (as a part of `bluegraph.backends.analyse.cdlib.communities`) that would provide a connector to `CDlib` and allow users to access its rich library of community detection algorithms;
- a `KarateclubNode(Graph)Embedder` (as a part of `bluegraph.backends.embed.karateclub.embedders`) that would provide a connector to `karateclub` and allow users to access its rich library of graph representation learning algorithms.
In both cases, one can reuse an already implemented converter of `PGFrame` to `networkx` graphs.
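
The connector idea can be sketched roughly as follows. This is not BlueGraph code and does not reflect its actual interfaces: the class name `CommunityDetectorSketch`, its methods, and the dictionary-based graph representation are all hypothetical, and a simple connected-components function stands in for a real backend algorithm so that the example runs without third-party libraries. In a real connector, the conversion step would reuse the existing `PGFrame`-to-`networkx` converter, and the delegated call would be a `CDlib` algorithm.

```python
class CommunityDetectorSketch:
    """Hypothetical connector: converts a property graph to a backend
    representation, delegates to a backend algorithm, and writes the
    resulting community labels back as a node property."""

    def __init__(self, backend_algorithm):
        # backend_algorithm: callable taking an adjacency dict and returning
        # a list of communities, each given as a list of node ids.
        self.backend_algorithm = backend_algorithm

    @staticmethod
    def _to_backend_graph(nodes, edges):
        # Stand-in for a converter from a property graph to a backend graph.
        adjacency = {node: set() for node in nodes}
        for source, target in edges:
            adjacency[source].add(target)
            adjacency[target].add(source)
        return adjacency

    def detect_communities(self, nodes, edges, property_name="community"):
        graph = self._to_backend_graph(nodes, edges)
        communities = self.backend_algorithm(graph)
        # Store the index of each community as a property of its members.
        for label, members in enumerate(communities):
            for node in members:
                nodes[node][property_name] = label
        return nodes


def connected_components(adjacency):
    """Stand-in backend algorithm: every connected component is a community."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components


# Usage: the caller never touches the backend-specific representation.
nodes = {"glucose": {}, "calcium": {}, "insulin": {}}
edges = [("glucose", "calcium")]
detector = CommunityDetectorSketch(connected_components)
labelled = detector.detect_communities(nodes, edges)
```

The point of this design is that the backend is passed in explicitly while the graph conversion stays hidden behind the connector, which matches the explicit-backend approach described earlier in the thread.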
We are actively inviting the community to contribute to BlueGraph!
Are there any other libraries that speed up centrality calculations?
I was attempting to provide examples for https://github.com/jboynyc/textnets/issues/32, but even then, when the graph has more than 1e5 nodes, most indices grind to a halt. I am comparing this with existing systems in R (`CentiServer`, `CINNA`, `NetRankR`).
A bit of a dumb question: is there any way of wrapping networkx so that it is as fast as graph-tool? Alternatively, would Networkit provide similar speed-ups?
I don't think so (though this question should rather be forwarded to the `networkx` or `graph-tool` maintainers).
From graph-tool's main page:
> Graph-tool is an efficient Python module for manipulation and statistical analysis of graphs (a.k.a. networks). Contrary to most other Python modules with similar functionality, the core data structures and algorithms are implemented in C++, making extensive use of template metaprogramming, based heavily on the Boost Graph Library. This confers it a level of performance that is comparable (both in memory usage and computation time) to that of a pure C/C++ library.