JanusGraph / janusgraph

JanusGraph: an open-source, distributed graph database
https://janusgraph.org
Other
5.26k stars 1.16k forks source link

Super Node Series One: Introduce proxy node type #4420

Open li-boxuan opened 4 months ago

li-boxuan commented 4 months ago

This is a first attempt towards solving a famous problem in graph databases world - super node. A brief introduction to super node problem, and the status quo of JanusGraph, is documented at https://github.com/JanusGraph/janusgraph/discussions/2717. In a nutshell, JanusGraph has already partially addressed the traversal performance issue, but not the memory/storage issue.

Without native support of super node in JanusGraph (and many other graph databases), a lot of users end up building their own workarounds. Probably the most popular approach is to create "meta" vertices, or "proxy" vertices, and let the application layer redistributes the traversals & data, which is cumbersome and error-prone.

This PR aims to introduce proxy nodes, which work like the partition node before, except that proxy vertices are created only on-demand. There are two basic requirements:

1) User intervention should be as minimal as possible. 2) This feature shall bring zero or minimal overhead when there's no super node.

Note that the existing (but discouraged) partition node (a.k.a. vertex cut feature), addresses the 1st problem well but performs very bad at the 2nd requirement.

It takes a few steps to fully address 1st requirement, and this PR only addresses it partially. This PR requires users to explicitly create proxy nodes, connect them with the canonical node, and EXPLICITLY connect edges to the proxy nodes. This mostly fulfills the 2nd requirement: when there's no need, don't introduce new overhead. The drawback is that the write path is pretty cumbersome, but the good news is that, the read path offers seamless experience. Users could do normal traversal queries as if proxy nodes don't exist.

The brief design is as follows: let's say A is a super node, and A connects to a number of vertices with different labels. Let's say we have a proxy node for A, Vpa. We store proxies, an array of IDs, as a vertex property in the canonical node Va. Conversely, we store canonicalId, the ID of Va as a vertex property in Vpa. Every time we need to traverse from Va, we always fetch Vpa, and do the traversal from there. Let's say Va -.-.-.-> Vpa -------> Vb, then when we traverse from Vb, we will find Vpa first, and then we retrieve Va because Vpa is just a proxy for Va.

TODOs in this PR:

TODOs in subsequent PRs:

C.C. @dxtr-1-0 @rngcntr who expressed interests in this project


Thank you for contributing to JanusGraph!

In order to streamline the review of the contribution we ask you to ensure the following steps have been taken:

For all changes:

For code changes:

For documentation related changes:

li-boxuan commented 4 months ago

This work was created by me long time ago (before I joined my current company). Looks like to do any non-trivial contribution to open-source projects, I need to register. Thus, I'll close this PR for now and get back to it once I get approval.

li-boxuan commented 3 months ago

Yeah it looks like I can continue contributing to JanusGraph.