Open BarclayII opened 3 years ago
Is there any new ongoing work to support distributed random walks on DistGraph
?
I'm working on a project where our graph is too large to fit on a single machine, hence the need for DistGraph
, but we'd like to use random-walk neighbors for GraphSAGE rather than N-hop neighbors (which is the current distributed GraphSAGE approach).
Related question: would it be straightforward to implement a distributed random walk using DGL DistGraph
python API functions?
I'm not super familiar with the DGL API yet, but I can see a solution where we re-implement rand_walk()
in terms of the distributed API functions (like via a distributed get_neighbors()
) rather than local graph API.
@erickim555 It's inefficient for random walk to go across different machine. Possibly we will support it by only random walk on the separate partitions, does this meet your requirement? I'm also wondering how big is your graph? Did you meet any performance issue with single machine?
@VoVAllen I see, thanks for the explanation! Random walks limited to a machine's separate partition is acceptable.
My graph is on the order of ~4B nodes, 1B-10B edges. To my understanding, the graph structure itself is large enough that it doesn't seem possible to fit it into a single machine (even on a memory-optimized AWS instance with ~1 TB memory). What do you think?
As there has been an increasing interest in this, I wonder what are the available options out there? A brief googling points me to https://github.com/KnightKingWalk/KnightKing, which is a SOSP'19 paper.
🚀 Feature
We are asked for distributed (metapath-based) random walk support so I'm bookkeeping the issue here.
Motivation
Large-scale network embedding for the following:
Short random walks are also necessary for loss computation in negative sampling, e.g.
Alternatives
Use a third-party tool to sample the random walk paths first. May work for network embedding methods but not that straightforward for loss computation with random walks (like GraphSAGE or GATNE).
Pitch
Efficient distributed random walk support for both homogeneous graph and heterogeneous graph (metapath-based in this case).