Open VibhuJawa opened 2 years ago
sample_blocks
is the generic method for any sampling that outputs MFGs (e.g. would also apply for RandomWalks).
Could we generalize the sample_neighbors
interface such that if fanout
is a list (of int
's or dict
's), it samples multiple layers, and returns a list of frontiers? Then sample blocks could be implemented as:
def sample_blocks(self, g, seed_nodes, exclude_eids=None):
output_nodes = seed_nodes
blocks = []
frontiers = g.sample_neighbors(
seed_nodes, self.fanouts, edge_dir=self.edge_dir, prob=self.prob,
replace=self.replace, output_device=self.output_device,
exclude_edges=exclude_eids)
for frontier in reverse(frontiers):
# the sampler can return a list of frontiers, or a list of tuples, where the
# first element is the frontier and the second is the src_nodes to avoid
# having to_block call unique unnecessarily. We'd expect most
# implementations to return the last layer as only a frontier since its
# source nodes are not needed for another layer.
if isinstance(frontier, tuple):
src_nodes = frontier[1]
frontier = frontier[0]
else:
src_nodes = None
eid = frontier.edata[EID]
block = to_block(frontier, dst_nodes=seed_nodes, src_nodes=src_nodes)
block.edata[EID] = eid
seed_nodes = block.srcdata[NID]
blocks.insert(0, block)
return seed_nodes, output_nodes, blocks
And DGLHeteroGraph's sample_neighbors
could be wrapped as:
def sample_neighbors(g, nodes, fanout):
if not isinstance(fanout, Sequence):
fanout = [fanout]
frontiers = []
for i, k in reverse(enumerate(fanout)):
frontier = sample_neighbors_single_layer(g, nodes, k)
if i > 0:
_, dsts = frontier.edges()
nodes = torch.unique(dsts)
frontiers.insert(0, (frontier, nodes))
else:
frontiers.insert(0, frontier)
return frontiers
@nv-dlasalle , I agree with your approach too and actually lines up better with what we want to do with cuGraph.
This was initially suggested in https://github.com/dmlc/dgl/issues/4326 but based on discussion there we decided to add a sample_neighbors API instead (see comment) .
I think extending sample_neighbors
will change the interface too drastically as sample_neighbors
returns a subgraph (i.e a square CSR matrix), while the new sample_blocks
interface will essentially return a list of rectangular CSR matrices, meaning that you would not need a separate to_block
call.
How will we differentiate a sample_blocks
for different sampling types?
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
How about we add a new method sample_multilayer_neighbors
that directly returns a list of blocks, and then in NeighborSampler.sample_blocks
test if the graph storage has this method?
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
How about we add a new method
sample_multilayer_neighbors
that directly returns a list of blocks, and then inNeighborSampler.sample_blocks
test if the graph storage has this method?
This sounds reasonable to me.
š Feature
We should add support to
sample_blocks
API for non-DGLGraph objects to use in the sampling pipeline.This will follow how we currently support
sample_neighbors
for Non-DGL Graph objects.Document: https://hackmd.io/6PJixJj7Ssy19I-N9CV1Ag#Extending-DGL-Graph-Sampling-Pipeline Related Issue: https://github.com/dmlc/dgl/issues/3600
Motivation
Currently, in sample blocks (see below), we iterate for each value of
fanout.
Though this approach works when the data is present locally when we want to do this through an external service (say something like
GAAS
) this can become a bottleneck as we will need to transfer graphs back and forth from the service for each value of fanout.This will mean we do a bunch of calls like below which can become really slow.
https://github.com/dmlc/dgl/blob/d41d07d0f6cbed17993644b58057e280a9e8f011/python/dgl/dataloading/neighbor_sampler.py#L106-L120
Pitch
We will have to modify the current sample_blocks to check if the underlying object has an attribute for
sample_blocks
and if yes call that directly and if not call the above iterative sample_neighbors.pyAdditional context
Related issue: https://github.com/dmlc/dgl/issues/4326