[RFC][DataLoader] Support `sample_blocks` API for non-DGLGraph objects

VibhuJawa commented 2 years ago

🚀 Feature

We should add support to sample_blocks API for non-DGLGraph objects to use in the sampling pipeline.

This will follow how we currently support sample_neighbors for Non-DGL Graph objects.

Document: https://hackmd.io/6PJixJj7Ssy19I-N9CV1Ag#Extending-DGL-Graph-Sampling-Pipeline Related Issue: https://github.com/dmlc/dgl/issues/3600

Motivation

Currently, in sample blocks (see below), we iterate for each value of fanout.

Though this approach works when the data is present locally when we want to do this through an external service (say something like GAAS) this can become a bottleneck as we will need to transfer graphs back and forth from the service for each value of fanout.

This will mean we do a bunch of calls like below which can become really slow.

External Sampling Process -> Training Process -> External Sampling Process -> .....

https://github.com/dmlc/dgl/blob/d41d07d0f6cbed17993644b58057e280a9e8f011/python/dgl/dataloading/neighbor_sampler.py#L106-L120

Pitch

We will have to modify the current sample_blocks to check if the underlying object has an attribute for sample_blocks and if yes call that directly and if not call the above iterative sample_neighbors.py

Additional context

nv-dlasalle commented 2 years ago

sample_blocks is the generic method for any sampling that outputs MFGs (e.g. would also apply for RandomWalks).

Could we generalize the sample_neighbors interface such that if fanout is a list (of int's or dict's), it samples multiple layers, and returns a list of frontiers? Then sample blocks could be implemented as:

def sample_blocks(self, g, seed_nodes, exclude_eids=None): 
     output_nodes = seed_nodes 
     blocks = [] 
     frontiers = g.sample_neighbors( 
         seed_nodes, self.fanouts, edge_dir=self.edge_dir, prob=self.prob, 
         replace=self.replace, output_device=self.output_device, 
         exclude_edges=exclude_eids)
     for frontier in reverse(frontiers):
         # the sampler can return a list of frontiers, or a list of tuples, where the
         # first element is the frontier and the second is the src_nodes to avoid
         # having to_block call unique unnecessarily. We'd expect most
         # implementations to return the last layer as only a frontier since its
         # source nodes are not needed for another layer.
         if isinstance(frontier, tuple):
             src_nodes = frontier[1]
             frontier = frontier[0]
         else:
             src_nodes = None
         eid = frontier.edata[EID] 
         block = to_block(frontier, dst_nodes=seed_nodes, src_nodes=src_nodes) 
         block.edata[EID] = eid 
         seed_nodes = block.srcdata[NID] 
         blocks.insert(0, block) 
     return seed_nodes, output_nodes, blocks

And DGLHeteroGraph's sample_neighbors could be wrapped as:

def sample_neighbors(g, nodes, fanout):
    if not isinstance(fanout, Sequence):
        fanout = [fanout]
    frontiers = []
    for i, k in reverse(enumerate(fanout)):
        frontier = sample_neighbors_single_layer(g, nodes, k)
        if i > 0:
            _, dsts = frontier.edges()
            nodes = torch.unique(dsts)
            frontiers.insert(0, (frontier, nodes))
        else:
            frontiers.insert(0, frontier)
    return frontiers

VibhuJawa commented 2 years ago

@nv-dlasalle , I agree with your approach too and actually lines up better with what we want to do with cuGraph.

This was initially suggested in https://github.com/dmlc/dgl/issues/4326 but based on discussion there we decided to add a sample_neighbors API instead (see comment) .

BarclayII commented 2 years ago

I think extending sample_neighbors will change the interface too drastically as sample_neighbors returns a subgraph (i.e a square CSR matrix), while the new sample_blocks interface will essentially return a list of rectangular CSR matrices, meaning that you would not need a separate to_block call.

nv-dlasalle commented 2 years ago

How will we differentiate a sample_blocks for different sampling types?

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

BarclayII commented 2 years ago

How about we add a new method sample_multilayer_neighbors that directly returns a list of blocks, and then in NeighborSampler.sample_blocks test if the graph storage has this method?

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

VibhuJawa commented 1 year ago

How about we add a new method sample_multilayer_neighbors that directly returns a list of blocks, and then in NeighborSampler.sample_blocks test if the graph storage has this method?

This sounds reasonable to me.

dmlc / dgl