dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0
13.55k stars 3.01k forks source link

[FEA] Support a list of integers for fanout in graph.sample_neighbors #4326

Open VibhuJawa opened 2 years ago

VibhuJawa commented 2 years ago

🚀 Feature

Currently, in sample_neigbours we only support a single fanout value (see link), i think it will be useful to look into a way of supporting a list of values with the same call .

Motivation

Currently, in sample blocks (see below), we iterate for each value of fanout. Though this approach works when the data is present locally when we want to do this through an external service (say something like GAAS) this can become a bottleneck as we will need to transfer graphs back and forth from the service for each value of fanout.

This will mean we do a bunch of calls like below which can become really slow.

External Sampling Process -> Training Process -> External Sampling Process -> .....

https://github.com/dmlc/dgl/blob/d41d07d0f6cbed17993644b58057e280a9e8f011/python/dgl/dataloading/neighbor_sampler.py#L106-L120

Alternatives

The other alternative will be to write a NeighborSampler specific to graph-store objects like GAASStore where we don't want to do this back and forth .

Additional context

TODO: Add benchmark

jermainewang commented 2 years ago

Hi @VibhuJawa , by passing a list of integers, say [15, 10 ,5], do you mean the function should return a list of MFGs (or block graphs) with 15, 10, 5 fanouts respectively? If so, we probably should propose a new API of GraphStorage instead of extending the current sample_neighbors.

cc @BarclayII

VibhuJawa commented 2 years ago

Hi @VibhuJawa , by passing a list of integers, say [15, 10 ,5], do you mean the function should return a list of MFGs (or block graphs) with 15, 10, 5 fanouts respectively? If so, we probably should propose a new API of GraphStorage instead of extending the current sample_neighbors.

Yup. Thats exactly what i mean. When you say proposing a new API for GraphStorage are you thinking we should have a new API say something like sample_blocks , similar to sample_neigbors that we implement in the GraphStorage class and then call that function if present ?

The other alternative i was considering was to create a wrapper around the current NeighborSampler and overwrite the sample_nighbours to not make calls one at a time to sample_neighbors .

jermainewang commented 2 years ago

Yup. Thats exactly what i mean. When you say proposing a new API for GraphStorage are you thinking we should have a new API say something like sample_blocks , similar to sample_neigbors that we implement in the GraphStorage class and then call that function if present ?

That's right. Could you create a proposal about the interface? Eventually, we need to add it to the doc of GraphStorage so users know what to implement if they wish it to work with certain samplers. @BarclayII has created a draft doc here.

VibhuJawa commented 2 years ago

That's right. Could you create a proposal about the interface?

Thanks. I will get started on it.

VibhuJawa commented 2 years ago

@jermainewang , Raised RFC here: https://github.com/dmlc/dgl/issues/4498