Closed nicksukie closed 5 days ago
Change your code to:
indptr = self.graph.adj_tensors('csc')[0]
indices = self.graph.adj_tensors('csc')[1]
fused_graph = gb.fused_csc_sampling_graph(indptr, indices, edge_attributes={'sim': self.graph.edata['sim']})
seed_tensor = seed_nodes.unsqueeze(0) if seed_nodes.dim() == 1 else seed_nodes
item_set = gb.ItemSet(seed_tensor, names="seeds")
datapipe = gb.ItemSampler(item_set, batch_size=len(seed_nodes))
datapipe = datapipe.sample_neighbor(fused_graph, fanouts, replace=False, prob_name='sim' if self.sim_aggregate else None)
The edge_attributes parameter of gb.fused_csc_sampling_graph is used to initialize the edge attributes of dgl.graphbolt.FusedCSCSamplingGraph.
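To make the role of edge_attributes and prob_name more concrete, here is a minimal pure-Python sketch (no DGL involved) of a CSC graph whose per-edge attribute is looked up by name during weighted neighbor sampling. The helper names coo_to_csc and sample_in_neighbors, the toy graph, and the weighted-sampling trick are illustrative assumptions, not GraphBolt internals. It also shows one thing that may be worth verifying in your own code: CSC arrays are ordered by destination node, so a per-edge tensor stored in edge-ID order generally has to be permuted to match. As far as I know, adj_tensors('csc') returns the edge IDs as a third element, which could be used for that.

```python
import random

def coo_to_csc(num_nodes, src, dst, edge_attr):
    """Build CSC arrays (edges grouped by destination) and reorder a
    per-edge attribute so it lines up with the CSC edge order."""
    # Stable sort of edge ids by destination node.
    order = sorted(range(len(dst)), key=lambda e: dst[e])
    indptr = [0] * (num_nodes + 1)
    for d in dst:
        indptr[d + 1] += 1
    for v in range(num_nodes):
        indptr[v + 1] += indptr[v]
    indices = [src[e] for e in order]
    attr_csc = [edge_attr[e] for e in order]  # permuted to CSC order
    return indptr, indices, attr_csc, order

def sample_in_neighbors(indptr, indices, edge_attributes, node, fanout,
                        prob_name=None, rng=random):
    """Pick up to `fanout` in-neighbors of `node` without replacement.
    If `prob_name` is given, it is only a key into `edge_attributes`."""
    positions = list(range(indptr[node], indptr[node + 1]))
    if prob_name is None:
        rng.shuffle(positions)
    else:
        weights = edge_attributes[prob_name]
        # Edges with zero weight are never eligible.
        positions = [p for p in positions if weights[p] > 0]
        # Weighted sampling without replacement via random keys u**(1/w)
        # (an assumption for this sketch, not necessarily what GraphBolt does).
        positions.sort(key=lambda p: rng.random() ** (1.0 / weights[p]),
                       reverse=True)
    return [indices[p] for p in positions[:fanout]]

# Toy graph in edge-ID (COO) order, with a hypothetical 'sim' weight per edge.
src = [0, 1, 2, 3, 0]
dst = [2, 2, 0, 2, 1]
sim = [0.9, 0.0, 0.5, 0.1, 0.7]

indptr, indices, sim_csc, order = coo_to_csc(4, src, dst, sim)
edge_attributes = {"sim": sim_csc}
picked = sample_in_neighbors(indptr, indices, edge_attributes, node=2,
                             fanout=2, prob_name="sim",
                             rng=random.Random(0))
```

In this toy graph, node 2 has three in-edges, but the one with weight 0.0 can never be selected, so with a fanout of 2 the remaining two neighbors are always returned.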
Also, I would recommend trying out sample_layer_neighbor in place of sample_neighbor to see if it does the job for you. It is a drop-in replacement. More information about it can be found here: https://docs.dgl.ai/en/latest/generated/dgl.graphbolt.LayerNeighborSampler.html#dgl.graphbolt.LayerNeighborSampler
If you have a CUDA-enabled GPU, I would insert datapipe = datapipe.copy_to('cuda') right after the ItemSampler line so that the sampling operation can run on your GPU. For that, you need to move fused_graph to either pinned memory or GPU memory.
This did solve my issue. Thank you.
May I ask what the benefit is of using sample_layer_neighbor in place of sample_neighbor, and how it affects the output graph format?
Thanks again.
The output graph format is exactly the same. sample_layer_neighbor correlates the sampling procedures of your vertices so that the sampled neighborhoods overlap more. With multilayer sampling, you end up with significantly fewer sampled nodes and edges, which improves training throughput. Model convergence is unaffected by this difference.
More information can be found here: https://neurips.cc/virtual/2023/poster/71999
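As a rough illustration of what "correlating the sampling procedures" means, here is a simplified pure-Python sketch, loosely modeled on the simplest variant described in the linked paper. The function names and the toy graph are hypothetical, and GraphBolt's actual implementation differs in its details. The idea is that every candidate source node gets one random number that all seeds share, so seeds with common neighbors tend to pick the same ones.

```python
import random

def layer_neighbor_sample(in_neighbors, seeds, fanout, rng):
    """Correlated sampling: one uniform variate per source node, shared by
    all seeds. A seed with d neighbors keeps neighbor t if r[t] < fanout/d,
    so it keeps `fanout` neighbors in expectation (all of them if d <= fanout)."""
    r = {}
    sampled = {}
    for s in seeds:
        nbrs = in_neighbors.get(s, [])
        picked = []
        for t in nbrs:
            if t not in r:
                r[t] = rng.random()  # drawn once, reused for every seed
            if r[t] < fanout / len(nbrs):
                picked.append(t)
        sampled[s] = picked
    return sampled

def independent_neighbor_sample(in_neighbors, seeds, fanout, rng):
    """Plain neighbor sampling: each seed draws its neighbors independently."""
    sampled = {}
    for s in seeds:
        nbrs = in_neighbors.get(s, [])
        sampled[s] = sorted(rng.sample(nbrs, min(fanout, len(nbrs))))
    return sampled

# Seeds 0 and 1 share the same six candidate neighbors; seed 2 has only two.
in_neighbors = {
    0: [10, 11, 12, 13, 14, 15],
    1: [10, 11, 12, 13, 14, 15],
    2: [10, 11],
}
seeds = [0, 1, 2]

correlated = layer_neighbor_sample(in_neighbors, seeds, 2, random.Random(0))
independent = independent_neighbor_sample(in_neighbors, seeds, 2, random.Random(0))

unique_correlated = set().union(*correlated.values())
unique_independent = set().union(*independent.values())
```

Because seeds 0 and 1 have identical neighbor lists, the correlated sampler always gives them identical picks, whereas the independent sampler usually does not. Comparing the sizes of unique_correlated and unique_independent over many random seeds gives a feel for why fewer unique nodes survive to the next layer.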
For optimal performance, you should consider performing the sampling and feature fetch operations on the GPU by placing a copy_to in your sampling pipeline before these operations.
Understood. Thanks for sharing. I will look into this.
I have a follow-up issue. Not sure if you are able to help me with this one too @mfbalin:
Essentially, I want to know how to use GraphBolt for neighborhood aggregation. In older versions it looked like this (where data_flows is the output of dgl.contrib.sampling.NeighborSampler):
def encode(self, data_flows, training=True):
    # print(data_flows)
    x = self.embeddings
    nf = next(iter(data_flows))
    nf.copy_from_parent()
    nf.layers[0].data['activation'] = x[nf.layers[0].data['feature']]
    for i, layer in enumerate(self.layers):
        h = nf.layers[i].data.pop('activation')
        h = F.dropout(h, p=self.dropout, training=training)
        nf.layers[i].data['h'] = h
        nf.block_compute(i,
                         fn.copy_src(src='h', out='m'),
                         lambda node : {'h': node.mailbox['m'].mean(dim=1)},
                         layer)
    h = nf.layers[-1].data.pop('activation')
    return h
But Graphbolt doesn't allow for many of the same functions. Any guidance would be appreciated.
Btw, I have also posted my question here: https://discuss.dgl.ai/t/neighbor-sampling-and-aggregation-with-graphbolt/4457
The .blocks method (https://docs.dgl.ai/en/latest/generated/dgl.graphbolt.MiniBatch.html#dgl.graphbolt.MiniBatch.blocks) returns the DGL data structures that you can use to do model computations.
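For anyone mapping the old block_compute loop onto per-layer blocks, the computation itself is just a mean over each destination node's sampled in-neighbors, applied once per block. The following is a conceptual pure-Python sketch of that loop, not DGL code: the (edges, num_dst) representation of a block and the function names are assumptions made for illustration, and dropout and the per-layer transform from the original encode are left out. In DGL terms, this corresponds roughly to a copy_u message with a mean reducer applied to each block.

```python
def mean_aggregate(edges, h_src, num_dst):
    """Mean of source features over each destination's sampled in-edges.
    `edges` is a list of (src_index, dst_index) pairs local to one block."""
    dim = len(h_src[0])
    sums = [[0.0] * dim for _ in range(num_dst)]
    counts = [0] * num_dst
    for s, d in edges:
        counts[d] += 1
        for j in range(dim):
            sums[d][j] += h_src[s][j]
    # Destinations with no sampled in-edges get a zero vector.
    return [
        [v / counts[d] if counts[d] else 0.0 for v in sums[d]]
        for d in range(num_dst)
    ]

def encode(blocks, x):
    """Apply one mean aggregation per block, outermost layer first.
    The output of block i is the source feature matrix of block i + 1."""
    h = x
    for edges, num_dst in blocks:
        h = mean_aggregate(edges, h, num_dst)
    return h

# Two-layer toy example: 4 input nodes -> 2 intermediate nodes -> 1 seed.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
blocks = [
    ([(0, 0), (1, 0), (2, 1), (3, 1)], 2),  # layer 0
    ([(0, 0), (1, 0)], 1),                  # layer 1
]
out = encode(blocks, x)
```

The first block averages input nodes 0 and 1 into one intermediate node and nodes 2 and 3 into another; the second block averages those two into the single seed node.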
I'm trying to sample neighbors from a graph using GraphBolt and some pre-calculated probabilities.
My probabilities tensor exists as an edge attribute of the graph. When I run print(self.graph.edata['sim']), the tensor shows up clearly.
However, when I attempt neighbor sampling, my probabilities tensor is not recognized.
Error:
Perhaps they have to be converted into the fused_csc_sampling_graph format, but the input for prob_name is not clearly specified anywhere beyond being a string.
It's worth noting that I'm migrating my code from an older version of DGL in which neighbor sampling was done via dgl.contrib.sampling.NeighborSampler. The old method works like a charm, but unfortunately that version is not compatible with my current codebase. I also have not found any explanation of how to migrate neighbor-sampling code from contrib to the GraphBolt framework.
Any insights or assistance is very much appreciated.
Regards.
Environment
Installation method (conda, pip, source): pip