dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

[GraphBolt] remove buffer in `ItemSampler.__iter__` #7430

Closed. Skeleton003 closed this 1 month ago

Skeleton003 commented 1 month ago

Description

Now that global sharding is applied, the buffer is no longer needed when iterating over the `ItemSampler`.

This PR removes it from `ItemSampler.__iter__` for simplicity.
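To illustrate the idea, here is a minimal sketch of global sharding in plain Python. It is not the GraphBolt implementation: `shard_range` and `iter_batches` are hypothetical helpers, and the contiguous-slice scheme is an assumption made for the example. The point is that once each worker of each replica owns a fixed index range, it can slice batches directly from that range without an intermediate buffer.

```python
def shard_range(total, num_replicas, rank, num_workers, worker_id):
    """Return the [start, end) item range owned by one worker of one replica.

    Hypothetical helper: items are split into num_replicas * num_workers
    contiguous shards, and the first `rem` shards get one extra item.
    """
    num_shards = num_replicas * num_workers
    shard_id = rank * num_workers + worker_id
    base, rem = divmod(total, num_shards)
    start = shard_id * base + min(shard_id, rem)
    end = start + base + (1 if shard_id < rem else 0)
    return start, end


def iter_batches(items, batch_size, start, end):
    """Yield batches from items[start:end] by direct slicing, with no buffer."""
    for lo in range(start, end, batch_size):
        yield items[lo:min(lo + batch_size, end)]


if __name__ == "__main__":
    items = list(range(10))
    # Two replicas with two workers each: four disjoint shards cover all items.
    for rank in range(2):
        for worker_id in range(2):
            start, end = shard_range(len(items), 2, rank, 2, worker_id)
            print(rank, worker_id, list(iter_batches(items, 2, start, end)))
```

Because the shards are disjoint and together cover every index, no coordination between workers is needed during iteration in this sketch.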

Benchmark:

examples/sampling/graphbolt/node_classification.py

Before:

$ python /home/ubuntu/dgl/examples/sampling/graphbolt/node_classification.py 
Training in pinned-cuda mode.
Loading data...
The dataset is already preprocessed.
Training...
Training: 193it [00:08, 21.83it/s]
Evaluating: 39it [00:01, 24.40it/s]
Epoch 00000 | Loss 1.2657 | Accuracy 0.8605 | Time 8.8440
Training: 193it [00:08, 22.78it/s]
Evaluating: 39it [00:01, 24.42it/s]
Epoch 00001 | Loss 0.5886 | Accuracy 0.8753 | Time 8.4753
Training: 193it [00:08, 22.78it/s]
Evaluating: 39it [00:01, 24.41it/s]
Epoch 00002 | Loss 0.4930 | Accuracy 0.8853 | Time 8.4744
Training: 193it [00:08, 22.71it/s]
Evaluating: 39it [00:01, 24.42it/s]
Epoch 00003 | Loss 0.4465 | Accuracy 0.8902 | Time 8.5024
Training: 193it [00:08, 22.80it/s]
Evaluating: 39it [00:01, 24.42it/s]
Epoch 00004 | Loss 0.4220 | Accuracy 0.8917 | Time 8.4664
Training: 193it [00:08, 22.73it/s]
Evaluating: 39it [00:01, 24.39it/s]
Epoch 00005 | Loss 0.4106 | Accuracy 0.8948 | Time 8.4937
Training: 193it [00:08, 22.73it/s]
Evaluating: 39it [00:01, 24.42it/s]
Epoch 00006 | Loss 0.3896 | Accuracy 0.8987 | Time 8.4935
Training: 193it [00:08, 22.79it/s]
Evaluating: 39it [00:01, 24.43it/s]
Epoch 00007 | Loss 0.3754 | Accuracy 0.9005 | Time 8.4725
Training: 193it [00:08, 22.79it/s]
Evaluating: 39it [00:01, 24.40it/s]
Epoch 00008 | Loss 0.3663 | Accuracy 0.9033 | Time 8.4711
Training: 193it [00:08, 22.79it/s]
Evaluating: 39it [00:01, 24.41it/s]
Epoch 00009 | Loss 0.3616 | Accuracy 0.9030 | Time 8.4700
Testing...
598it [00:07, 79.14it/s]
598it [00:17, 34.50it/s]
598it [00:17, 34.14it/s]
Test accuracy 0.7673

After:

$ python /home/ubuntu/dgl/examples/sampling/graphbolt/node_classification.py 
Training in pinned-cuda mode.
Loading data...
The dataset is already preprocessed.
Training...
Training: 193it [00:09, 21.24it/s]
Evaluating: 39it [00:01, 24.46it/s]
Epoch 00000 | Loss nan | Accuracy 0.8554 | Time 9.0887
Training: 193it [00:08, 22.84it/s]
Evaluating: 39it [00:01, 24.47it/s]
Epoch 00001 | Loss nan | Accuracy 0.8743 | Time 8.4546
Training: 193it [00:08, 22.49it/s]
Evaluating: 39it [00:01, 24.47it/s]
Epoch 00002 | Loss nan | Accuracy 0.8832 | Time 8.5839
Training: 193it [00:08, 22.83it/s]
Evaluating: 39it [00:01, 24.48it/s]
Epoch 00003 | Loss nan | Accuracy 0.8845 | Time 8.4581
Training: 193it [00:08, 22.58it/s]
Evaluating: 39it [00:01, 24.44it/s]
Epoch 00004 | Loss nan | Accuracy 0.8941 | Time 8.5520
Training: 193it [00:08, 22.43it/s]
Evaluating: 39it [00:01, 24.48it/s]
Epoch 00005 | Loss nan | Accuracy 0.8952 | Time 8.6080
Training: 193it [00:08, 22.49it/s]
Evaluating: 39it [00:01, 24.49it/s]
Epoch 00006 | Loss nan | Accuracy 0.8911 | Time 8.5830
Training: 193it [00:08, 22.78it/s]
Evaluating: 39it [00:01, 24.48it/s]
Epoch 00007 | Loss nan | Accuracy 0.8951 | Time 8.4745
Training: 193it [00:08, 22.81it/s]
Evaluating: 39it [00:01, 24.48it/s]
Epoch 00008 | Loss nan | Accuracy 0.8991 | Time 8.4632
Training: 193it [00:08, 22.84it/s]
Evaluating: 39it [00:01, 24.48it/s]
Epoch 00009 | Loss nan | Accuracy 0.8917 | Time 8.4522
Testing...
598it [00:07, 76.75it/s]
598it [00:17, 34.47it/s]
598it [00:17, 34.14it/s]
Test accuracy 0.7523
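
The per-epoch lines in logs like the two above can be compared with a short script. The sketch below is only an illustration: `parse_epochs` and `summarize` are hypothetical helpers, the regular expression assumes the `Epoch ... | Loss ... | Accuracy ... | Time ...` format printed by this example, and the sample strings are the first two epoch lines copied from each run. It reports the mean epoch time, the final accuracy, and any epochs whose loss parsed as NaN.

```python
import math
import re

# Matches lines such as:
# Epoch 00000 | Loss 1.2657 | Accuracy 0.8605 | Time 8.8440
EPOCH_RE = re.compile(
    r"Epoch (\d+) \| Loss (\S+) \| Accuracy (\S+) \| Time (\S+)"
)


def parse_epochs(log_text):
    """Extract one record per epoch line found in the log text."""
    rows = []
    for match in EPOCH_RE.finditer(log_text):
        epoch, loss, accuracy, seconds = match.groups()
        rows.append(
            {
                "epoch": int(epoch),
                "loss": float(loss),  # float("nan") parses to NaN
                "accuracy": float(accuracy),
                "time": float(seconds),
            }
        )
    return rows


def summarize(rows):
    """Return mean epoch time, last accuracy, and epochs with NaN loss."""
    times = [row["time"] for row in rows]
    return {
        "mean_time": sum(times) / len(times),
        "final_accuracy": rows[-1]["accuracy"],
        "nan_loss_epochs": [r["epoch"] for r in rows if math.isnan(r["loss"])],
    }


BEFORE_SAMPLE = """\
Epoch 00000 | Loss 1.2657 | Accuracy 0.8605 | Time 8.8440
Epoch 00001 | Loss 0.5886 | Accuracy 0.8753 | Time 8.4753
"""

AFTER_SAMPLE = """\
Epoch 00000 | Loss nan | Accuracy 0.8554 | Time 9.0887
Epoch 00001 | Loss nan | Accuracy 0.8743 | Time 8.4546
"""

if __name__ == "__main__":
    for name, text in (("before", BEFORE_SAMPLE), ("after", AFTER_SAMPLE)):
        print(name, summarize(parse_epochs(text)))
```

Pasting the full logs into the two sample strings gives a summary over all ten epochs of each run.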

examples/sampling/graphbolt/rgcn/hetero_rgcn.py

Before:

$ python /home/ubuntu/dgl/examples/sampling/graphbolt/rgcn/hetero_rgcn.py
The dataset is already preprocessed.
Loaded dataset: node_classification
node_num for rel_graph_embed: {'author': tensor(1134649, dtype=torch.int32), 'field_of_study': tensor(59965, dtype=torch.int32), 'institution': tensor(8740, dtype=torch.int32)}
Number of embedding parameters: 154029312
Number of model parameters: 337460
Start to train...
Training~Epoch 01: 615it [00:59, 10.30it/s]
Evaluating the model on the validation set.
Inference: 16it [00:00, 21.50it/s]
Finish evaluating on validation set.
Epoch: 01, Loss: 2.3330, Valid accuracy: 47.38%, Time 59.7114
Training~Epoch 02: 615it [00:59, 10.40it/s]
Evaluating the model on the validation set.
Inference: 16it [00:00, 21.61it/s]
Finish evaluating on validation set.
Epoch: 02, Loss: 1.5593, Valid accuracy: 47.69%, Time 59.1281
Training~Epoch 03: 615it [00:57, 10.73it/s]
Evaluating the model on the validation set.
Inference: 16it [00:00, 21.64it/s]
Finish evaluating on validation set.
Epoch: 03, Loss: 1.1594, Valid accuracy: 47.37%, Time 57.2960
Testing...
Inference: 11it [00:00, 21.96it/s]
Test accuracy 46.0311

After:

$ python /home/ubuntu/dgl/examples/sampling/graphbolt/rgcn/hetero_rgcn.py
The dataset is already preprocessed.
Loaded dataset: node_classification
node_num for rel_graph_embed: {'author': tensor(1134649, dtype=torch.int32), 'field_of_study': tensor(59965, dtype=torch.int32), 'institution': tensor(8740, dtype=torch.int32)}
Number of embedding parameters: 154029312
Number of model parameters: 337460
Start to train...
Training~Epoch 01: 615it [00:59, 10.26it/s]
Evaluating the model on the validation set.
Inference: 16it [00:00, 21.48it/s]
Finish evaluating on validation set.
Epoch: 01, Loss: 2.3207, Valid accuracy: 47.72%, Time 59.9646
Training~Epoch 02: 615it [00:59, 10.40it/s]
Evaluating the model on the validation set.
Inference: 16it [00:00, 21.58it/s]
Finish evaluating on validation set.
Epoch: 02, Loss: 1.5420, Valid accuracy: 47.73%, Time 59.1323
Training~Epoch 03: 615it [00:56, 10.91it/s]
Evaluating the model on the validation set.
Inference: 16it [00:00, 21.54it/s]
Finish evaluating on validation set.
Epoch: 03, Loss: 1.1368, Valid accuracy: 46.45%, Time 56.3553
Testing...
Inference: 11it [00:00, 21.85it/s]
Test accuracy 45.5781

dgl-bot commented 1 month ago

Commit ID: 4cdc943bedf3b19558d178b84e6405b2a477e619

Build ID: 1

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link