Closed Skeleton003 closed 4 months ago
To trigger regression tests:
@dgl-bot run [instance-type] [which tests] [compare-with-branch]
For example: @dgl-bot run g4dn.4xlarge all dmlc/master
or @dgl-bot run c5.9xlarge kernel,api dmlc/master
@Rhett-Ying The benchmark shows that the variation in performance is acceptable. I'm trying to find a way for all replicas to obtain a random seed from the main process instead of having the user set it manually, but that is a separate topic. For now, I think we can merge this PR first.
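As a rough illustration of why the replicas need a common seed, here is a minimal pure-Python sketch. It assumes each replica shuffles the full ID list and keeps a strided slice by rank; that scheme, the helper names `replica_ids` and `coverage`, and the sizes used (36 IDs, 4 replicas) are assumptions for illustration, not the code in this PR.

```python
import random


def replica_ids(num_ids, rank, world_size, seed):
    """IDs one replica would process in an epoch (hypothetical helper)."""
    ids = list(range(num_ids))
    # The same seed yields the same permutation on every replica.
    random.Random(seed).shuffle(ids)
    # Each rank keeps every world_size-th element, starting at its own rank.
    return ids[rank::world_size]


def coverage(num_ids, world_size, seeds):
    """Collect the IDs seen by all replicas, given one seed per replica."""
    seen = []
    for rank in range(world_size):
        seen.extend(replica_ids(num_ids, rank, world_size, seeds[rank]))
    return sorted(seen)


if __name__ == "__main__":
    shared = coverage(36, 4, [7, 7, 7, 7])
    mixed = coverage(36, 4, [0, 1, 2, 3])
    # With a shared seed the slices partition the ID range exactly once.
    print("shared seed covers every ID once:", shared == list(range(36)))
    # With per-replica seeds the slices come from different permutations,
    # so IDs may be duplicated or skipped.
    print("distinct IDs with mixed seeds:", len(set(mixed)), "of 36")
```

With a shared seed the slices are disjoint by construction; with independent seeds nothing guarantees that, which is why the seed has to come from a single source.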
num_ids: 36, num_workers: 2
Is num_ids the total number of items in the ItemSet or ItemSetDict? If yes, it's too small and not persuasive.
Benchmark on /dgl/examples/multigpu/graphbolt/node_classification.py:
Old:
$ python /home/ubuntu/dgl/examples/multigpu/graphbolt/node_classification.py --gpu 0,1,2,3
Training with 4 gpus.
The dataset is already preprocessed.
Training...
48it [00:02, 16.06it/s]
Validating...
10it [00:00, 21.67it/s]
Epoch 00000 | Average Loss 2.3267 | Accuracy 0.7917 | Time 3.5637
48it [00:02, 21.37it/s]
Validating...
10it [00:00, 24.19it/s]
Epoch 00001 | Average Loss 0.9559 | Accuracy 0.8437 | Time 2.7528
48it [00:02, 21.33it/s]
Validating...
10it [00:00, 24.10it/s]
Epoch 00002 | Average Loss 0.7238 | Accuracy 0.8602 | Time 2.7597
48it [00:02, 21.33it/s]
Validating...
10it [00:00, 24.51it/s]
Epoch 00003 | Average Loss 0.6163 | Accuracy 0.8706 | Time 2.7502
48it [00:02, 21.45it/s]
Validating...
10it [00:00, 24.45it/s]
Epoch 00004 | Average Loss 0.5578 | Accuracy 0.8762 | Time 2.7404
48it [00:02, 20.19it/s]
Validating...
10it [00:00, 24.57it/s]
Epoch 00005 | Average Loss 0.5176 | Accuracy 0.8819 | Time 2.8776
48it [00:02, 21.50it/s]
Validating...
10it [00:00, 24.13it/s]
Epoch 00006 | Average Loss 0.4883 | Accuracy 0.8855 | Time 2.7396
48it [00:02, 21.42it/s]
Validating...
10it [00:00, 24.41it/s]
Epoch 00007 | Average Loss 0.4667 | Accuracy 0.8881 | Time 2.7437
48it [00:02, 21.31it/s]
Validating...
10it [00:00, 24.19it/s]
Epoch 00008 | Average Loss 0.4477 | Accuracy 0.8889 | Time 2.7596
48it [00:02, 21.46it/s]
Validating...
10it [00:00, 24.29it/s]
Epoch 00009 | Average Loss 0.4343 | Accuracy 0.8920 | Time 2.7416
Testing...
541it [00:19, 27.95it/s]
Test Accuracy 0.7348
New:
$ python /home/ubuntu/dgl/examples/multigpu/graphbolt/node_classification.py --gpu 0,1,2,3
Training with 4 gpus.
The dataset is already preprocessed.
Training...
48it [00:03, 15.84it/s]
Validating...
10it [00:00, 22.02it/s]
Epoch 00000 | Average Loss 2.3048 | Accuracy 0.7777 | Time 3.5975
48it [00:02, 21.28it/s]
Validating...
10it [00:00, 25.05it/s]
Epoch 00001 | Average Loss 0.9804 | Accuracy 0.8388 | Time 2.7448
48it [00:02, 21.31it/s]
Validating...
10it [00:00, 24.98it/s]
Epoch 00002 | Average Loss 0.7427 | Accuracy 0.8587 | Time 2.7464
48it [00:02, 21.43it/s]
Validating...
10it [00:00, 25.03it/s]
Epoch 00003 | Average Loss 0.6308 | Accuracy 0.8696 | Time 2.7333
48it [00:02, 21.40it/s]
Validating...
10it [00:00, 25.19it/s]
Epoch 00004 | Average Loss 0.5623 | Accuracy 0.8785 | Time 2.7332
48it [00:02, 20.29it/s]
Validating...
10it [00:00, 24.69it/s]
Epoch 00005 | Average Loss 0.5228 | Accuracy 0.8815 | Time 2.8657
48it [00:02, 21.37it/s]
Validating...
10it [00:00, 24.89it/s]
Epoch 00006 | Average Loss 0.4937 | Accuracy 0.8850 | Time 2.7418
48it [00:02, 21.41it/s]
Validating...
10it [00:00, 25.01it/s]
Epoch 00007 | Average Loss 0.4696 | Accuracy 0.8879 | Time 2.7378
48it [00:02, 21.36it/s]
Validating...
10it [00:00, 25.03it/s]
Epoch 00008 | Average Loss 0.4537 | Accuracy 0.8909 | Time 2.7409
48it [00:02, 21.40it/s]
Validating...
10it [00:00, 24.88it/s]
Epoch 00009 | Average Loss 0.4388 | Accuracy 0.8932 | Time 2.7407
Testing...
541it [00:19, 27.96it/s]
Test Accuracy 0.7393
Old:
$ python /home/ubuntu/dgl/examples/multigpu/graphbolt/node_classification.py --gpu 0,1,2,3 --dataset ogbn-arxiv
Training with 4 gpus.
The dataset is already preprocessed.
Training...
22it [00:01, 21.57it/s]
Validating...
8it [00:00, 52.40it/s]
Epoch 00000 | Average Loss 3.2543 | Accuracy 0.3002 | Time 1.2109
22it [00:00, 54.33it/s]
Validating...
8it [00:00, 70.41it/s]
Epoch 00001 | Average Loss 2.5287 | Accuracy 0.4404 | Time 0.5230
22it [00:00, 59.90it/s]
Validating...
8it [00:00, 71.66it/s]
Epoch 00002 | Average Loss 2.1985 | Accuracy 0.5054 | Time 0.4818
22it [00:00, 54.64it/s]
Validating...
8it [00:00, 86.39it/s]
Epoch 00003 | Average Loss 1.9795 | Accuracy 0.5349 | Time 0.4978
22it [00:00, 57.34it/s]
Validating...
8it [00:00, 78.11it/s]
Epoch 00004 | Average Loss 1.8419 | Accuracy 0.5529 | Time 0.4944
22it [00:00, 42.99it/s]
Validating...
8it [00:00, 73.39it/s]
Epoch 00005 | Average Loss 1.7533 | Accuracy 0.5649 | Time 0.6252
22it [00:00, 56.13it/s]
Validating...
8it [00:00, 76.69it/s]
Epoch 00006 | Average Loss 1.6852 | Accuracy 0.5713 | Time 0.5014
22it [00:00, 52.51it/s]
Validating...
8it [00:00, 79.52it/s]
Epoch 00007 | Average Loss 1.6405 | Accuracy 0.5766 | Time 0.5221
22it [00:00, 59.19it/s]
Validating...
8it [00:00, 67.85it/s]
Epoch 00008 | Average Loss 1.6055 | Accuracy 0.5814 | Time 0.4923
22it [00:00, 60.42it/s]
Validating...
8it [00:00, 71.80it/s]
Epoch 00009 | Average Loss 1.5681 | Accuracy 0.5878 | Time 0.4783
Testing...
12it [00:00, 82.86it/s]
Test Accuracy 0.5271
New:
$ python /home/ubuntu/dgl/examples/multigpu/graphbolt/node_classification.py --gpu 0,1,2,3 --dataset ogbn-arxiv
Training with 4 gpus.
The dataset is already preprocessed.
Training...
22it [00:01, 18.31it/s]
Validating...
8it [00:00, 54.37it/s]
Epoch 00000 | Average Loss 3.1735 | Accuracy 0.2941 | Time 1.3790
22it [00:00, 58.89it/s]
Validating...
8it [00:00, 78.07it/s]
Epoch 00001 | Average Loss 2.4895 | Accuracy 0.4520 | Time 0.4908
22it [00:00, 56.94it/s]
Validating...
8it [00:00, 73.67it/s]
Epoch 00002 | Average Loss 2.1515 | Accuracy 0.5135 | Time 0.5007
22it [00:00, 54.02it/s]
Validating...
8it [00:00, 69.11it/s]
Epoch 00003 | Average Loss 1.9372 | Accuracy 0.5381 | Time 0.5256
22it [00:00, 56.69it/s]
Validating...
8it [00:00, 70.72it/s]
Epoch 00004 | Average Loss 1.8119 | Accuracy 0.5560 | Time 0.5067
22it [00:00, 39.94it/s]
Validating...
8it [00:00, 74.97it/s]
Epoch 00005 | Average Loss 1.7279 | Accuracy 0.5639 | Time 0.6646
22it [00:00, 56.77it/s]
Validating...
8it [00:00, 79.99it/s]
Epoch 00006 | Average Loss 1.6723 | Accuracy 0.5734 | Time 0.4928
22it [00:00, 60.43it/s]
Validating...
8it [00:00, 71.34it/s]
Epoch 00007 | Average Loss 1.6253 | Accuracy 0.5817 | Time 0.4789
22it [00:00, 58.53it/s]
Validating...
8it [00:00, 91.09it/s]
Epoch 00008 | Average Loss 1.5881 | Accuracy 0.5844 | Time 0.4690
22it [00:00, 56.57it/s]
Validating...
8it [00:00, 77.58it/s]
Epoch 00009 | Average Loss 1.5577 | Accuracy 0.5878 | Time 0.4972
Testing...
12it [00:00, 88.09it/s]
Test Accuracy 0.5279
Old:
$ python /home/ubuntu/dgl/examples/multigpu/graphbolt/node_classification.py --gpu 0,1,2,3 --dataset ogbn-papers100M
Training with 4 gpus.
The dataset is already preprocessed.
Training...
294it [00:22, 13.15it/s]
Validating...
31it [00:02, 14.12it/s]
Epoch 00000 | Average Loss 1.9491 | Accuracy 0.5924 | Time 24.7810
294it [00:21, 13.65it/s]
Validating...
31it [00:02, 14.54it/s]
Epoch 00001 | Average Loss 1.3033 | Accuracy 0.6245 | Time 23.8770
294it [00:21, 13.64it/s]
Validating...
31it [00:02, 14.58it/s]
Epoch 00002 | Average Loss 1.2215 | Accuracy 0.6469 | Time 23.8830
294it [00:21, 13.65it/s]
Validating...
31it [00:02, 14.56it/s]
Epoch 00003 | Average Loss 1.1796 | Accuracy 0.6448 | Time 23.8804
294it [00:21, 13.65it/s]
Validating...
31it [00:02, 14.58it/s]
Epoch 00004 | Average Loss 1.1523 | Accuracy 0.6533 | Time 23.8787
294it [00:21, 13.58it/s]
Validating...
31it [00:02, 14.54it/s]
Epoch 00005 | Average Loss 1.1338 | Accuracy 0.6464 | Time 23.9888
294it [00:21, 13.64it/s]
Validating...
31it [00:02, 14.55it/s]
Epoch 00006 | Average Loss 1.1200 | Accuracy 0.6503 | Time 23.8843
294it [00:21, 13.64it/s]
Validating...
31it [00:02, 14.52it/s]
Epoch 00007 | Average Loss 1.1080 | Accuracy 0.6569 | Time 23.8870
294it [00:21, 13.64it/s]
Validating...
31it [00:02, 14.53it/s]
Epoch 00008 | Average Loss 1.0979 | Accuracy 0.6615 | Time 23.8950
294it [00:21, 13.65it/s]
Validating...
31it [00:02, 14.53it/s]
Epoch 00009 | Average Loss 1.0894 | Accuracy 0.6603 | Time 23.8899
Testing...
53it [00:03, 14.50it/s]
Test Accuracy 0.6318
New:
$ python /home/ubuntu/dgl/examples/multigpu/graphbolt/node_classification.py --gpu 0,1,2,3 --dataset ogbn-papers100M
Training with 4 gpus.
The dataset is already preprocessed.
Training...
294it [00:21, 13.69it/s]
Validating...
31it [00:02, 14.19it/s]
Epoch 00000 | Average Loss 1.9418 | Accuracy 0.5957 | Time 23.8790
294it [00:20, 14.18it/s]
Validating...
31it [00:02, 14.65it/s]
Epoch 00001 | Average Loss 1.3039 | Accuracy 0.6233 | Time 23.0518
294it [00:20, 14.19it/s]
Validating...
31it [00:02, 14.57it/s]
Epoch 00002 | Average Loss 1.2206 | Accuracy 0.6458 | Time 23.0501
294it [00:20, 14.18it/s]
Validating...
31it [00:02, 14.62it/s]
Epoch 00003 | Average Loss 1.1800 | Accuracy 0.6493 | Time 23.0555
294it [00:20, 14.17it/s]
Validating...
31it [00:02, 14.54it/s]
Epoch 00004 | Average Loss 1.1533 | Accuracy 0.6571 | Time 23.0787
294it [00:20, 14.11it/s]
Validating...
31it [00:02, 14.58it/s]
Epoch 00005 | Average Loss 1.1354 | Accuracy 0.6563 | Time 23.1551
294it [00:20, 14.19it/s]
Validating...
31it [00:02, 14.56it/s]
Epoch 00006 | Average Loss 1.1197 | Accuracy 0.6585 | Time 23.0504
294it [00:20, 14.18it/s]
Validating...
31it [00:02, 14.57it/s]
Epoch 00007 | Average Loss 1.1088 | Accuracy 0.6571 | Time 23.0587
294it [00:20, 14.21it/s]
Validating...
31it [00:02, 14.53it/s]
Epoch 00008 | Average Loss 1.0991 | Accuracy 0.6616 | Time 23.0182
294it [00:20, 14.20it/s]
Validating...
31it [00:02, 14.57it/s]
Epoch 00009 | Average Loss 1.0909 | Accuracy 0.6632 | Time 23.0365
Testing...
53it [00:03, 14.53it/s]
Test Accuracy 0.6337
Tested on g4dn.metal.
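The per-epoch times in the logs above can be compared with a short script. The sketch below parses the `Epoch ... | Time ...` lines of the two ogbn-papers100M runs (copied from the logs) and reports the relative change in mean epoch time. The helper names `epoch_times` and `relative_change` are illustrative, and a single run per variant is not a statistically rigorous comparison.

```python
import re
from statistics import mean

# Matches lines such as:
# "Epoch 00003 | Average Loss 1.1796 | Accuracy 0.6448 | Time 23.8804"
EPOCH_TIME = re.compile(r"Epoch \d+ \|.*\| Time ([\d.]+)")


def epoch_times(log):
    """Extract the per-epoch wall-clock times (seconds) from a training log."""
    return [float(t) for t in EPOCH_TIME.findall(log)]


def relative_change(old, new):
    """Relative change of the mean of `new` with respect to the mean of `old`."""
    return (mean(new) - mean(old)) / mean(old)


def as_log(times):
    """Rebuild minimal 'Epoch' lines from a list of times (loss/accuracy elided)."""
    return "\n".join(
        f"Epoch {i:05d} | Average Loss - | Accuracy - | Time {t:.4f}"
        for i, t in enumerate(times)
    )


# Times copied from the ogbn-papers100M logs above.
OLD_LOG = as_log([24.7810, 23.8770, 23.8830, 23.8804, 23.8787,
                  23.9888, 23.8843, 23.8870, 23.8950, 23.8899])
NEW_LOG = as_log([23.8790, 23.0518, 23.0501, 23.0555, 23.0787,
                  23.1551, 23.0504, 23.0587, 23.0182, 23.0365])

if __name__ == "__main__":
    old, new = epoch_times(OLD_LOG), epoch_times(NEW_LOG)
    print(f"old mean: {mean(old):.4f} s, new mean: {mean(new):.4f} s")
    print(f"relative change: {relative_change(old, new):+.2%}")
    # Skipping epoch 0 removes warm-up effects from the comparison.
    print(f"relative change without epoch 0: {relative_change(old[1:], new[1:]):+.2%}")
```

On these numbers the new code is about 3.5% faster per epoch on ogbn-papers100M, with or without the first epoch. The same functions can be pointed at the full pasted logs, since the regular expression ignores the progress-bar lines.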
@Rhett-Ying The random seed issue has been resolved. What a relief that torch.distributed has convenient communication APIs.
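For readers unfamiliar with the mechanism, one way to share a seed across replicas with torch.distributed is a broadcast from the main process. The sketch below is an assumption about how this could look, not the code merged in this PR; `shared_seed` is a hypothetical helper name. It falls back to a locally drawn seed when no process group has been initialized.

```python
import torch
import torch.distributed as dist


def shared_seed(src=0, device="cpu"):
    """Return a seed that is identical on every rank (hypothetical helper)."""
    # Every rank draws a candidate value; only the one on rank `src` survives.
    seed = torch.randint(0, 2**31 - 1, (1,), dtype=torch.int64, device=device)
    if dist.is_available() and dist.is_initialized():
        # In-place collective: overwrites `seed` on all ranks with src's value.
        dist.broadcast(seed, src=src)
    return int(seed.item())


if __name__ == "__main__":
    # Without an initialized process group this simply prints a local seed.
    print(shared_seed())
```

With the NCCL backend the tensor has to live on the rank's GPU (for example `device="cuda"` after `torch.cuda.set_device(rank)`), while the Gloo backend accepts CPU tensors.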
This POC works well in terms of both correctness and performance. Now it's time to finalize the code change.
- Is it possible to update the existing ItemSampler instead of creating a new class? It seems the major part is fixing the seed?
- Is it possible to split the changes to ItemSampler and ItemSet/Dict to make each change as small as possible for quick review?
I'm afraid the change to ItemSet/Dict cannot be separated because the new ItemSampler takes it as input, so we have to modify them simultaneously. For the sake of code review, I think we can divide this PR into two. The first adds ItemSet/Dict4 but leaves the old ItemSetDict unchanged; the second updates the existing ItemSampler and replaces the old ItemSetDict with the new one. If this is what you envision, I can get started on it right away.
Sounds good to me.