dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

[GraphBolt] ogbn-arxiv accuracy values are lower than expected. #7523

Closed. mfbalin closed this issue 1 month ago.

mfbalin commented 2 months ago

🐛 Bug

When we run our examples with the ogbn-arxiv BuiltinDataset, the accuracy numbers we get are far below those of the multi-GPU DGL example.
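
For reference, the GraphBolt examples obtain the data roughly as follows (a minimal sketch, assuming the usual BuiltinDataset loading pattern; the example linked below is the authoritative version):

```python
# Minimal sketch of the GraphBolt data-loading pattern used by the examples.
# Assumption: this mirrors examples/graphbolt/node_classification.py.
import dgl.graphbolt as gb

dataset = gb.BuiltinDataset("ogbn-arxiv").load()
graph = dataset.graph        # FusedCSCSamplingGraph used for sampling
feature = dataset.feature    # node feature store
task = dataset.tasks[0]      # node classification task
train_set = task.train_set   # item set of seed nodes and labels
```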

To Reproduce

Steps to reproduce the behavior:

  1. Run any GraphBolt example with ogbn-arxiv, such as https://github.com/dmlc/dgl/blob/master/examples/graphbolt/node_classification.py
(venv) mfbalin@BALIN-PC:~/dgl-1/examples/graphbolt/pyg/labor$ time python ../../node_classification.py --mode=cuda-cuda --dataset=ogbn-arxiv
Training in cuda-cuda mode.
Loading data...
The dataset is already preprocessed.
/home/mfbalin/dgl-1/python/dgl/graphbolt/impl/ondisk_dataset.py:851: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  return torch.load(graph_topology.path)
Training...
Training: 89it [00:01, 71.85it/s]
Evaluating: 30it [00:00, 144.78it/s]
Epoch 00000 | Loss 2.5585 | Accuracy 0.5188 | Time 1.2410
Training: 89it [00:00, 91.86it/s]
Evaluating: 30it [00:00, 130.60it/s]
Epoch 00001 | Loss 1.8557 | Accuracy 0.5603 | Time 0.9714
Training: 89it [00:01, 83.19it/s]
Evaluating: 30it [00:00, 121.73it/s]
Epoch 00002 | Loss 1.6663 | Accuracy 0.5764 | Time 1.0715
Training: 89it [00:00, 90.03it/s]
Evaluating: 30it [00:00, 115.70it/s]
Epoch 00003 | Loss 1.5785 | Accuracy 0.5874 | Time 0.9902
Training: 89it [00:01, 83.63it/s]
Evaluating: 30it [00:00, 90.29it/s]
Epoch 00004 | Loss 1.5190 | Accuracy 0.5892 | Time 1.0660
Training: 89it [00:01, 83.89it/s]
Evaluating: 30it [00:00, 111.41it/s]
Epoch 00005 | Loss 1.4832 | Accuracy 0.5985 | Time 1.0627
Training: 89it [00:01, 85.82it/s]
Evaluating: 30it [00:00, 104.45it/s]
Epoch 00006 | Loss 1.4569 | Accuracy 0.5970 | Time 1.0386
Training: 89it [00:01, 86.15it/s]
Evaluating: 30it [00:00, 114.82it/s]
Epoch 00007 | Loss 1.4325 | Accuracy 0.6034 | Time 1.0348
Training: 89it [00:01, 81.71it/s]
Evaluating: 30it [00:00, 108.07it/s]
Epoch 00008 | Loss 1.4173 | Accuracy 0.6041 | Time 1.0909
Training: 89it [00:01, 84.54it/s]
Evaluating: 30it [00:00, 129.23it/s]
Epoch 00009 | Loss 1.3985 | Accuracy 0.6016 | Time 1.0544
Testing...
0it [00:00, ?it/s]/home/mfbalin/dgl-1/python/dgl/graphbolt/itemset.py:181: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  return torch.tensor(index, dtype=dtype)
42it [00:00, 336.81it/s]
42it [00:00, 255.59it/s]
42it [00:00, 222.38it/s]
Test accuracy 0.5377

DGL comparison from the regression benchmarks:

Test name: `multi_gpu.bench_dgl_multigpu_node_classification.track_acc`

| Dataset | Mode | GPUs | Accuracy |
| --- | --- | --- | --- |
| ogbn-products | cpu-cuda | 0 | 77.79 |
| ogbn-products | cpu-cuda | 0,1 | 77.02 |
| ogbn-products | cpu-cuda | 0,1,2,3 | 75.07 |
| ogbn-products | cpu-cuda | 0,1,2,3,4,5,6,7 | 73.19 |
| ogbn-arxiv | cpu-cuda | 0 | 70.11 |
| ogbn-arxiv | cpu-cuda | 0,1 | 69.23 |
| ogbn-arxiv | cpu-cuda | 0,1,2,3 | 69.05 |
| ogbn-arxiv | cpu-cuda | 0,1,2,3,4,5,6,7 | 67.19 |

(venv) mfbalin@BALIN-PC:~/dgl-1/examples/graphbolt/pyg/labor$ python ../../../multigpu/node_classification_sage.py --dataset_name=ogbn-arxiv
Training in mixed mode using 1 GPU(s)
Loading data
Downloading http://snap.stanford.edu/ogb/data/nodeproppred/arxiv.zip
Downloaded 0.08 GB: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 81/81 [00:17<00:00,  4.75it/s]
Extracting dataset/arxiv.zip
Loading necessary files...
This might take a while.
Processing graphs...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 39945.75it/s]
Converting graphs into DGL objects...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 79.29it/s]
Saving...
Training...
Epoch 00000 | Loss 2.2141 | Accuracy 0.6206 | Time 1.5078
Epoch 00001 | Loss 1.4576 | Accuracy 0.6506 | Time 1.1546
Epoch 00002 | Loss 1.3142 | Accuracy 0.6657 | Time 1.0753
Epoch 00003 | Loss 1.2447 | Accuracy 0.6721 | Time 1.1954
Epoch 00004 | Loss 1.1961 | Accuracy 0.6759 | Time 1.2414
Epoch 00005 | Loss 1.1625 | Accuracy 0.6880 | Time 1.2093
Epoch 00006 | Loss 1.1361 | Accuracy 0.6856 | Time 1.1644
Epoch 00007 | Loss 1.1214 | Accuracy 0.6880 | Time 1.2477
Epoch 00008 | Loss 1.1076 | Accuracy 0.6910 | Time 1.2267
Epoch 00009 | Loss 1.0930 | Accuracy 0.6868 | Time 1.1899
Testing...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 166/166 [00:00<00:00, 367.04it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 166/166 [00:00<00:00, 398.72it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 166/166 [00:00<00:00, 270.22it/s]
Test accuracy 0.6786

Expected behavior

A test accuracy close to 70% is expected, while the current accuracy is below 55%.


mfbalin commented 2 months ago

@frozenbugs

mfbalin commented 2 months ago

@Rhett-Ying

Rhett-Ying commented 2 months ago

@mfbalin thanks for reporting this.

@az15240 could you help look into this? Please try to reproduce it locally first.

Rhett-Ying commented 2 months ago

@mfbalin which examples exactly? Please list the single-GPU one and the multi-GPU one in the description.

mfbalin commented 2 months ago

Updated the description

Rhett-Ying commented 2 months ago

@az15240 This could be the most likely culprit. Please check whether GraphBolt's BuiltinDataset preprocesses the graph in the same way as DGL.

https://github.com/dmlc/dgl/blob/1ac2da051e8a3086af8e9f6d8c3212bab52c8abe/examples/multigpu/node_classification_sage.py#L371-L373
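
For context, the linked lines cover the graph preprocessing in the DGL example. A minimal sketch of that step, assuming the standard DGL helpers for reverse edges and self-loops (the linked file is authoritative):

```python
# Sketch of the preprocessing the linked DGL example applies to ogbn-arxiv.
# Assumption: inferred from the linked lines and the fix described later in
# this thread; see the linked file for the exact code.
import dgl
from ogb.nodeproppred import DglNodePropPredDataset

graph, labels = DglNodePropPredDataset("ogbn-arxiv")[0]
# ogbn-arxiv is a directed citation graph; add reverse edges so messages
# flow in both directions, then add self-loops so each node also sees its
# own features during aggregation.
graph = dgl.add_reverse_edges(graph)
graph = dgl.add_self_loop(graph)
```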

az15240 commented 1 month ago

> @az15240 This could be the most likely culprit. Please check whether GraphBolt's BuiltinDataset preprocesses the graph in the same way as DGL.
>
> https://github.com/dmlc/dgl/blob/1ac2da051e8a3086af8e9f6d8c3212bab52c8abe/examples/multigpu/node_classification_sage.py#L371-L373

This is very likely the cause. I'll update the dataset.

az15240 commented 1 month ago

The dataset has been updated. Running `python examples/graphbolt/node_classification.py --dataset=ogbn-arxiv` now produces an accuracy close to 70%. Please follow up with any questions!

mfbalin commented 1 month ago

> The dataset has been updated. Running `python examples/graphbolt/node_classification.py --dataset=ogbn-arxiv` now produces an accuracy close to 70%. Please follow up with any questions!

Can you describe what changes you made compared to the previous dataset? @az15240

az15240 commented 1 month ago

> @az15240 This could be the most likely culprit. Please check whether GraphBolt's BuiltinDataset preprocesses the graph in the same way as DGL.
>
> https://github.com/dmlc/dgl/blob/1ac2da051e8a3086af8e9f6d8c3212bab52c8abe/examples/multigpu/node_classification_sage.py#L371-L373

I added bidirectional edges and self-loops to the GraphBolt dataset, as suggested in the reply quoted above.
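
As a quick way to sanity-check the change (a sketch, assuming the FusedCSCSamplingGraph attribute names; the node and edge counts are from the published OGB statistics for ogbn-arxiv):

```python
# After adding reverse edges and self-loops, the edge count should be
# roughly 2 * original_edges + num_nodes.
import dgl.graphbolt as gb

graph = gb.BuiltinDataset("ogbn-arxiv").load().graph
print(graph.total_num_nodes, graph.total_num_edges)
# ogbn-arxiv has 169,343 nodes and 1,166,243 directed edges, so the
# updated graph should have about 2 * 1,166,243 + 169,343 edges.
```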