Closed: claying closed this issue 6 months ago.

Dear authors,
Thank you for your interesting work! I tried to reproduce your results on the pokec dataset by running bash pokec.sh, but failed. I have attached my training log. Do you have any idea what might be going wrong?

Hi,
Thanks for your interest! Every script in our repo has been checked and should reproduce our results. I just ran bash pokec.sh again and it reached 86.06% test accuracy after 1287 epochs. Our training log is shown below:
Namespace(dataset='pokec', data_dir='./data/', device=0, seed=42, cpu=False, local_epochs=2000, global_epochs=0, batch_size=550000, runs=1, metric='acc', method='poly', hidden_channels=256, local_layers=7, global_layers=2, num_heads=1, beta=0.9, pre_ln=False, post_bn=True, local_attn=False, lr=0.0005, weight_decay=0.0, in_dropout=0.0, dropout=0.2, global_dropout=0.2, display_step=1, eval_step=9, eval_epoch=1000, save_model=False, model_dir='./model/', save_result=False)
pokec
dataset pokec | num nodes 1632803 | num edge 30622564 | num node feats 65 | num classes 2
MODEL: Polynormer(
(h_lins): ModuleList(
(0): Linear(in_features=65, out_features=256, bias=True)
(1-6): 6 x Linear(in_features=256, out_features=256, bias=True)
)
(local_convs): ModuleList(
(0): GCNConv(65, 256)
(1-6): 6 x GCNConv(256, 256)
)
(lins): ModuleList(
(0): Linear(in_features=65, out_features=256, bias=True)
(1-6): 6 x Linear(in_features=256, out_features=256, bias=True)
)
(lns): ModuleList(
(0-6): 7 x LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(post_bns): ModuleList(
(0-6): 7 x BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(lin_in): Linear(in_features=65, out_features=256, bias=True)
(ln): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(global_attn): GlobalAttn(
(h_lins): ModuleList(
(0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
)
(k_lins): ModuleList(
(0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
)
(v_lins): ModuleList(
(0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
)
(lns): ModuleList(
(0-1): 2 x LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(lin_out): Linear(in_features=256, out_features=256, bias=True)
)
(pred_local): Linear(in_features=256, out_features=2, bias=True)
(pred_global): Linear(in_features=256, out_features=2, bias=True)
)
Epoch: 1008, Loss: 0.4173, Train: 84.55%, Valid: 84.16%, Test: 84.22%, Best Valid: 84.16%, Best Test: 84.22%
Epoch: 1017, Loss: 0.4160, Train: 85.87%, Valid: 85.40%, Test: 85.44%, Best Valid: 85.40%, Best Test: 85.44%
Epoch: 1026, Loss: 0.4164, Train: 86.19%, Valid: 85.77%, Test: 85.79%, Best Valid: 85.77%, Best Test: 85.79%
Epoch: 1035, Loss: 0.4159, Train: 85.12%, Valid: 84.77%, Test: 84.76%, Best Valid: 85.77%, Best Test: 85.79%
Epoch: 1044, Loss: 0.4152, Train: 85.74%, Valid: 85.35%, Test: 85.36%, Best Valid: 85.77%, Best Test: 85.79%
Epoch: 1053, Loss: 0.4155, Train: 84.24%, Valid: 83.78%, Test: 83.77%, Best Valid: 85.77%, Best Test: 85.79%
Epoch: 1062, Loss: 0.4154, Train: 85.99%, Valid: 85.52%, Test: 85.58%, Best Valid: 85.77%, Best Test: 85.79%
Epoch: 1071, Loss: 0.4155, Train: 85.85%, Valid: 85.41%, Test: 85.40%, Best Valid: 85.77%, Best Test: 85.79%
Epoch: 1080, Loss: 0.4148, Train: 85.74%, Valid: 85.29%, Test: 85.37%, Best Valid: 85.77%, Best Test: 85.79%
Epoch: 1089, Loss: 0.4152, Train: 84.59%, Valid: 84.15%, Test: 84.08%, Best Valid: 85.77%, Best Test: 85.79%
Epoch: 1098, Loss: 0.4140, Train: 85.62%, Valid: 85.13%, Test: 85.10%, Best Valid: 85.77%, Best Test: 85.79%
Epoch: 1107, Loss: 0.4138, Train: 85.80%, Valid: 85.33%, Test: 85.31%, Best Valid: 85.77%, Best Test: 85.79%
Epoch: 1116, Loss: 0.4138, Train: 85.09%, Valid: 84.60%, Test: 84.57%, Best Valid: 85.77%, Best Test: 85.79%
Epoch: 1125, Loss: 0.4139, Train: 86.39%, Valid: 85.92%, Test: 85.96%, Best Valid: 85.92%, Best Test: 85.96%
Epoch: 1134, Loss: 0.4131, Train: 85.99%, Valid: 85.55%, Test: 85.55%, Best Valid: 85.92%, Best Test: 85.96%
Epoch: 1143, Loss: 0.4133, Train: 85.77%, Valid: 85.29%, Test: 85.26%, Best Valid: 85.92%, Best Test: 85.96%
Epoch: 1152, Loss: 0.4135, Train: 85.32%, Valid: 84.77%, Test: 84.75%, Best Valid: 85.92%, Best Test: 85.96%
Epoch: 1161, Loss: 0.4130, Train: 85.93%, Valid: 85.42%, Test: 85.42%, Best Valid: 85.92%, Best Test: 85.96%
Epoch: 1170, Loss: 0.4133, Train: 86.28%, Valid: 85.77%, Test: 85.77%, Best Valid: 85.92%, Best Test: 85.96%
Epoch: 1179, Loss: 0.4119, Train: 85.91%, Valid: 85.38%, Test: 85.37%, Best Valid: 85.92%, Best Test: 85.96%
Epoch: 1188, Loss: 0.4125, Train: 82.09%, Valid: 81.59%, Test: 81.58%, Best Valid: 85.92%, Best Test: 85.96%
Epoch: 1197, Loss: 0.4123, Train: 86.12%, Valid: 85.59%, Test: 85.58%, Best Valid: 85.92%, Best Test: 85.96%
Epoch: 1206, Loss: 0.4116, Train: 86.13%, Valid: 85.55%, Test: 85.59%, Best Valid: 85.92%, Best Test: 85.96%
Epoch: 1215, Loss: 0.4117, Train: 86.16%, Valid: 85.64%, Test: 85.59%, Best Valid: 85.92%, Best Test: 85.96%
Epoch: 1224, Loss: 0.4113, Train: 86.21%, Valid: 85.64%, Test: 85.65%, Best Valid: 85.92%, Best Test: 85.96%
Epoch: 1233, Loss: 0.4108, Train: 85.44%, Valid: 84.87%, Test: 84.85%, Best Valid: 85.92%, Best Test: 85.96%
Epoch: 1242, Loss: 0.4104, Train: 86.04%, Valid: 85.48%, Test: 85.44%, Best Valid: 85.92%, Best Test: 85.96%
Epoch: 1251, Loss: 0.4109, Train: 85.82%, Valid: 85.24%, Test: 85.23%, Best Valid: 85.92%, Best Test: 85.96%
Epoch: 1260, Loss: 0.4118, Train: 86.04%, Valid: 85.45%, Test: 85.42%, Best Valid: 85.92%, Best Test: 85.96%
Epoch: 1269, Loss: 0.4108, Train: 85.72%, Valid: 85.12%, Test: 85.11%, Best Valid: 85.92%, Best Test: 85.96%
Epoch: 1278, Loss: 0.4106, Train: 86.60%, Valid: 85.95%, Test: 85.95%, Best Valid: 85.95%, Best Test: 85.95%
Epoch: 1287, Loss: 0.4106, Train: 86.68%, Valid: 86.00%, Test: 86.06%, Best Valid: 86.00%, Best Test: 86.06%
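For readers parsing the model printout above: the (0) / (1-6) split in each ModuleList is the usual pattern of one input-to-hidden layer followed by hidden-to-hidden layers. A generic sketch that reproduces the printed shapes (not the repo's actual code) would look like:

```python
# Generic construction matching the printed local_convs shapes:
# (0): GCNConv(65, 256), (1-6): 6 x GCNConv(256, 256).
import torch.nn as nn
from torch_geometric.nn import GCNConv

in_channels, hidden_channels, local_layers = 65, 256, 7  # values from the Namespace above
local_convs = nn.ModuleList(
    [GCNConv(in_channels, hidden_channels)]
    + [GCNConv(hidden_channels, hidden_channels) for _ in range(local_layers - 1)]
)
```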
According to your training log, the model seems to be converging slowly and has not fully converged even after 2000 epochs, which we have never observed on our side. This is definitely something we need to investigate. Could you try a different random seed and see whether you still get the same result? By the way, can you reproduce the results on ogbn-products?
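For context, the random seed is exposed as the seed field in the Namespace printed above, so a different value can simply be passed on the command line. Internally, training scripts of this kind usually pin randomness with a helper along these lines (a generic sketch, not necessarily the repo's exact code):

```python
# Typical seeding helper: pins the Python, NumPy, and (CUDA) PyTorch RNGs.
import random

import numpy as np
import torch

def fix_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
```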
Hi @Chenhui1016,
Thank you for your fast response! I still obtained the same accuracy with a different random seed. I also failed to reproduce the results on ogbn-products (I only reached a test accuracy of 61.47%). However, I managed to reproduce the results on smaller datasets such as roman-empire and amazon-ratings. In addition, I noticed that global_epochs is set to 0 in pokec.sh rather than 500 as listed in your paper.
The major difference in my setup is that I used an H100 GPU (80 GB) rather than an A6000. I don't know whether that could cause such a large gap in performance. Could you test your code on an H100 or A100 (probably with a smaller batch size)?
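As a side note, whether a smaller batch size is needed depends mainly on available device memory (the run above uses batch_size=550000). The memory on the card in use can be checked with a generic call like:

```python
# Report the name and total memory of the first visible GPU.
import torch

props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB")
```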
The global_epochs setting shouldn't be the root cause, and I don't believe different GPUs should make such a big difference either. Have you created a new conda environment and installed all required packages (with their specified versions) following our instructions? Could you please let me know your torch_geometric version?
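For anyone hitting the same question, the relevant versions can be printed with a short generic snippet (nothing repo-specific assumed):

```python
# Print the PyTorch / CUDA-build / PyG versions and the detected GPU.
import torch
import torch_geometric

print("torch:", torch.__version__)
print("torch CUDA build:", torch.version.cuda)
print("torch_geometric:", torch_geometric.__version__)
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```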
There was a compatibility issue with CUDA 11.7 on the H100, so I used PyTorch built with CUDA 11.8. Otherwise, I used the same version of PyG (2.3) as in your README.
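On the H100 point: as far as I know, PyTorch's CUDA 11.8 binaries were the first to ship sm_90 (Hopper) kernels, which would explain the trouble with the CUDA 11.7 build. Whether the installed binary actually supports the local GPU can be confirmed with a generic check like:

```python
# Compare the GPU's compute capability (H100 reports sm_90) against the
# architectures the installed PyTorch binary was compiled for.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"device arch: sm_{major}{minor}")
print("binary supports:", torch.cuda.get_arch_list())
```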
I created a new environment and reinstalled everything, and now I obtain the same results as yours. I will check the differences between my old and new environments and get back to you if I figure out the reason. Thanks for your help!