acbull / pyHGT

Code for "Heterogeneous Graph Transformer" (WWW'20), which is based on pytorch_geometric
MIT License

Unable to reproduce OGBN-MAG results #26

Closed lingfanyu closed 3 years ago

lingfanyu commented 3 years ago

Hi HGT authors,

I am not able to reproduce your OGB leaderboard results. I followed your instructions and ran your latest code (commit 9c2182f) 10 times, getting an average test accuracy of 0.4883 with std 0.0053.

The test accuracies of the 10 runs are: 0.4852, 0.479, 0.4935, 0.4906, 0.496, 0.4911, 0.4912, 0.4861, 0.4889, 0.4817
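As a quick sanity check (this snippet is an illustration, not part of the repo), the mean and std can be recomputed directly from those ten numbers:

```python
import statistics

# Test accuracies of the 10 runs, as listed above.
accs = [0.4852, 0.479, 0.4935, 0.4906, 0.496,
        0.4911, 0.4912, 0.4861, 0.4889, 0.4817]

mean = statistics.mean(accs)   # arithmetic mean
std = statistics.stdev(accs)   # sample standard deviation
print(f"{mean:.4f} +/- {std:.4f}")  # prints "0.4883 +/- 0.0053"
```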

The ogb version I used is 1.2.1, and I made sure evaluation uses variance_reduce for better performance. The commands I used to run your code are the following:

python3 preprocess_ogbn_mag.py --output_dir OGB_MAG.pk
for ((run=0;run<10;run=run+1))
do
        dir_name=model_save_${run}
        python3 train_ogbn_mag.py --n_hid 512 --n_layer 4 --n_heads 8 \
                --data_dir ./OGB_MAG.pk --model_dir $dir_name \
                --prev_norm --last_norm --use_RTE --conv_name hgt
        python3 eval_ogbn_mag.py --n_hid 512 --n_layer 4 --n_heads 8 \
                --data_dir ./OGB_MAG.pk --model_dir ${dir_name} \
                --prev_norm --last_norm --use_RTE --conv_name hgt
done

Could you let me know if there is anything I missed?

Thanks! @acbull

acbull commented 3 years ago

Hi:

Could you share a plot of your training log so that I can compare it with mine and see what the difference might be?

lingfanyu commented 3 years ago

I saved the running logs using the tee command. Maybe you can look into these logs directly or generate your plot from them.

I put the logs of last five runs on google drive: https://drive.google.com/drive/folders/1d1p8tv6_4qoQtaGsx0sS-Ne6QMJgXfFg

acbull commented 3 years ago

Hi:

This seems odd to me, as the convergence in your log is slower than mine. Could you change n_epoch to 200 and see whether the performance improves?

lingfanyu commented 3 years ago

OK. I tried training for 200 epochs (only one run), and the performance did not improve (test acc: 0.488329). Training log here: https://drive.google.com/file/d/1U7Pv_L6S_9g0e9Y-a8C0ImrsTpAbF6ay/view?usp=sharing

Have you also tried to reproduce your OGBN-MAG results on your side?

lingfanyu commented 3 years ago

Hi @acbull , any update on how to reproduce MAG results?

acbull commented 3 years ago

I'm running the code again right now and will give you an update then.

lingfanyu commented 3 years ago

Great! Looking forward to hearing back from you soon!

lingfanyu commented 3 years ago

Hi @acbull ,

I am working on a model for heterogeneous graphs, and I want to submit it to OGB and compare it with your HGT. We featurize node types that don't have input features (like author, topic, and affiliation) with pre-trained embeddings such as TransE or Metapath2vec (similar to what you mentioned in the paper for the OAG dataset). So for a fair comparison, I also trained HGT on MAG with pre-trained embeddings as input.

Since I am not able to reproduce your OGB leaderboard results with your official implementation (reproduced test acc: 0.4883), the most relevant number is that HGT with pre-trained embeddings as input reaches a test accuracy of 0.4980.

So can I just submit my results to the OGB leaderboard? I will let the OGB maintainers know that you still need time to reproduce your results, and once you figure out what went wrong, I will be happy to try your new code and update my results.

Thanks!

acbull commented 3 years ago

Hi lingfan:

Sorry for the late response.

For the run using exactly the same setting as mine, please wait about 1-2 days; I'll get back to you then.

lingfanyu commented 3 years ago

Sure. Thanks!

acbull commented 3 years ago

Hi Lingfan:

Sorry for the late reply and the trouble.

The batch_size in eval_ogbn_mag should also be set to 128 (the same as in train; or alternatively change the train batch size to 256).

I've re-run the whole framework once, and the test accuracy is 0.509.

Also, if you have more GPU memory, you can increase the sample_width to get better results.

For the HGT model with TransE embedding input, if you follow a similar setting, you are welcome to submit the results.

Please let me know if you have any further questions.

lingfanyu commented 3 years ago

> The batch_size in eval_ogbn_mag should also be set to 128 (the same as in train; or alternatively change the train batch size to 256).

I actually noticed this batch_size difference and thought about this possibility before. So I tried it, but the accuracy did not improve on my side.

Could you please share the detailed commands you used to re-run the whole framework (including preprocessing the dataset, training, and evaluation)? I will verify it again on my side.

Thanks a ton!

lingfanyu commented 3 years ago

@acbull If you want to look at training and testing logs, they are here: https://drive.google.com/drive/folders/1K013flATT04EbchTEwbjx7Q_QmHQHDw0?usp=sharing

Files ending with _stdout.txt are training logs and files ending with _eval.txt are evaluation logs. The bz in each file name indicates the batch size, which can also be found at the beginning of each log file, since your code prints out its configuration.
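If it helps, the accuracies could also be pulled out of such eval logs automatically and summarized in one go. A rough sketch (the "Test Acc" line format here is a hypothetical assumption, so the regex would need adjusting to whatever eval_ogbn_mag.py actually prints):

```python
import re
import statistics

# Hypothetical log format: each eval log is assumed to contain a line
# like "Test Acc: 0.4883"; adjust the pattern to the real output.
ACC_RE = re.compile(r"Test Acc:\s*([0-9.]+)")

def summarize(log_texts):
    """Extract one test accuracy per log text and return (mean, std)."""
    accs = [float(ACC_RE.search(t).group(1)) for t in log_texts]
    return statistics.mean(accs), statistics.stdev(accs)

# Example with two fake log excerpts:
mean, std = summarize(["... Test Acc: 0.4852", "... Test Acc: 0.4935"])
```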

Let me know what detailed commands you used and I will see if I can reproduce your results.

acbull commented 3 years ago

The commands you executed are the same as mine.

I ran it multiple times; the result is indeed not that stable, and the average accuracy this time is lower than the one I reported (back then I only ran it three times). To make training and testing more stable, we can increase the sample_width. Can you try the following commands?

python3 train_ogbn_mag.py --n_hid 512 --n_layer 4 --n_heads 8 \
                --data_dir ./OGB_MAG.pk --model_dir $dir_name \
                --prev_norm --last_norm --use_RTE --conv_name hgt --sample_width 600 --sample_depth 6

python3 eval_ogbn_mag.py --n_hid 512 --n_layer 4 --n_heads 8 \
                --data_dir ./OGB_MAG.pk --model_dir $dir_name \
                --prev_norm --last_norm --use_RTE --conv_name hgt --sample_width 600 --sample_depth 6

lingfanyu commented 3 years ago

It seems my GPU memory (15GB) is too small, and as a result training fails. How much GPU memory is needed to run with --sample_width 600 --sample_depth 6? Which GPU do you use?

lingfanyu commented 3 years ago

I notice that on the OGB leaderboard you said the experiment was done on a Tesla K80 (12GB GPU). But my GPU has 15GB, and your code runs out of memory on it when I increase sample_width from 520 to 600 following your suggestion.

Is there anything I can do to reduce GPU memory usage?

tsy19025 commented 3 years ago

Hi, I have the same problem. If I follow your advice and run the following commands 10 times, the test accuracies with variance_reduce evaluation are: 0.491 0.485 0.486 0.482 0.488 0.487 0.486 0.487 0.488 0.485

python3 train_ogbn_mag.py --n_hid 512 --n_layer 4 --n_heads 8 --n_epoch 200 \
        --data_dir ./OGB_MAG.pk --model_dir $dir_name \
        --prev_norm --last_norm --use_RTE --conv_name hgt --sample_width 600 --sample_depth 6
python3 eval_ogbn_mag.py --n_hid 512 --n_layer 4 --n_heads 8 \
        --data_dir ./OGB_MAG.pk --model_dir $dir_name \
        --prev_norm --last_norm --use_RTE --conv_name hgt --sample_width 600 --sample_depth 6

The training log of the best run is here: https://drive.google.com/file/d/10lUs1AXJOKTlvQVZHJHBedwQSlN3lF0d/view?usp=sharing

Is there anything I can do to reproduce your result? Thanks.

acbull commented 3 years ago

Hi all:

Sorry for the trouble.

After you pointed this out, I ran the following commands on my side (with the script I mentioned in the issue) ten times, and the result is 0.4927 (mean) ± 0.0061 (std).

python3 train_ogbn_mag.py --n_hid 512 --n_layer 4 --n_heads 8 \
                --data_dir ./OGB_MAG.pk --model_dir $dir_name \
                --prev_norm --last_norm --use_RTE --conv_name hgt --sample_width 520 --sample_depth 6

python3 eval_ogbn_mag.py --n_hid 512 --n_layer 4 --n_heads 8 \
                --data_dir ./OGB_MAG.pk --model_dir $dir_name \
                --prev_norm --last_norm --use_RTE --conv_name hgt --sample_width 520 --sample_depth 6

I changed the eval script so the performance should be more stable.

I've submitted the corrected result to the OGB leaderboard. Please let me know if you still have any problems reproducing it.