RobDHess / Steerable-E3-GNN

E(3) Steerable Graph Neural Network
MIT License

Cannot reproduce performance on the nbody experiment #2

Closed: mouthful closed this issue 2 years ago

mouthful commented 2 years ago

Hi Rob,

I am trying to reproduce the results of the N-body system experiment. However, following the provided scripts for data creation and model training (charged), I obtained an MSE of $0.0856$ for the SEGNN model and $0.11576$ for the SEConv model, which are much higher than the values reported in the paper [https://arxiv.org/pdf/2110.02905.pdf], $0.0043$ and $0.0116$ respectively. Could you offer any advice on this?

  1. The commands I used:

     ```shell
     python3 -u generate_dataset.py --simulation=charged --num-train 10000 --seed 43 --suffix small
     python3 main.py --dataset=nbody --epochs=1000 --max_samples=3000 --model=segnn --lmax_h=1 --lmax_attr=1 --layers=4 --hidden_features=64 --subspace_type=weightbalanced --norm=none --batch_size=100 --gpu=1 --weight_decay=1e-12
     python3 main.py --dataset=nbody --epochs=1000 --max_samples=3000 --model=seconv --lmax_h=1 --lmax_attr=1 --layers=4 --hidden_features=80 --subspace_type=weightbalanced --conv_type=linear --norm=instance --batch_size=100 --gpu=1 --weight_decay=1e-12
     ```
  2. FYI: there is a small bug in the training code where some variables end up on incompatible devices.
  3. Here is a training snapshot of the SEConv model. It seems that both SEGNN and SEConv easily overfit the training data within a few dozen epochs. [screenshot of training curve]
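For reference, the device-mismatch bug mentioned above typically looks something like the following. This is a minimal hypothetical sketch, not the repo's actual code: a tensor created on the CPU during the training loop has to be moved to the model's device before it can interact with the model's parameters.

```python
import torch

# Hypothetical sketch of a device-mismatch fix (not the repo's actual code):
# tensors built on the CPU inside the training loop must be moved to the
# same device as the model's parameters before use.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

batch = torch.randn(100, 3)   # e.g. a freshly loaded batch, lives on the CPU
batch = batch.to(device)      # fix: move it to the model's device
```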
RobDHess commented 2 years ago

Hi,

Thanks for trying out the repo and reporting this. I believe I know what caused the problem: a change I made to the way e3nn normalises its tensor product. This happened while we were merging multiple codebases into the repo you see here today, and I fear that something went wrong during that merge which I was not careful enough to spot.

I will spend some time in the coming week trying to fix the code and will let you know once I'm confident that the N-body experiment can be reproduced.

mouthful commented 2 years ago

Thanks for your response! Looking forward to the fixed experimental results.

RobDHess commented 2 years ago

Hi!

I have looked at the problem and identified at least one mistake that had nothing to do with normalisation: I had forgotten a skip connection, so the network predicted the absolute position instead of the difference vector between the old and new positions.
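A minimal sketch of what such a skip connection looks like (hypothetical code, assuming the network outputs a per-node displacement; not the repo's exact implementation):

```python
import torch

def apply_position_skip(x_old: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    # Skip connection: treat the network output as a difference vector and
    # add it to the old positions, instead of predicting positions directly.
    return x_old + delta

x_old = torch.zeros(5, 3)       # old particle positions
delta = 0.1 * torch.ones(5, 3)  # network-predicted displacements
x_new = apply_position_skip(x_old, delta)
```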

This already brings performance much closer to what it should be: ~0.006 MSE within the first hundred epochs. I am currently running a full experiment (10,000 epochs) and will let you know how that goes. I will also compare the normalisation between the older and newer versions of e3nn, to make sure there is no change in performance due to this, and will update this thread if I find anything.

The problem of variables being on different devices has also been solved, so the script should now run fine on the GPU.

mouthful commented 2 years ago

Hi,

I have re-run the N-body experiment with the updated code and obtained an MSE of $0.00728$, which seems reasonable to me. The slight difference from the paper may be because I regenerated the data locally myself. Thanks again for your help.