Reproducing CSGO training config

eloialonso / diamond

DIAMOND (DIffusion As a Model Of eNvironment Dreams) is a reinforcement learning agent trained in a diffusion world model. NeurIPS 2024 Spotlight.

https://diamond-wm.github.io

MIT License

1.58k stars 102 forks source link

Reproducing CSGO training config #29

Closed kxhit closed 3 hours ago

kxhit commented 2 weeks ago

Hi thanks for this great work!

I'm using the CSGO branch to reproduce the results in the web demo.

In the paper, section J.3, "trained for 120k updates with a batch size of 64, on up to 4×A6000 GPUs. Each training run took between 1-2 days", while in the readme, "The provided configuration took 12 days on a RTX 4090.". Would you clarify the configurations for the demo checkpoint? And I assume the bs 64 is the global bs for 4 RX 4090 GPUs, no gradient accumulation.

Thank you!

AdamJelley commented 2 weeks ago

Hi @kxhit! The CSGO details in appendix M (previously J) are from early experiments with CSGO and do not correspond to the released demo model. The readme is correct that the released model was trained for 12 days on a RTX 4090. We provide our training code and config so you should be able to reproduce the training by following the training instructions using the default trainer config.

kxhit commented 2 weeks ago

Hi @AdamJelley thanks for the prompt reply!

However, the provided config is OOM on A6000 or GTX 4090, could you double check if denoiser batch size 64, grad_acc 2 will be successfully trained on a 24GB GTX4090? Thank you!

AdamJelley commented 4 days ago

Hi @kxhit , apologies for the delay in getting back to you on this one. We were able to reproduce your OOM error, so looked into it and found there was indeed an issue with the training config (since we originally trained the dynamics and upsampling models separately, but combined them together in the current training code). The latest commit (a4396eb) should fix this issue and enable you to train on a 24GB GPU. Could you confirm if that works for you?

kxhit commented 3 hours ago

Hi @AdamJelley thanks for your active reply!

Yes, I can confirm it can successfully train on my 24GB 4090 GPU. I will double-check the results later. Cheers!