Closed: geranim0 closed this issue 3 months ago
Hi @brodequin-slaps, thanks for reporting this. In the meantime, I will check that there are no reproducibility problems in our code.
P.S. We ran that experiment with cfg.torch_deterministic=False
cc: @belerico
Could it be the seed? I've also seen large variance in Hafner's results on some environments depending on the seed. Maybe we can try running another experiment with a different seed?
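Just to clarify what I mean by seed sensitivity: even when the same global seed is propagated everywhere, non-deterministic CUDA kernels can still make runs diverge when cfg.torch_deterministic=False. Here is a minimal sketch of the usual seeding routine (an illustration only, not necessarily what SheepRL does internally):

import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 5) -> None:
    # Seed every RNG that typically affects a run; torch.manual_seed also
    # seeds all visible CUDA devices.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)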
Also, which SheepRL version or commit are you using?
Hi @belerico ,
Using commit e8a68f33dac5684c2dc0659c31ff8999d58659c5
Since I'm using the same seed as upstream, it would make sense to me that the results obtained match those advertised (especially if run with cfg.torch_deterministic=True); that way, potential library users get early confidence that they can reproduce the advertised results. Maybe once my deterministic Pac-Man run finishes (it's about a 2x slowdown, so it should be done tomorrow), someone else could try it with the same seed (5) to see if it matches.
Will also try different seeds after that
Sure, I'll try to run other experiments. In the meantime, can you share your torch version and your CUDA driver version?
Thanks
(.venv) sam@sam:~/dev/ml/sheeprl$ python -c "import torch; print(torch.__version__)"
2.2.1+cu121 # Torch
(.venv) sam@sam:~/dev/ml/sheeprl$ nvidia-smi
Wed Mar 6 08:36:26 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.07 Driver Version: 537.34 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1080 On | 00000000:02:00.0 On | N/A |
| 49% 64C P0 152W / 200W | 7696MiB / 8192MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
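For completeness, the CUDA/cuDNN versions PyTorch itself was built against (which can differ from the driver-reported 12.2 above) can be checked with:

# Versions bundled with the installed PyTorch wheel, which may differ from
# the driver-reported CUDA version shown by nvidia-smi.
import torch

print(torch.__version__)                 # 2.2.1+cu121
print(torch.version.cuda)                # CUDA toolkit the wheel targets
print(torch.backends.cudnn.version())    # bundled cuDNN
print(torch.cuda.get_device_name(0))     # GTX 1080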
Hi @brodequin-slaps, can you try out this branch? You can try to run a reproducible experiment with the following settings:
python sheeprl.py \
exp=dreamer_v3_100k_ms_pacman \
fabric.devices=1 \
fabric.accelerator=cuda \
torch_use_deterministic_algorithms=True \
torch_backends_cudnn_benchmark=False \
torch_backends_cudnn_deterministic=True \
cublas_workspace_config=":4096:8"
where cublas_workspace_config=":4096:8" comes from here, while torch_use_deterministic_algorithms=True, torch_backends_cudnn_benchmark=False and torch_backends_cudnn_deterministic=True come from the PyTorch reproducibility page.
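For reference, those flags should map onto the following plain PyTorch calls (a sketch based on the PyTorch reproducibility page, not SheepRL's actual code path):

import os

# CUBLAS_WORKSPACE_CONFIG must be set before any CUDA/cuBLAS work runs.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch

torch.use_deterministic_algorithms(True)   # raise on non-deterministic ops
torch.backends.cudnn.benchmark = False     # no autotuning of conv algorithms
torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels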
While I haven't tried specifically with Dreamer-V3, I've run some simple and fast experiments with PPO:
[reward curves of the non-deterministic vs. deterministic PPO runs]
P.S. the scripts were run with:
python sheeprl.py exp=ppo fabric.devices=1 num_threads=4 algo.mlp_keys.encoder=\[\] algo.cnn_keys.encoder=\["rgb"\] fabric.accelerator=cuda
python sheeprl.py exp=ppo fabric.devices=1 num_threads=4 algo.mlp_keys.encoder=\[\] algo.cnn_keys.encoder=\["rgb"\] fabric.accelerator=cuda torch_use_deterministic_algorithms=True torch_backends_cudnn_benchmark=False torch_backends_cudnn_deterministic=True cublas_workspace_config=":4096:8"
Hi @belerico ,
Tried the fix/determinism branch with the deterministic command above, and locally the runs are deterministic.
However, they don't match your experiments; maybe something's different in our setups.
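A minimal sketch of how the two local runs were compared, assuming TensorBoard logging (the log paths and the scalar tag below are hypothetical placeholders):

# Compare the scalars logged by two runs to check they are bit-identical.
# Paths and tag are placeholders for the actual log directories.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

def load_scalar(logdir, tag):
    acc = EventAccumulator(logdir)
    acc.Reload()
    return [(e.step, e.value) for e in acc.Scalars(tag)]

run_a = load_scalar("logs/runs/dreamer_v3/run_a", "Rewards/rew_avg")
run_b = load_scalar("logs/runs/dreamer_v3/run_b", "Rewards/rew_avg")
print("identical:", run_a == run_b)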
Hi @brodequin-slaps, I don't think one can achieve perfect determinism on completely different hardware: https://discuss.pytorch.org/t/how-to-get-determistic-behavior-with-different-gpus/125640
Hi,
Ran the default main branch on the dreamer_v3_100k_ms_pacman experiment (seed 5), but could not reproduce the advertised rewards.
Advertised curve:
[reward curve from the published benchmarks]
When I run it locally with everything at defaults:
[reward curve from my local run]
Wondering what could explain the difference?
Edit: Found out about deterministic mode, which is disabled by default. Will update with the deterministic run results once it finishes.
Edit: Finished run:
[reward curve of the deterministic run]