Mmmm, that's weird. I use the default configuration in config.yaml for Atari games. On my end it uses around 5 GB of CUDA memory with fp16, and around 7 GB with fp32 (--precision 32).
Did you get that error at the beginning of training, or was it a memory leak over time? Otherwise, what I would try is making sure pytorch/cuda are on the latest versions. Maybe that would help?
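For reference, the fp16 saving mentioned above comes from the usual torch.cuda.amp mixed-precision pattern, which the scaler.scale(loss).backward() call in the traceback further down suggests this repo follows. A minimal sketch of that pattern (my own illustration with a stand-in model, not the repo's code):

import torch

model = torch.nn.Linear(1024, 1024).cuda()        # stand-in for the world model
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()              # loss scaler, as in common/utils.py

x = torch.randn(16, 1024, device="cuda")
with torch.cuda.amp.autocast():                   # forward pass runs largely in fp16
    loss = model(x).pow(2).mean()
scaler.scale(loss).backward()                     # scaled backward, same shape as utils.py:105
scaler.step(opt)                                  # unscales grads, skips the step on inf/nan
scaler.update()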
It was at the beginning of training. Can you tell me which versions of pytorch and gym[atari] you use? I had some errors when I used conda, so I installed them with pip. I'm not sure whether that makes a difference.
Logdir /root/logdir/atari_pong/dreamerv2/1
Create envs.
A.L.E: Arcade Learning Environment (version 0.7.5+db37282)
[Powered by Stella]
Create agent.
setting memory format to channels last
setting memory format to channels last
setting memory format to channels last
setting memory format to channels last
setting memory format to channels last
setting memory format to channels last
setting memory format to channels last
setting memory format to channels last
setting fp16
Load agent
Init optimizer - model
Init optimizer - actor
Init optimizer - critic
./common/utils.py:365: UserWarning: This overload of addcmul_ is deprecated:
addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:1174.)
exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
Start evaluation.
Eval episode has 2192 steps and return -12.0.
[546352] eval_transitions 1.3e4 / eval_return -12 / eval_length 2192 / eval_eps 10 / memory_usage_kb 6.2e6
Start training.
[546356] kl_loss 2.43 / image_loss 1.1e4 / reward_loss 0.92 / discount_loss 7.9e-3 / model_kl 2.43 / prior_ent 27.9 / post_ent 25.42 / model_loss 1.1e4 / model_grad_norm inf / model_loss_scale 1.6e4 / actor_loss 0.03 / actor_grad_norm 0.02 / actor_loss_scale 6.6e4 / critic_loss 0.91 / critic_grad_norm 0.13 / critic_loss_scale 6.6e4 / reward_mean -0.02 / reward_std 0.09 / critic_slow -1.5 / critic_target -1.57 / discount 1 / actor_ent 1.78 / actor_ent_scale 1e-3 / actor_logits_mse 8.47 / z_actor_logits_policy_max 13.59 / z_actor_logits_policy_min -12.48 / critic -1.59 / fps 0
Training Error: CUDA out of memory. Tried to allocate 214.00 MiB (GPU 0; 23.65 GiB total capacity; 19.50 GiB already allocated; 134.31 MiB free; 22.20 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "dreamerv2/train.py", line 273, in <module>
raise e
File "dreamerv2/train.py", line 263, in <module>
train_driver(agnt.policy, steps=config.eval_every)
File "./common/driver.py", line 67, in __call__
[callback(tran, **self._kwargs) for callback in self._on_steps]
File "./common/driver.py", line 67, in <listcomp>
[callback(tran, **self._kwargs) for callback in self._on_steps]
File "dreamerv2/train.py", line 242, in train_step
_, mets = agnt.train(next_batch(train_dataset))
File "/opt/dreamerV2-pytorch/dreamerv2/agent.py", line 114, in train
metrics.update(self._task_behavior.train(self.wm, start, reward))
File "/opt/dreamerV2-pytorch/dreamerv2/agent.py", line 333, in train
self.critic_opt.backward(critic_loss)
File "./common/utils.py", line 105, in backward
self._scaler.scale(loss).backward(retain_graph=retain_graph)
File "/opt/dreamer_env/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/opt/dreamer_env/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 214.00 MiB (GPU 0; 23.65 GiB total capacity; 19.50 GiB already allocated; 134.31 MiB free; 22.20 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
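The error message itself points at one mitigation worth trying: setting max_split_size_mb through PYTORCH_CUDA_ALLOC_CONF so the caching allocator fragments less. A hedged sketch (the 128 MiB value is only an example, not a tuned recommendation; it can also be set inline on the command line, e.g. PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python3 dreamerv2/train.py ...):

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # set before the first CUDA allocation

import torch
x = torch.zeros(1, device="cuda")  # first allocation initializes the allocator with the setting above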
The good thing about installing pytorch with conda is that it also installs cuda, and I think it is usually better optimized than the pip version.
ale-py 0.7.3
atari-py 0.2.9
gym 0.21.0
pytorch 1.10.2 py3.9_cuda11.3_cudnn8.2.0_0 pytorch
cudatoolkit 11.3.1 h2bc3f7f_2
Those are the specific versions that my env uses. Maybe this is another sign that for projects like this it would be useful to provide a Docker image.
Btw, for installing pytorch with conda it's better to just follow the instructions on their website (https://pytorch.org/). I realize now that recommending an old conda version might not be the best idea, although I'm not sure.
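As a quick sanity check after the conda install (my own snippet, not part of the repo), it is worth confirming that the interpreter actually picks up the CUDA build listed above before re-running training:

import torch

print(torch.__version__)               # expect something like 1.10.2
print(torch.version.cuda)              # expect 11.3 for the cudatoolkit above
print(torch.backends.cudnn.version())  # expect an 8.x build
print(torch.cuda.is_available())       # should print True
print(torch.cuda.get_device_name(0))   # the GPU training will run on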
Hi,
Thanks for your help. Unfortunately, there are still some other library version problems.
Could you please provide a list of all your library versions, e.g.
pip freeze > requirements.txt
or
conda list -e > requirements.txt
, if you installed all of them with pip or conda?
Thanks a lot!
I would recommend just creating a new virtualenv/conda environment from scratch and installing pytorch plus the requirements.txt.
Here are the files anyway: conda_req.txt pip_req.txt
Thanks! I finally ran it successfully! There was a small error from utils.py: 'GIF summaries require ffmpeg in $PATH.'. I think it is used for saving logs. I commented out those lines (L168-L170) and then it worked.
Glad it worked! Those lines are probably for saving the agent GIFs. You can just install ffmpeg if you want that functionality.
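If commenting out the lines feels fragile, another option is to guard the GIF summary on ffmpeg being present. A small sketch, where write_gif stands in for whatever function utils.py calls around L168-L170 (a hypothetical name, not the repo's API):

import shutil

def maybe_write_gif(write_gif, *args, **kwargs):
    # Skip GIF summaries gracefully instead of raising when ffmpeg is missing.
    if shutil.which("ffmpeg") is None:
        print("ffmpeg not found in $PATH; skipping GIF summary")
        return
    return write_gif(*args, **kwargs)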
Hi, thanks for your update! I tried to run python3 dreamerv2/train.py --logdir ~/logdir/atari_pong/dreamerv2/1 --configs defaults atari --task atari_pong, but got an OOM error. What's your configuration? I think my GPU memory is big enough (24 GB); do you have any idea what could be going on?
RuntimeError: CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 23.65 GiB total capacity; 19.76 GiB already allocated; 24.31 MiB free; 22.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
Thanks again!
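One way to narrow down where the 24 GB goes (my own diagnostic sketch, not something the repo provides) is to log allocator statistics right before the training step that crashes; if allocated memory grows batch after batch it points at a leak, whereas a constant high value points at batch/model size:

import torch

def log_cuda_memory(tag=""):
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")
    # torch.cuda.memory_summary() prints a far more detailed breakdown if needed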