esteveste / dreamerV2-pytorch

Pytorch implementation of DreamerV2: Mastering Atari with Discrete World Models, based on the original implementation

CUDA out of memory. #3

Closed (initial-h closed this issue 2 years ago)

initial-h commented 2 years ago

Hi, thanks for your update! I tried to run python3 dreamerv2/train.py --logdir ~/logdir/atari_pong/dreamerv2/1 --configs defaults atari --task atari_pong, but got an OOM error. What's your configuration? I think my GPU memory is big enough (24 GB). Do you have any idea what could cause it?

RuntimeError: CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 23.65 GiB total capacity; 19.76 GiB already allocated; 24.31 MiB free; 22.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.

Thanks again!
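As a side note, the allocator hint in the error message above can be tried by setting PYTORCH_CUDA_ALLOC_CONF before launching training. A minimal sketch (the 128 MiB split size is just an illustrative value, and this only mitigates fragmentation, not genuinely running out of memory):

    # hypothetical mitigation: limit the size of cached allocator blocks to reduce fragmentation
    PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python3 dreamerv2/train.py --logdir ~/logdir/atari_pong/dreamerv2/1 --configs defaults atari --task atari_pong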

esteveste commented 2 years ago

Mmmm, that's weird. I use the default configuration in config.yaml for Atari games. On my end, it uses around 5 GB of CUDA memory with fp16, and around 7 GB with fp32 (--precision 32).

Did you get that error at the beginning of training, or was it a memory leak? Otherwise, I would try making sure PyTorch/CUDA are on the latest versions. Maybe that would help?
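For reference, a quick way to check which PyTorch/CUDA build is actually being used (standard PyTorch calls, nothing project-specific):

    # print the PyTorch version, the CUDA version it was built against, and whether the GPU is visible
    python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
    # check driver version and GPU memory usage from the system side
    nvidia-smi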

initial-h commented 2 years ago

It was at the beginning of training. Can you tell me the versions of your pytorch and gym[atari]? I had some errors when I used conda, so I installed them using pip. I'm not sure if that makes a difference.

Logdir /root/logdir/atari_pong/dreamerv2/1
Create envs.
A.L.E: Arcade Learning Environment (version 0.7.5+db37282)
[Powered by Stella]
Create agent.
setting memory format to channels last
setting memory format to channels last
setting memory format to channels last
setting memory format to channels last
setting memory format to channels last
setting memory format to channels last
setting memory format to channels last
setting memory format to channels last
setting fp16
Load agent
Init optimizer - model
Init optimizer - actor
Init optimizer - critic
./common/utils.py:365: UserWarning: This overload of addcmul_ is deprecated:
        addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
        addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at  ../torch/csrc/utils/python_arg_parser.cpp:1174.)
  exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
Start evaluation.
Eval episode has 2192 steps and return -12.0.
[546352] eval_transitions 1.3e4 / eval_return -12 / eval_length 2192 / eval_eps 10 / memory_usage_kb 6.2e6
Start training.
[546356] kl_loss 2.43 / image_loss 1.1e4 / reward_loss 0.92 / discount_loss 7.9e-3 / model_kl 2.43 / prior_ent 27.9 / post_ent 25.42 / model_loss 1.1e4 / model_grad_norm inf / model_loss_scale 1.6e4 / actor_loss 0.03
/ actor_grad_norm 0.02 / actor_loss_scale 6.6e4 / critic_loss 0.91 / critic_grad_norm 0.13 / critic_loss_scale 6.6e4 / reward_mean -0.02 / reward_std 0.09 / critic_slow -1.5 / critic_target -1.57 / discount 1 / actor_ent 1.78 / actor_ent_scale 1e-3 / actor_logits_mse 8.47 / z_actor_logits_policy_max 13.59 / z_actor_logits_policy_min -12.48 / critic -1.59 / fps 0
Training Error: CUDA out of memory. Tried to allocate 214.00 MiB (GPU 0; 23.65 GiB total capacity; 19.50 GiB already allocated; 134.31 MiB free; 22.20 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "dreamerv2/train.py", line 273, in <module>
    raise e
  File "dreamerv2/train.py", line 263, in <module>
    train_driver(agnt.policy, steps=config.eval_every)
  File "./common/driver.py", line 67, in __call__
    [callback(tran, **self._kwargs) for callback in self._on_steps]
  File "./common/driver.py", line 67, in <listcomp>
    [callback(tran, **self._kwargs) for callback in self._on_steps]
  File "dreamerv2/train.py", line 242, in train_step
    _, mets = agnt.train(next_batch(train_dataset))
  File "/opt/dreamerV2-pytorch/dreamerv2/agent.py", line 114, in train
    metrics.update(self._task_behavior.train(self.wm, start, reward))
  File "/opt/dreamerV2-pytorch/dreamerv2/agent.py", line 333, in train
    self.critic_opt.backward(critic_loss)
  File "./common/utils.py", line 105, in backward
    self._scaler.scale(loss).backward(retain_graph=retain_graph)
  File "/opt/dreamer_env/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/dreamer_env/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 214.00 MiB (GPU 0; 23.65 GiB total capacity; 19.50 GiB already allocated; 134.31 MiB free; 22.20 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
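Unrelated to the OOM, the addcmul_ deprecation warning in the log above can be silenced by switching to the keyword signature that the warning itself suggests; a minimal sketch of the changed line in ./common/utils.py:

    # old (deprecated): exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)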
esteveste commented 2 years ago

The good thing about installing pytorch with conda is that it also installs CUDA, and I think it is usually more optimized than the pip version.

ale-py                    0.7.3 
atari-py                  0.2.9
gym                       0.21.0
pytorch                   1.10.2          py3.9_cuda11.3_cudnn8.2.0_0    pytorch
cudatoolkit               11.3.1               h2bc3f7f_2

Those are the specific versions my env uses. Maybe this is another sign that it would be useful to provide a Docker image for these projects.

Btw, for installing pytorch with conda it is better to just follow the instructions on their website (https://pytorch.org/). I now realize that recommending an old conda version might not be the best idea, although I'm not sure.
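For those exact versions, the matching command on the pytorch.org "previous versions" page should look roughly like the line below (the torchvision/torchaudio versions are the ones paired with 1.10.2 there; drop them if you don't need them):

    # install PyTorch 1.10.2 built against CUDA 11.3 from the official pytorch channel
    conda install pytorch==1.10.2 torchvision==0.11.3 torchaudio==0.10.2 cudatoolkit=11.3 -c pytorch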

initial-h commented 2 years ago

Hi, thanks for your help. Unfortunately, there are still some other library version problems. Could you please provide a list of all your libraries' versions, e.g. via pip freeze > requirements.txt or conda list -e > requirements.txt, depending on whether you installed them with pip or conda?

Thanks a lot!

esteveste commented 2 years ago

I would recommend just creating a new virtualenv/conda environment from scratch and installing pytorch plus the requirements.txt.

Here are the files anyway: conda_req.txt, pip_req.txt
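Assuming conda_req.txt was produced with conda list -e and pip_req.txt with pip freeze (as discussed above), recreating the environment from them would look roughly like this (the env name dreamerv2 is arbitrary):

    # create a conda env from the explicit spec file, then add the pip-only packages
    conda create --name dreamerv2 --file conda_req.txt
    conda activate dreamerv2
    pip install -r pip_req.txt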

initial-h commented 2 years ago

Thanks! I finally ran it successfully! There was a small error from utils.py: 'GIF summaries require ffmpeg in $PATH.'. I think it is used to save logs. I commented out those lines (L168-L170), and then it worked.

esteveste commented 2 years ago

Glad it worked! Those lines are probably for saving the agent GIFs. You can just install ffmpeg if you want that functionality.
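If you do want the GIF summaries, installing ffmpeg and making sure it is on $PATH should be enough; for example (either command works, depending on your setup):

    # inside the conda env
    conda install -c conda-forge ffmpeg
    # or system-wide on Debian/Ubuntu
    sudo apt-get install ffmpeg
    # verify it is found on $PATH
    which ffmpeg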