MushroomRL / mushroom-rl

Python library for Reinforcement Learning.
MIT License

suspected memory leak #111

Closed davidenitti closed 1 year ago

davidenitti commented 1 year ago

Describe the bug
I run a simple DQN on the Breakout Atari game and the memory slowly increases: after 20-30 epochs it takes 64GB of RAM and keeps growing after that. I use 1 million transitions for the replay memory, but I thought that after 4 epochs of 250k iterations the replay memory should already be full, so the used RAM shouldn't increase after that. Am I right? I'm training on CPU, but I guess this shouldn't influence the memory leak.
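
For reference, a rough back-of-envelope for the replay-buffer footprint (a sketch assuming the standard Nature-DQN preprocessing of 84x84 grayscale uint8 frames stacked 4 deep; whether frames are shared between state and next_state is an assumption and changes the total considerably):

# Rough estimate of the RAM needed by a 1M-transition Atari replay buffer.
# Assumes 84x84 grayscale uint8 frames with a 4-frame history (Nature-DQN setup).
frame_bytes = 84 * 84                      # one grayscale uint8 frame
history = 4                                # frames per stacked state
size = 1_000_000                           # replay memory size

shared = size * frame_bytes                # each frame stored once (e.g. lazy frames)
copied = size * frame_bytes * history * 2  # state and next_state stored as full copies

print(f"frames stored once: ~{shared / 1e9:.1f} GB")   # ~7.1 GB
print(f"full copies:       ~{copied / 1e9:.1f} GB")    # ~56.4 GB

Either way, once the buffer is full the footprint should plateau rather than keep growing.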

System information (please complete the following information):

boris-il-forte commented 1 year ago

Dear @davidenitti,
the behavior you are reporting is quite unexpected; this memory usage shouldn't happen at all. I suspect two possible issues:

1) too much memory is consumed by the logging of the info dictionary. This is a new feature of MushroomRL and we are still not sure it works properly;
2) you are using the dataset logging callback. In that case, you are logging the dataset at every step, and this may cause the memory to explode if you don't clean up the collected samples.

Can you check that the issue is actually not (2)? If it isn't, can you check whether you can reproduce the problem with a stable version of MushroomRL?

davidenitti commented 1 year ago

@boris-il-forte what should I look for in the code to check this? I thought the memory increase was due to the replay memory, which I set to 1M as in the Nature DQN paper.

boris-il-forte commented 1 year ago

there's a callback called CollectDataset. If you are using this callback and you are not clearing out the data, you will accumulate a disproportionate amount of images in a dataset; this is often the cause of high memory usage. A quick way to check whether (1) is the problem is to roll back to an older version of MushroomRL (any version before 1.9.0 should be fine) and see if the memory leak is still there. If the leak disappears, then the issue is almost surely in the new functionality of MushroomRL 1.9.0, i.e. the logging of info data (in the case of Atari, the number of lives and the frame number).

If the issue is (1), I'll take care of looking into it; otherwise I'll ask @carloderamo or @AhmedMagdyHendawy to take a look, as they maintain the DQN variants.
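
A minimal sketch of case (2), assuming the CollectDataset callback from mushroom_rl.utils.callbacks with its get()/clean() methods and the callbacks_fit argument of Core (check the installed version for the exact names; agent, mdp and n_epochs are set up elsewhere):

from mushroom_rl.core import Core
from mushroom_rl.utils.callbacks import CollectDataset

# The callback accumulates every sample seen during fitting; if it is never
# cleared, the stored frames grow without bound on top of the replay memory.
collect_dataset = CollectDataset()
core = Core(agent, mdp, callbacks_fit=[collect_dataset])

for epoch in range(n_epochs):
    core.learn(n_steps=250_000, n_steps_per_fit=1)
    dataset = collect_dataset.get()   # use the collected samples for logging/evaluation
    collect_dataset.clean()           # then drop them, otherwise memory keeps growing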

davidenitti commented 1 year ago

I have the same memory issue with version 1.7.2

davidenitti commented 1 year ago

Actually, I'm not sure I can test on 1.7.2 because I hit bugs, such as:

Traceback (most recent call last):
  File "code/RL/benchmark_RL.py", line 39, in <module>
    main_atari.main(params)
  File "/home/administrator/code/RL/main_atari.py", line 437, in main
    core.learn(n_steps=initial_replay_size,
  File "/home/administrator/.local/lib/python3.8/site-packages/mushroom_rl/core/core.py", line 75, in learn
    self._run(n_steps, n_episodes, fit_condition, render, quiet)
  File "/home/administrator/.local/lib/python3.8/site-packages/mushroom_rl/core/core.py", line 125, in _run
    return self._run_impl(move_condition, fit_condition, steps_progress_bar,
  File "/home/administrator/.local/lib/python3.8/site-packages/mushroom_rl/core/core.py", line 139, in _run_impl
    self.reset(initial_states)
  File "/home/administrator/.local/lib/python3.8/site-packages/mushroom_rl/core/core.py", line 216, in reset
    self._state = self._preprocess(self.mdp.reset(initial_state).copy())
  File "/home/administrator/.local/lib/python3.8/site-packages/mushroom_rl/environments/atari.py", line 100, in reset
    self._state = preprocess_frame(self.env.reset(), self._img_size)
  File "/home/administrator/.local/lib/python3.8/site-packages/mushroom_rl/utils/frames.py", line 50, in preprocess_frame
    image = cv2.cvtColor(obs, cv2.COLOR_RGB2GRAY)
cv2.error: OpenCV(4.6.0) :-1: error: (-5:Bad argument) in function 'cvtColor'
> Overload resolution failed:
>  - src is not a numerical tuple
>  - Expected Ptr<cv::UMat> for argument 'src'

I fixed this by replacing the line with image = cv2.cvtColor(obs[0], cv2.COLOR_RGB2GRAY), but then I get another error:

  File "/home/davide/Dropbox/Apps/davide_colab/code/ML/RL/main_atari.py", line 437, in main
    core.learn(n_steps=initial_replay_size,
  File "/usr/local/lib/python3.10/dist-packages/mushroom_rl/core/core.py", line 75, in learn
    self._run(n_steps, n_episodes, fit_condition, render, quiet)
  File "/usr/local/lib/python3.10/dist-packages/mushroom_rl/core/core.py", line 125, in _run
    return self._run_impl(move_condition, fit_condition, steps_progress_bar,
  File "/usr/local/lib/python3.10/dist-packages/mushroom_rl/core/core.py", line 141, in _run_impl
    sample = self._step(render)
  File "/usr/local/lib/python3.10/dist-packages/mushroom_rl/core/core.py", line 189, in _step
    next_state, reward, absorbing, _ = self.mdp.step(action)
  File "/usr/local/lib/python3.10/dist-packages/mushroom_rl/environments/atari.py", line 118, in step
    obs, _, _, _ = self.env.env.step(1)
ValueError: too many values to unpack (expected 4)

That's why I was using the GitHub version.

boris-il-forte commented 1 year ago

This issue is due to breaking changes in the Atari interface, caused by the modifications to the OpenAI Gym interface. You have to find a compatible Gym/Atari version. Unfortunately, I cannot tell you which version works, as it was too long ago.
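
For reference, the two tracebacks above match the gym 0.26 interface change; a minimal illustration of the old vs. new API (CartPole-v1 is used only because it needs no Atari ROMs, and the exact behaviour depends on the installed gym version):

import gym

env = gym.make("CartPole-v1")

# Old gym (< 0.26): reset() returned the observation only and step() a 4-tuple:
#   obs = env.reset()
#   obs, reward, done, info = env.step(action)

# New gym (>= 0.26): reset() returns (observation, info) and step() a 5-tuple,
# which is why obs arrives as a tuple in preprocess_frame and why
# "obs, _, _, _ = self.env.env.step(1)" raises "too many values to unpack".
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
done = terminated or truncated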

We will try to have a look at this bug during the winter holiday season. Unfortunately, the gym transition caused a lot of hiccups in the interface and in the package maintenance. We are sorry for this, but it doesn't depend fully on us.

If you are willing to share the code that is causing you trouble, we may take a look and see if there's something wrong on your side or if it's a bug in MushroomRL.

davidenitti commented 1 year ago

I reinstalled everything on another server using the GitHub version of mushroom_rl and it seems that I don't have the memory leak anymore. I'm not sure what changed in the meantime; my code to run the experiment is the same.

boris-il-forte commented 1 year ago

Then I guess the issue was probably caused by some torch or NumPy version, or some combination of libraries. Closing for now. Feel free to reopen the issue if you have more data/insights into the problem.