Denys88 / rl_games

RL implementations
MIT License

is it possible to "play" a model without initializing cuda? to avoid memory issues #152

Closed (stuartcrobinson closed this 2 years ago)

stuartcrobinson commented 2 years ago

I'm running this command to "play" my trained model without using the GPU:

python train.py task=Ant test=True checkpoint=cp.pth num_envs=4 sim_device=cpu rl_device=cpu pipeline=cpu

But I still sometimes get this CUDA out-of-memory error if I run it while a model is being trained in a different terminal window:

Error executing job with overrides: ['task=Ant', 'test=True', 'checkpoint=cp.pth', 'num_envs=4', 'sim_device=cpu', 'rl_device=cpu', 'pipeline=cpu']
Traceback (most recent call last):
  File "train.py", line 134, in <module>
    launch_rlg_hydra()
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/main.py", line 52, in decorated_main
    config_name=config_name,
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/_internal/utils.py", line 378, in _run_hydra
    lambda: hydra.run(
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/_internal/utils.py", line 381, in <lambda>
    overrides=args.overrides,
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "train.py", line 130, in launch_rlg_hydra
    'play': cfg.test,
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/torch_runner.py", line 142, in run
    player = self.create_player()
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/torch_runner.py", line 128, in create_player
    return self.player_factory.create(self.algo_name, config=self.config)
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/common/object_factory.py", line 15, in create
    return builder(**kwargs)
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/torch_runner.py", line 29, in <lambda>
    self.player_factory.register_builder('a2c_continuous', lambda **kwargs : players.PpoPlayerContinuous(**kwargs))
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/algos_torch/players.py", line 28, in __init__
    self.actions_low = torch.from_numpy(self.action_space.low.copy()).float().to(self.device)
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/torch/cuda/__init__.py", line 170, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA error: out of memory

I asked in the NVIDIA forum too, but thought I would check here in case it's an unavoidable rl_games thing:

https://forums.developer.nvidia.com/t/play-a-checkpoint-file-without-using-gpu-at-all-to-avoid-memory-errors/212764

Also, the memory error persists until I reboot. Is that a memory leak? Or is there any way rl_games could clear the GPU memory?
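The traceback above ends in `_lazy_init` inside `torch/cuda/__init__.py`, reached from `.to(self.device)` in `players.py`, which suggests the player's device was still a CUDA device despite the `cpu` overrides. One general way to make sure a process never initializes CUDA, independent of rl_games, is to hide the GPUs with the `CUDA_VISIBLE_DEVICES` environment variable before PyTorch makes its first CUDA call. The sketch below is a minimal illustration under that assumption; `pick_device` is a hypothetical helper, not part of rl_games, and this thread does not confirm whether Isaac Gym itself runs with no visible GPU.

```python
import os

# Hide every GPU from this process. This has to happen before the first CUDA
# call (safest: before importing torch), because the variable is read when
# CUDA is initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = ""


def pick_device() -> str:
    """Hypothetical helper: return "cuda" only if torch can see a GPU."""
    try:
        import torch
    except ImportError:
        return "cpu"  # torch is not installed, so there is nothing to initialize
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```

The same effect should be available from the shell by prefixing the original command with `CUDA_VISIBLE_DEVICES=""`, so that any stray `.to("cuda")` fails immediately instead of competing for memory with the training run.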

stuartcrobinson commented 2 years ago

Actually, something weird is happening. I think there is a memory leak somewhere. I get the memory error after starting and immediately stopping training a few times in succession, but not when running a single training command for a long time. I'm going to close this for now and keep investigating.

Denys88 commented 2 years ago

I've never experienced this before. I've had 100 or more runs with early stopping using Ctrl+C. If you can give more details, that would be helpful.

stuartcrobinson commented 2 years ago

Never mind, sorry about this. I was mistakenly ending training with Ctrl-Z, which left processes alive in the background. I thought I saw some docs recommend Ctrl-Z, but I misread. I am still curious why it has to initialize CUDA even when everything is set to use the CPU, but it doesn't affect me anymore. Thank you for an amazing repo, btw.
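The Ctrl-Z behaviour described above can be reproduced in isolation. On a POSIX terminal, Ctrl-Z sends SIGTSTP, which suspends the foreground process rather than ending it, so everything it has allocated stays allocated; Ctrl-C sends SIGINT, which normally ends it. The sketch below is a minimal POSIX-only illustration: the sleeping child is a hypothetical stand-in for a training run, and SIGSTOP is used in place of SIGTSTP because it has the same suspending effect when sent programmatically.

```python
import signal
import subprocess
import sys
import time

# Hypothetical stand-in for a training run: a child process that just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
time.sleep(0.5)  # give the child a moment to start

# Suspend the child, as Ctrl-Z would do to a foreground job.
child.send_signal(signal.SIGSTOP)
time.sleep(0.2)
still_alive_after_stop = child.poll() is None  # suspended, but not terminated

# To actually end it: resume the process, then terminate it.
child.send_signal(signal.SIGCONT)
child.terminate()  # sends SIGTERM
exit_code = child.wait(timeout=10)

print("alive after stop:", still_alive_after_stop)
print("exit code after terminate:", exit_code)
```

In an interactive shell, suspended jobs like this can be listed with `jobs` and ended with `fg` followed by Ctrl-C, or with `kill %1`, which avoids the need to reboot.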