facebookresearch / habitat-lab

A modular high-level library to train embodied AI agents across a variety of tasks and environments.
https://aihabitat.org/
MIT License

How to reduce cpu memory usage #630

Closed: zhengzaiyi closed this issue 2 years ago

zhengzaiyi commented 3 years ago

Hello, I'm training the RGB-sensor baseline with PPO and I encountered the following EOFError:

Traceback (most recent call last):
  File "habitat_baselines/run.py", line 79, in <module>
    main()
  File "habitat_baselines/run.py", line 40, in main
    run_exp(**vars(args))
  File "habitat_baselines/run.py", line 75, in run_exp
    execute_exp(config, run_type)
  File "habitat_baselines/run.py", line 58, in execute_exp
    trainer.train()
  File "/home/m/anaconda3/envs/habitat/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/home/m/桌面/habitat-lab/habitat_baselines/rl/ppo/ppo_trainer.py", line 769, in train
    buffer_index
  File "/home/m/桌面/habitat-lab/habitat_baselines/rl/ppo/ppo_trainer.py", line 449, in _collect_environment_result
    for index_env in range(env_slice.start, env_slice.stop)
  File "/home/m/桌面/habitat-lab/habitat_baselines/rl/ppo/ppo_trainer.py", line 449, in <listcomp>
    for index_env in range(env_slice.start, env_slice.stop)
  File "/home/m/anaconda3/envs/habitat/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/home/m/桌面/habitat-lab/habitat/core/vector_env.py", line 409, in wait_step_at
    return self._connection_read_fns[index_env]()
  File "/home/m/桌面/habitat-lab/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/home/m/桌面/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "/home/m/anaconda3/envs/habitat/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/m/anaconda3/envs/habitat/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/m/anaconda3/envs/habitat/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
Exception ignored in: <bound method VectorEnv.__del__ of <habitat.core.vector_env.VectorEnv object at 0x7fa1e7ae07f0>>
Traceback (most recent call last):
  File "/home/m/桌面/habitat-lab/habitat/core/vector_env.py", line 588, in __del__
    self.close()
  File "/home/m/桌面/habitat-lab/habitat/core/vector_env.py", line 456, in close
    read_fn()
  File "/home/m/桌面/habitat-lab/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/home/m/桌面/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "/home/m/anaconda3/envs/habitat/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/m/anaconda3/envs/habitat/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/m/anaconda3/envs/habitat/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError: 
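
The EOFError itself is only a symptom: the other end of the pipe, a VectorEnv worker process, died, so the read hit end-of-file. A minimal sketch for watching the resident memory of a training run and its workers while they are alive (assuming psutil, which is not a habitat-lab dependency, and a hypothetical watch_mem.py helper):

# watch_mem.py -- print the RSS of a process and all of its children
# every few seconds (illustrative helper; psutil is an assumption here
# and must be installed separately: pip install psutil).
import sys
import time

import psutil

def watch(pid: int, interval: float = 5.0) -> None:
    parent = psutil.Process(pid)
    while True:
        total = 0
        for proc in [parent] + parent.children(recursive=True):
            try:
                rss = proc.memory_info().rss
            except psutil.NoSuchProcess:
                continue  # worker exited between listing and reading
            total += rss
            print(f"pid={proc.pid} rss={rss / 2**30:.2f} GiB")
        print(f"total rss={total / 2**30:.2f} GiB\n")
        time.sleep(interval)

if __name__ == "__main__":
    watch(int(sys.argv[1]))  # usage: python watch_mem.py <trainer pid>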

After running the command dmesg, the output is listed below:

[ 3634.303204] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 3634.303334] [   1991]  1000  1991  1154039    22066   765952     9688             0 code
[ 3634.303335] [   2006]  1000  2006  1164007     5220   868352     3517             0 code
[ 3634.303336] [   2028]  1000  2028  1119690     3080   442368     3343             0 code
[ 3634.303337] [   2036]  1000  2036  1117076     1501   389120     2298             0 code
[ 3634.303338] [   2183]  1000  2183  2168791     2010   520192     7317             0 code
[ 3634.303339] [   2269]  1000  2269   254753     5636   417792     1494             0 gnome-terminal-
[ 3634.303340] [   2277]  1000  2277     4878        2    49152      425             0 bash
[ 3634.303341] [   2334]  1000  2334  4122854   661619 13545472    82279             0 python
[ 3634.303342] [   2338]  1000  2338     6840        0    73728     1409             0 python
[ 3634.303344] [   2339]  1000  2339     7857       55    86016     1596             0 python
[ 3634.303345] [   2340]  1000  2340  2820340  1354451 14409728   128062             0 python
[ 3634.303346] [   2341]  1000  2341  2643390  1176030 13004800   130075             0 python
[ 3634.303347] [   2350]  1000  2350    42788      166    86016        0             0 gvfsd-metadata
[ 3634.303348] [   2353]  1000  2353   112500     2228   196608        0             0 update-notifier
[ 3634.303349] [   2412]  1000  2412     4878      426    57344        0             0 bash
[ 3634.303350] [   2462]  1000  2462  1057596    93751  2691072        0             0 firefox
[ 3634.303351] [   2523]  1000  2523   625998     7370   761856        0             0 Privileged Cont
[ 3634.303352] [   2609]  1000  2609  2328407    71923  2531328        0             0 WebExtensions
[ 3634.303353] [   2690]  1000  2690   800494    70273  2551808        0             0 Web Content
[ 3634.303354] [   2717]  1000  2717     4911      453    65536        0             0 bash
[ 3634.303355] [   2739]  1000  2739   179663     1550   147456        0             0 clash
[ 3634.303356] [   2880]  1000  2880   699982    23103  1458176        0             0 Web Content
[ 3634.303357] [   2929]  1000  2929     4911      457    65536        0             0 bash
[ 3634.303358] [   3024]  1000  3024    49518     2017   278528        0             0 RDD Process
[ 3634.303359] [   3211]  1000  3211   668210    42688  1089536        0             0 Web Content
[ 3634.303360] [   3392]  1000  3392   676071    28224  1347584        0             0 Web Content
[ 3634.303361] [   3623]  1000  3623     4878      405    57344        0             0 bash
[ 3634.303362] [   3640]  1000  3640   513174    47215  1593344        0             0 tensorboard
[ 3634.303364] [   3753]  1000  3753    72156      353   118784        0             0 gvfsd-http
[ 3634.303365] [   3955]  1000  3955   749429    32668  1622016        0             0 Web Content
[ 3634.303366] [   4029]  1000  4029     4985      487    65536        0             0 bash
[ 3634.303368] [   4330]  1000  4330     1512       40    57344        0             0 zoom
[ 3634.303369] [   4332]  1000  4332  1090506    54243  1433600        0             0 zoom
[ 3634.303370] [   4771]  1000  4771   606889     5171   516096        0             0 Web Content
[ 3634.303371] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=2340,uid=1000
[ 3634.303380] Out of memory: Killed process 2340 (python) total-vm:11281360kB, anon-rss:5361328kB, file-rss:56452kB, shmem-rss:24kB, UID:1000 pgtables:14072kB oom_score_adj:0
[ 3634.507526] oom_reaper: reaped process 2340 (python), now anon-rss:0kB, file-rss:56444kB, shmem-rss:24kB
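
For reference, the rss column in the kernel's OOM table is counted in 4 KiB pages, so process 2340 held 1354451 × 4 KiB ≈ 5.2 GiB, which matches the kill message exactly (anon-rss 5361328 + file-rss 56452 + shmem-rss 24 = 5417804 kB); the large python processes (2334, 2340, 2341) together held roughly 12 GiB of the 16 GB available.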

It seems I'm running out of memory (my machine has 16 GB of CPU memory), and I wonder what I can do to reduce CPU memory usage. Here is my ppo_pointnav.yaml (to avoid GPU memory overflow I set num_processes to 2, so the CPU memory usage of each process became larger):

# Hyperparameters and ResNet18 from https://arxiv.org/abs/2012.0611

VERBOSE: False

BASE_TASK_CONFIG_PATH: "configs/tasks/pointnav_gibson.yaml"
TRAINER_NAME: "ppo"
ENV_NAME: "NavRLEnv"
SIMULATOR_GPU_ID: 0
TORCH_GPU_ID: 0
VIDEO_OPTION: ["disk", "tensorboard"]
TENSORBOARD_DIR: "tb_rrrgb"
VIDEO_DIR: "video_dir"
# Evaluate on all episodes
TEST_EPISODE_COUNT: -1
EVAL_CKPT_PATH_DIR: "data/new_checkpoints"
NUM_ENVIRONMENTS: 6
NUM_PROCESSES: 2
SENSORS: ["RGB_SENSOR"]
CHECKPOINT_FOLDER: "data/new_checkpoints"
TOTAL_NUM_STEPS: 75e6
NUM_UPDATES: -1
LOG_INTERVAL: 25
NUM_CHECKPOINTS: 100

RL:
  PPO:
    # ppo params
    clip_param: 0.2
    ppo_epoch: 4
    num_mini_batch: 2
    value_loss_coef: 0.5
    entropy_coef: 0.01
    lr: 2.5e-4
    eps: 1e-5
    max_grad_norm: 0.5
    num_steps: 128
    hidden_size: 512
    use_gae: True
    gamma: 0.99
    tau: 0.95
    use_linear_clip_decay: True
    use_linear_lr_decay: True
    reward_window_size: 50

    # Use double-buffered sampling; typically helps
    # when environment step time is similar to or larger than
    # policy inference time during rollout generation.
    use_double_buffered_sampler: False

erikwijmans commented 3 years ago

Reducing the number of processes/environments will reduce CPU memory usage, but it will also degrade learning performance. Note that NUM_ENVIRONMENTS and NUM_PROCESSES control the same thing, since there is one process per environment.
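
For illustration, a sketch of lowering the worker count at config-load time instead of editing the yaml, assuming the yacs-based get_config API from this era of habitat_baselines (the config path below is taken from the question's setup, not verified here):

# Sketch: override the worker count when loading the experiment config.
# Assumes the yacs-era habitat_baselines config API, where the opts list
# is merged into the config as alternating key/value pairs.
from habitat_baselines.config.default import get_config

config = get_config(
    "habitat_baselines/config/pointnav/ppo_pointnav.yaml",
    ["NUM_ENVIRONMENTS", "2"],  # fewer workers -> less CPU memory
)

Since NUM_ENVIRONMENTS and NUM_PROCESSES are the same knob, keeping only one of them in the yaml (the config above sets 6 and 2 respectively) avoids ambiguity about the effective worker count.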