Healthcare-Robotics / assistive-gym

Assistive Gym, a physics-based simulation framework for physical human-robot interaction and robotic assistance.

Strange issue with training cooperative scratch environment #13

Closed: hzyjerry closed this issue 3 years ago

hzyjerry commented 3 years ago

When training a co-optimization policy in the scratch environment (instruction: python -m ppo.train_coop --env-name "ScratchItchJaco-v0" --num-env-steps ...), I ran into the error attached below. The strange thing is that it doesn't show up when training non-cooperative policies in the Scratch environment, or when training cooperative policies on other tasks. It seems this could be an issue with the coop training script.

Any idea why this happens?

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/jerry/Projects/Assist/pytorch-a2c-ppo-acktr/ppo/train_coop.py", line 309, in <module>
    main()
  File "/home/jerry/Projects/Assist/pytorch-a2c-ppo-acktr/ppo/train_coop.py", line 109, in main
    actor_critic_human = Policy([obs_human_len], action_space_human,
  File "/home/jerry/Projects/Assist/pytorch-a2c-ppo-acktr/ppo/a2c_ppo_acktr/model.py", line 28, in __init__
    self.base = base(obs_shape[0], **base_kwargs)
  File "/home/jerry/Projects/Assist/pytorch-a2c-ppo-acktr/ppo/a2c_ppo_acktr/model.py", line 224, in __init__
    init_(nn.Linear(num_inputs, hidden_size)),
  File "/home/jerry/Projects/Assist/env/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 77, in __init__
    self.reset_parameters()
  File "/home/jerry/Projects/Assist/env/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 80, in reset_parameters
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))
  File "/home/jerry/Projects/Assist/env/lib/python3.8/site-packages/torch/nn/init.py", line 324, in kaiming_uniform_
    std = gain / math.sqrt(fan)
ZeroDivisionError: float division by zero
Exception ignored in: <function SubprocVecEnv.__del__ at 0x7fde831be820>
Traceback (most recent call last):
  File "/home/jerry/Projects/Assist/env/lib/python3.8/site-packages/baselines/common/vec_env/subproc_vec_env.py", line 121, in __del__
    self.close()
  File "/home/jerry/Projects/Assist/env/lib/python3.8/site-packages/baselines/common/vec_env/vec_env.py", line 98, in close
    self.close_extras()
  File "/home/jerry/Projects/Assist/env/lib/python3.8/site-packages/baselines/common/vec_env/subproc_vec_env.py", line 104, in close_extras
    remote.send(('close', None))
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Zackory commented 3 years ago

Does this happen at the beginning or end of training? If it happens at the end, then it is likely not an issue, as the policy has already been trained and saved. If it occurs at the beginning, have you tried rerunning the script a few times? I recall encountering a bug like this on a few machines, where the PyTorch training library throws an error once in a while, but rerunning the script would resolve it. Sadly, I haven't had the free time to track down and fix this odd bug.

P.S. If you are training on a machine with 4, 8, or 16 virtual cores, I suggest adding the extra parameter '--num-rollouts 32'; I'll be adding this to the documentation soon. It keeps the simulator running for a total of 32 rollouts before updating the PPO policy, and I have found that larger batch sizes of around 32 improve policy performance.
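
For example, the training instruction from the original post would then become (a sketch; all other flags stay as in that command, with the step count left elided):

    python -m ppo.train_coop --env-name "ScratchItchJaco-v0" --num-env-steps ... --num-rollouts 32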

hzyjerry commented 3 years ago

Hi Zack, thanks for the quick response!

The error happens at the beginning of training, and unfortunately rerunning the script multiple times didn't make it go away. I'm on a machine with 12 cores and tried --num-rollouts 12 and --num-rollouts 24, but neither helped.

Looking into it further, it seems that when the MLPBase class is constructed, num_inputs is set to 0, which makes the network initialization fail. Still investigating and unsure why this happens (I haven't made any changes to the code).
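
For reference, the failure can be reproduced in isolation (a minimal sketch, assuming a PyTorch version similar to the one in the traceback above; newer releases may skip initialization of zero-element weights instead of raising):

    import torch.nn as nn

    num_inputs = 0    # what the human observation length ends up being here
    hidden_size = 64  # hypothetical value, standing in for MLPBase's hidden layer size

    # nn.Linear.__init__ calls kaiming_uniform_ on a weight of shape
    # (hidden_size, num_inputs); with num_inputs == 0 the fan-in is 0, and
    # std = gain / math.sqrt(fan) divides by zero.
    layer = nn.Linear(num_inputs, hidden_size)  # raises ZeroDivisionError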

hzyjerry commented 3 years ago

Found it: the environment name should be ScratchItchJacoHuman-v0 instead of ScratchItchJaco-v0 :P
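
For anyone else hitting this, the coop training instruction then becomes (step count elided as in the original post):

    python -m ppo.train_coop --env-name "ScratchItchJacoHuman-v0" --num-env-steps ...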