Numpy Linalg Error when training model with PPO algorithm

hermanjakobsen commented 3 years ago

Hi!

I have been training my models using PPO with the OSC_POSE controller. On rare occasions, the following error occurs:

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hermankj/Documents/masters_thesis/venv/lib/python3.8/site-packages/stable_baselines3/common/vec_env/subproc_vec_env.py", line 29, in _worker
    observation, reward, done, info = env.step(data)
  File "/home/hermankj/Documents/masters_thesis/venv/lib/python3.8/site-packages/stable_baselines3/common/monitor.py", line 97, in step
    observation, reward, done, info = self.env.step(action)
  File "/home/hermankj/Documents/masters_thesis/venv/lib/python3.8/site-packages/robosuite/wrappers/gym_wrapper.py", line 102, in step
    ob_dict, reward, done, info = self.env.step(action)
  File "/home/hermankj/Documents/masters_thesis/venv/lib/python3.8/site-packages/robosuite/environments/base.py", line 280, in step
    self._pre_action(action, policy_step)
  File "/home/hermankj/Documents/masters_thesis/venv/lib/python3.8/site-packages/robosuite/environments/robot_env.py", line 348, in _pre_action
    robot.control(robot_action, policy_step=policy_step)
  File "/home/hermankj/Documents/masters_thesis/venv/lib/python3.8/site-packages/robosuite/robots/single_arm.py", line 260, in control
    torques = self.controller.run_controller()
  File "/home/hermankj/Documents/masters_thesis/venv/lib/python3.8/site-packages/robosuite/controllers/osc.py", line 317, in run_controller
    lambda_full, lambda_pos, lambda_ori, nullspace_matrix = opspace_matrices(self.mass_matrix,
  File "/home/hermankj/Documents/masters_thesis/venv/lib/python3.8/site-packages/numba/np/linalg.py", line 824, in _inv_err_handler
    raise np.linalg.LinAlgError(
numpy.linalg.LinAlgError: Matrix is singular to machine precision.

As I said, the error is very rare: I have hit it twice in over 100 million total training timesteps.

amandlek commented 3 years ago

Thanks for bringing this up. Could you save the arguments passed to the opspace_matrices function to an .npz file (or something similar)? That way, the next time the code crashes, you can post the file here so we can reproduce the issue and see if there's a good way around it.
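One way to capture those arguments is to wrap the failing call so that its inputs are written to disk only when the error is raised. This is a minimal sketch, not code from robosuite: the helper name `call_with_dump` and the dump path are hypothetical, and it assumes the arguments can be passed by keyword as NumPy arrays.

```python
import numpy as np


def call_with_dump(fn, dump_path, **arrays):
    """Call fn(**arrays); on LinAlgError, save the inputs to dump_path and re-raise.

    `call_with_dump` is a hypothetical helper, not part of robosuite or NumPy.
    """
    try:
        return fn(**arrays)
    except np.linalg.LinAlgError:
        # Each keyword argument becomes one named array in the .npz archive.
        np.savez(dump_path, **arrays)
        raise
```

In a local copy of the controller, the call to opspace_matrices could be routed through such a wrapper with the array names used in this thread (mass_matrix, J_full, J_pos, J_ori), assuming the function accepts them as keywords; otherwise the wrapper needs a small change to pass them positionally. The saved file can then be read back with `np.load`.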

hermanjakobsen commented 3 years ago

The code crashed with the following values:

mass_matrix
[[ 4.59326981e+00  1.10185197e-01  7.15969363e-02  5.22531359e-02
  -2.19264456e-01 -8.19455987e-02]
 [ 1.10185197e-01  4.42715434e+00  1.72530334e+00 -4.79739830e-02
   3.93240090e-02  9.94077174e-03]
 [ 7.15969363e-02  1.72530334e+00  8.03349363e-01  9.19920132e-03
   1.73455666e-02  4.16634737e-03]
 [ 5.22531359e-02 -4.79739830e-02  9.19920132e-03  6.19330560e-02
  -2.92631442e-03 -1.15970994e-03]
 [-2.19264456e-01  3.93240090e-02  1.73455666e-02 -2.92631442e-03
   3.92826452e-02  1.25041998e-02]
 [-8.19455987e-02  9.94077174e-03  4.16634737e-03 -1.15970994e-03
   1.25041998e-02  7.94698359e-03]]
J_full
[[ 9.41989181e-02 -2.30692197e-01 -2.01855671e-01 -1.75258216e-01
  -4.71772788e-02 -3.56256804e-02]
 [ 9.57689837e-01  6.63696729e-02  5.80734633e-02  5.04214300e-02
  -2.09371393e-01 -1.61770122e-01]
 [ 0.00000000e+00 -9.46402321e-01 -5.22462906e-01 -1.31441140e-01
  -3.82522111e-02 -2.83061563e-02]
 [ 0.00000000e+00  2.76483133e-01  2.76483133e-01  2.76483133e-01
  -9.13133997e-01  3.44337436e-01]
 [ 0.00000000e+00  9.61018770e-01  9.61018770e-01  9.61018770e-01
   2.62706782e-01  8.77269132e-02]
 [ 1.00000000e+00  0.00000000e+00 -2.77555756e-17 -1.56125113e-17
  -3.11723355e-01 -9.34738316e-01]]
J_pos
[[ 0.09419892 -0.2306922  -0.20185567 -0.17525822 -0.04717728 -0.03562568]
 [ 0.95768984  0.06636967  0.05807346  0.05042143 -0.20937139 -0.16177012]
 [ 0.         -0.94640232 -0.52246291 -0.13144114 -0.03825221 -0.02830616]]
J_ori
[[ 0.00000000e+00  2.76483133e-01  2.76483133e-01  2.76483133e-01
  -9.13133997e-01  3.44337436e-01]
 [ 0.00000000e+00  9.61018770e-01  9.61018770e-01  9.61018770e-01
   2.62706782e-01  8.77269132e-02]
 [ 1.00000000e+00  0.00000000e+00 -2.77555756e-17 -1.56125113e-17
  -3.11723355e-01 -9.34738316e-01]]

opspace_arguments.zip
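For anyone who wants to inspect values like these, the sketch below checks how close the operational-space inertia is to singular. It assumes the standard operational-space formulation, in which the matrix being inverted is J M^-1 J^T; that formula is an assumption about what opspace_matrices computes, not something stated in this thread. The function names and the `cond_limit` threshold are hypothetical, and the pseudo-inverse fallback is a common workaround, not necessarily the fix the maintainers merged.

```python
import numpy as np


def opspace_inertia_diagnostics(mass_matrix, J):
    """Return (lambda_inv, cond), where lambda_inv = J M^-1 J^T (assumed formulation)."""
    # solve(M, J.T) computes M^-1 J^T without forming M^-1 explicitly.
    lambda_inv = J @ np.linalg.solve(mass_matrix, J.T)
    return lambda_inv, np.linalg.cond(lambda_inv)


def safe_inverse(mat, cond_limit=1e12):
    """Invert mat, falling back to the pseudo-inverse when it is ill-conditioned.

    cond_limit is an illustrative threshold, not a value taken from robosuite.
    """
    if np.linalg.cond(mat) < cond_limit:
        return np.linalg.inv(mat)
    return np.linalg.pinv(mat)
```

With the posted arrays loaded as `mass_matrix` and `J_full`, calling `opspace_inertia_diagnostics(mass_matrix, J_full)` gives a condition number that indicates whether the configuration is near a kinematic singularity; the same check can be repeated with `J_pos` and `J_ori`.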

yukezhu commented 3 years ago

@hermanjakobsen thanks for helping us with this issue. Fixes will be merged to master soon.

ARISE-Initiative / robosuite

Numpy Linalg Error when training model with PPO algorithm #136