cbfinn / gps

Guided Policy Search
http://rll.berkeley.edu/gps/
Other

Crash in the second iteration #19

Open zhudelong opened 8 years ago

zhudelong commented 8 years ago

Hi Finn, thank you for your excellent work; it is a really exciting innovation. All the demos work well except the last one. When running "python python/gps/gps_main.py pr2_badmm_example"

it reports errors like this:

I0430 02:03:56.217406   978 solver.cpp:408]     Test net output #5: InnerProduct3 = 0
I0430 02:03:56.217413   978 solver.cpp:408]     Test net output #6: InnerProduct3 = 0
Exception in thread Thread-13:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "python/gps/gps_main.py", line 366, in <lambda>
    target=lambda: gps.run(itr_load=resume_training_itr)
  File "python/gps/gps_main.py", line 69, in run
    self._log_data(itr, traj_sample_lists, pol_sample_lists)
  File "python/gps/gps_main.py", line 240, in _log_data
    copy.copy(self.algorithm)
  File "python/gps/utility/data_logger.py", line 25, in pickle
    pickle.dump(data, open(filename, 'wb'))
  File "/usr/lib/python2.7/copy_reg.py", line 84, in _reduce_ex
    dict = getstate()
  File "python/gps/algorithm/policy_opt/policy_opt_caffe.py", line 233, in __getstate__
    self.solver.snapshot()
AttributeError: 'AdamSolver' object has no attribute 'snapshot'

Also, when I run "python python/gps/gps_main.py pr2_example", it sometimes reports the following error:

LinAlgError: 2-th leading minor not positive definite ... 
    raise LinAlgError("%d-th leading minor not positive definite" % info)
LinAlgError: 2-th leading minor not positive definite

Do you have any idea about these two problems? Looking forward to your answers. Thank you very much.

cbfinn commented 8 years ago

Regarding the first error, make sure you have the latest version of caffe (i.e. this line of code should exist)

Regarding the second error, can you be more specific? How often and when does it appear? I may have time to look into it this weekend.
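A quick way to tell whether the installed pycaffe build is recent enough is to check for the method named in the traceback before anything tries to pickle the policy. The following is only a minimal sketch: `supports_snapshot` and the two stand-in classes are hypothetical names made up for illustration, and in practice the object to test would be the solver that policy_opt_caffe.py holds in `self.solver`.

```python
def supports_snapshot(solver):
    """Return True if the solver object exposes a callable `snapshot` method."""
    return callable(getattr(solver, "snapshot", None))


# Hypothetical stand-ins for solver objects from an older and a newer build.
class OldSolver(object):
    pass


class NewSolver(object):
    def snapshot(self):
        pass


print(supports_snapshot(OldSolver()))  # False
print(supports_snapshot(NewSolver()))  # True
```

If the check returns False for the real solver, the AttributeError above is expected and rebuilding against a newer Caffe is the likely fix.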

zhudelong commented 8 years ago

Thank you so much. I have tried the newest caffe but there are some errors; I will figure them out. As for the second problem, I find that it occurs the first time I run "python python/gps/gps_main.py pr2_badmm_example" and disappears in the following runs. It looks like this:

I0501 12:58:00.010692 21648 net.cpp:228] DummyData1 does not need backward computation.
I0501 12:58:00.010694 21648 net.cpp:270] This network produces output InnerProduct3
I0501 12:58:00.010700 21648 net.cpp:283] Network initialization done.
I0501 12:58:00.010727 21648 solver.cpp:59] Solver scaffolding done.
Exception in thread Thread-13:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "python/gps/gps_main.py", line 366, in <lambda>
    target=lambda: gps.run(itr_load=resume_training_itr)
  File "python/gps/gps_main.py", line 67, in run
    self._take_iteration(itr, traj_sample_lists)
  File "python/gps/gps_main.py", line 195, in _take_iteration
    self.algorithm.iteration(sample_lists)
  File "python/gps/algorithm/algorithm_badmm.py", line 48, in iteration
    self._update_dynamics()  # Update dynamics model using all sample.
  File "python/gps/algorithm/algorithm.py", line 84, in _update_dynamics
    self.cur[cond].traj_info.dynamics.update_prior(cur_data)
  File "python/gps/algorithm/dynamics/dynamics_lr_prior.py", line 21, in update_prior
    self.prior.update(X, U)
  File "python/gps/algorithm/dynamics/dynamics_prior_gmm.py", line 98, in update
    self.gmm.update(xux, K)
  File "python/gps/utility/gmm.py", line 174, in update
    logobs = self.estep(data)
  File "python/gps/utility/gmm.py", line 75, in estep
    check_finite=False)
  File "/usr/lib/python2.7/dist-packages/scipy/linalg/decomp_cholesky.py", line 81, in cholesky
    check_finite=check_finite)
  File "/usr/lib/python2.7/dist-packages/scipy/linalg/decomp_cholesky.py", line 30, in _cholesky
    raise LinAlgError("%d-th leading minor not positive definite" % info)
LinAlgError: 27-th leading minor not positive definite

Rahtron3030 commented 7 years ago

Hi Chelsea, I get the same error when running the pr2_example_badmm experiment:

  File "/usr/lib/python2.7/dist-packages/scipy/linalg/decomp_cholesky.py", line 30, in _cholesky
    raise LinAlgError("%d-th leading minor not positive definite" % info)
LinAlgError: 27-th leading minor not positive definite

cbfinn commented 7 years ago

I think there is a bug somewhere in the pr2 controller that causes this error on the very first experiment that is run after launching the pr2 plugin. For example, it could be caused by the sample data at the first time step being uninitialized.

After the first run, I don't think that the error will come up. Let me know if this isn't the case for you.

I currently don't have time to investigate the issue personally, but I will post any updates that I hear on this thread.
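For anyone puzzled by the message itself: a Cholesky factorization proceeds one pivot at a time, and "k-th leading minor not positive definite" means the k-th pivot came out non-positive, so the top-left k-by-k block of the matrix is not positive definite. The following is a textbook pure-Python sketch of that idea, not scipy's implementation (scipy calls LAPACK); `cholesky_lower` is a hypothetical helper name.

```python
import math


def cholesky_lower(a):
    """Return lower-triangular L with L * L^T == a, for a symmetric matrix
    given as a list of rows.

    Raises ValueError naming the first leading minor that is not positive
    definite.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Pivot for column j: diagonal entry minus what earlier columns explain.
        d = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        # A NaN pivot also fails this comparison, so it is rejected here too.
        if not d > 0.0:
            raise ValueError("%d-th leading minor not positive definite" % (j + 1))
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (a[i][j] - s) / L[j][j]
    return L


try:
    cholesky_lower([[1.0, 2.0], [2.0, 1.0]])
except ValueError as err:
    print(err)  # 2-th leading minor not positive definite
```

A covariance estimated from samples that are all identical (for example, all zeros) is singular, so it fails at the first pivot in the same way. That would be consistent with uninitialized sample data, though it does not confirm that this is the cause here.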

robotsorcerer commented 7 years ago

I get a related error when I run pr2_badmm_example. It gives

Exception in thread Thread-8:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "python/gps/gps_main.py", line 398, in <lambda>
    target=lambda: gps.run(itr_load=resume_training_itr)
  File "python/gps/gps_main.py", line 66, in run
    self._take_sample(itr, cond, i)
  File "python/gps/gps_main.py", line 184, in _take_sample
    verbose=(i < self._hyperparams['verbose_trials'])
  File "python/gps/agent/ros/agent_ros.py", line 156, in sample
    self.reset(condition)
  File "python/gps/agent/ros/agent_ros.py", line 135, in reset
    condition_data[TRIAL_ARM]['data'])
  File "python/gps/agent/ros/agent_ros.py", line 124, in reset_arm
    self._reset_service.publish_and_wait(reset_command, timeout=timeout)
  File "python/gps/agent/ros/ros_utils.py", line 146, in publish_and_wait
    raise TimeoutException(time_waited)
TimeoutException: ('Timed out after %f seconds', 20.000000000000327)

cbfinn commented 7 years ago

Yes, this issue is because of the controller and is not algorithm specific.

After the first run, I don't think that the error will come up. Let me know if this isn't the case for you.

eaa3 commented 7 years ago

I am also having similar errors on the first experiment.

  File "python/gps/utility/gmm.py", line 63, in estep
    L = scipy.linalg.cholesky(sigma, lower=True)
  File "/home/ermanoarruda/.virtualenvs/robotics/local/lib/python2.7/site-packages/scipy/linalg/decomp_cholesky.py", line 81, in cholesky
    check_finite=check_finite)
  File "/home/ermanoarruda/.virtualenvs/robotics/local/lib/python2.7/site-packages/scipy/linalg/decomp_cholesky.py", line 20, in _cholesky
    a1 = asarray_chkfinite(a)
  File "/home/ermanoarruda/.virtualenvs/robotics/local/lib/python2.7/site-packages/numpy/lib/function_base.py", line 1033, in asarray_chkfinite
    "array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs

However, it works fine when running the pr2_example_badmm experiment for the second time (and onwards). The problem does indeed seem to be related to the initialisation of the first sample.
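One way to narrow this down would be to scan the sample data for NaN or inf entries before it reaches the GMM update, so the bad time step is reported directly instead of surfacing inside the Cholesky call. This is just a sketch under assumptions: `find_nonfinite` is a hypothetical helper, not part of the gps code base, and it treats the data as a plain list of rows (one row per time step).

```python
import math


def find_nonfinite(rows):
    """Return (row, column) of the first NaN or inf entry, or None if every
    entry is finite."""
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if not math.isfinite(value):
                return (r, c)
    return None


# Made-up data: the first time step contains an uninitialized (NaN) value.
samples = [[float("nan"), 0.5], [0.1, 0.2]]
print(find_nonfinite(samples))  # (0, 0)
```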

At some point I was also getting the timeout @lakehanne referred to, but that was because I had not built gps_agent_pkg with the additional caffe flags required for running pr2_example_badmm.