benelot / pybullet-gym

Open-source implementations of OpenAI Gym MuJoCo environments for use with the OpenAI Gym Reinforcement Learning Research Platform.
https://pybullet.org/

HumanoidBulletEnv-v0 crashing in multiple RL frameworks #10

Open · ycps opened this issue 6 years ago

ycps commented 6 years ago

Hello. I have been trying to train an agent in HumanoidBulletEnv-v0. I have tried multiple frameworks and algorithms, but have not been able to obtain a good policy in this particular environment, and I faced similar issues with the RoboschoolHumanoid-v1 environment. I chose to open this issue because most frameworks crashed at some point while training these two Humanoid environments, whereas simpler environments (such as Hopper and Walker) could be learned without issues.

Most crashes seem to be caused by reward == -inf, KL divergence == NaN, or similar numerical errors. Below is a brief summary of my experiments with both HumanoidBulletEnv-v0 and RoboschoolHumanoid-v1:

| env | framework | algo | result |
| --- | --- | --- | --- |
| RoboschoolHumanoid-v1 | [anyrl-py](https://github.com/unixpickle/anyrl-py) | PPO | Error: act==nan |
| RoboschoolHumanoid-v1 | [TensorForce](https://github.com/reinforceio/tensorforce) | PPO | low reward stall |
| RoboschoolHumanoid-v1 | [Coach](https://github.com/NervanaSystems/coach) | PPO | Error: kl==nan |
| RoboschoolHumanoid-v1 | [Coach](https://github.com/NervanaSystems/coach) | A3C | low reward stall |
| RoboschoolHumanoid-v1 | [Baselines](https://github.com/openai/baselines) | PPO | Error: rew==-inf and kl=nan |
| HumanoidBulletEnv-v0 | [TensorForce](https://github.com/reinforceio/tensorforce) | PPO | low reward stall |
| HumanoidBulletEnv-v0 | [Coach](https://github.com/NervanaSystems/coach) | PPO | Error: kl==nan |
| HumanoidBulletEnv-v0 | [Coach](https://github.com/NervanaSystems/coach) | DDPG | Error: assert(np.isfinite(a).all()) |
| HumanoidBulletEnv-v0 | [TensorFlow Agents](https://github.com/tensorflow/agents) | PPO | Error: invalid fastbin entry |

If you're interested, I can also paste each individual traceback here. Also, I'm not sure whether this issue should instead be opened in the Bullet repo (https://github.com/bulletphysics/bullet3).
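
A small wrapper like the one below could help narrow down whether the non-finite values first appear in the policy's actions, the reward, or the observation. This is just a rough sketch (the class name is mine, not part of any of the frameworks above), using the old 4-tuple gym step API:

```python
import gym
import numpy as np


class FiniteCheckWrapper(gym.Wrapper):
    """Report the first non-finite action, reward or observation (diagnostic sketch)."""

    def step(self, action):
        if not np.all(np.isfinite(action)):
            print("non-finite action from the policy:", action)
        obs, reward, done, info = self.env.step(action)
        if not np.isfinite(reward):
            print("non-finite reward from the env:", reward)
        if not np.all(np.isfinite(obs)):
            print("non-finite values in the observation:", obs)
        return obs, reward, done, info


# usage: env = FiniteCheckWrapper(gym.make("HumanoidBulletEnv-v0"))
```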

benelot commented 6 years ago

Hi! Thanks for the comprehensive insight into this problem! I am basically the maintainer of the BulletEnvs, so this is the right place to come with these kinds of problems.

I would be interested in the tracebacks in case you have them.


ycps commented 6 years ago

Thanks for the quick response. I will paste the tracebacks here and include some of the output leading up to the errors as well. Trimmed output is marked with […]:

RoboschoolHumanoid-v1 | anyrl-py | PPO

The traceback can be seen in this anyrl-py issue: https://github.com/unixpickle/anyrl-py/issues/33

RoboschoolHumanoid-v1 | Coach | PPO

[…]
Training - Worker: 0 Episode: 155992 total reward: -22.269424567876037 steps: 7411139 training iteration: 3571
Training - Worker: 0 Episode: 155993 total reward: 6.494044705819199 steps: 7411197 training iteration: 3571
Policy training - Surrogate loss: 14.117033004760742 KL divergence: nan Entropy: nan training epoch: 0 learning_rate: 0.0001                                 
Policy training - Surrogate loss: 8.368938446044922 KL divergence: nan Entropy: nan training epoch: 1 learning_rate: 0.0001                                  
Policy training - Surrogate loss: 6.048489570617676 KL divergence: nan Entropy: nan training epoch: 2 learning_rate: 0.0001                                  
Policy training - Surrogate loss: 4.809088230133057 KL divergence: nan Entropy: nan training epoch: 3 learning_rate: 0.0001                                  
Policy training - Surrogate loss: 3.972224712371826 KL divergence: nan Entropy: nan training epoch: 4 learning_rate: 0.0001                                  
Policy training - Surrogate loss: 3.4192159175872803 KL divergence: nan Entropy: nan training epoch: 5 learning_rate: 0.0001                                 
Policy training - Surrogate loss: 2.9966487884521484 KL divergence: nan Entropy: nan training epoch: 6 learning_rate: 0.0001                                 
Policy training - Surrogate loss: 2.6894009113311768 KL divergence: nan Entropy: nan training epoch: 7 learning_rate: 0.0001                                 
Policy training - Surrogate loss: 2.403937339782715 KL divergence: nan Entropy: nan training epoch: 8 learning_rate: 0.0001                                  
Policy training - Surrogate loss: 2.193875789642334 KL divergence: nan Entropy: nan training epoch: 9 learning_rate: 0.0001                                  
/home/user/py/env/lib/python3.6/site-packages/numpy/core/_methods.py:26: RuntimeWarning: invalid value encountered in reduce        
  return umr_maximum(a, axis, None, out, keepdims)                                                                                                           
/home/user/py/env/lib/python3.6/site-packages/numpy/core/_methods.py:29: RuntimeWarning: invalid value encountered in reduce        
  return umr_minimum(a, axis, None, out, keepdims)                                                                                                           
Traceback (most recent call last):                                                                                                                           
  File "/home/user/py/env/bin/coach", line 11, in <module>                                                                        
    sys.exit(main())                                                                                                                                         
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/coach.py", line 303, in main                                         
    agent.improve()                                                                                                                                          
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/agents/agent.py", line 511, in improve                               
    self.act()                                                                                                                                               
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/agents/agent.py", line 365, in act                                   
    result = self.env.step(action)                                                                                                                           
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/environments/environment_wrapper.py", line 125, in step              
    self._take_action(action_idx)                                                                                                                            
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/environments/gym_environment_wrapper.py", line 125, in _take_action  
    self.observation, self.reward, self.done, self.info = self.env.step(action)                                                                              
  File "/home/user/py/env/lib/python3.6/site-packages/gym/core.py", line 96, in step                                                
    return self._step(action)                                                                                                                                
  File "/home/user/py/env/lib/python3.6/site-packages/gym/wrappers/time_limit.py", line 36, in _step                                
    observation, reward, done, info = self.env.step(action)                                                                                                  
  File "/home/user/py/env/lib/python3.6/site-packages/gym/core.py", line 96, in step                                                
    return self._step(action)                                                                                                                                
  File "/home/user/roboschool/gym_forward_walker.py", line 91, in _step                                                                 
    self.apply_action(a)                                                                                                                                     
  File "/home/user/roboschool/gym_mujoco_walkers.py", line 110, in apply_action                                                         
    assert( np.isfinite(a).all() )                                                                                                                           
AssertionError

RoboschoolHumanoid-v1 | Baselines | PPO

[…]
-------------------------------------                                                                                                                         
| approxkl           | 0.002441364  |                                                                                                                         
| clipfrac           | 0.021044921  |                                                                                                                         
| eplenmean          | 19.3         |                                                                                                                         
| eprewmean          | -6.2e+37     |                                                                                                                         
| explained_variance | 0.865        |                                                                                                                         
| fps                | 223          |                                                                                                                         
| nupdates           | 2156         |                                                                                                                         
| policy_entropy     | 746.041      |                                                                                                                         
| policy_loss        | 0.0063667977 |                                                                                                                         
| serial_timesteps   | 4415488      |
| time_elapsed       | 2.03e+04     |
| total_timesteps    | 4415488      |
| value_loss         | 0.017560527  |
-------------------------------------
-------------------------------------
| approxkl           | 0.0019486428 |
| clipfrac           | 0.015429688  |
| eplenmean          | 19.4         |
| eprewmean          | -6.45e+37    |
| explained_variance | 0.858        |
| fps                | 223          |
| nupdates           | 2157         |
| policy_entropy     | 746.26373    |
| policy_loss        | 0.0035777562 |
| serial_timesteps   | 4417536      |
| time_elapsed       | 2.03e+04     |
| total_timesteps    | 4417536      |
| value_loss         | 0.019647947  |
-------------------------------------
/home/user/py/env/lib/python3.6/site-packages/numpy/core/_methods.py:70: RuntimeWarning: overflow encountered in reduce
  ret = umr_sum(arr, axis, dtype, out, keepdims)
/home/user/py/env/lib/python3.6/site-packages/numpy/core/_methods.py:112: RuntimeWarning: invalid value encountered in subtract
  x = asanyarray(arr - arrmean)
/home/user/py/baselines/common/running_mean_std.py:23: RuntimeWarning: invalid value encountered in double_scalars
  delta = batch_mean - mean
---------------------------------
| approxkl           | nan      |
| clipfrac           | 0.0      |
| eplenmean          | 19.4     |
| eprewmean          | -inf     |
| explained_variance | nan      |
| fps                | 214      |
| nupdates           | 2158     |
| policy_entropy     | nan      |
| policy_loss        | nan      |
| serial_timesteps   | 4419584  |
| time_elapsed       | 2.03e+04 |
| total_timesteps    | 4419584  |
| value_loss         | nan      |
---------------------------------
Traceback (most recent call last):
  File "/home/user/py/baselines/run.py", line 241, in <module>
    main()
  File "/home/user/py/baselines/run.py", line 218, in main
    model, _ = train(args, extra_args)
  File "/home/user/py/baselines/run.py", line 76, in train
    **alg_kwargs
  File "/home/user/py/baselines/ppo2/ppo2.py", line 245, in learn
    obs, returns, masks, actions, values, neglogpacs, states, epinfos = runner.run() #pylint: disable=E0632
  File "/home/user/py/baselines/ppo2/ppo2.py", line 110, in run
    self.obs[:], rewards, self.dones, infos = self.env.step(actions)
  File "/home/user/py/baselines/common/vec_env/__init__.py", line 98, in step
    return self.step_wait()
  File "/home/user/py/baselines/common/vec_env/vec_normalize.py", line 23, in step_wait
    obs, rews, news, infos = self.venv.step_wait()
  File "/home/user/py/baselines/common/vec_env/dummy_vec_env.py", line 40, in step_wait
    obs, self.buf_rews[e], self.buf_dones[e], self.buf_infos[e] = self.envs[e].step(action)
  File "/home/user/py/baselines/bench/monitor.py", line 60, in step
    ob, rew, done, info = self.env.step(action)
  File "/home/user/py/env/lib/python3.6/site-packages/gym/wrappers/time_limit.py", line 31, in step
    observation, reward, done, info = self.env.step(action)
  File "/home/user/roboschool/gym_forward_walker.py", line 91, in _step
    self.apply_action(a)
  File "/home/user/roboschool/gym_mujoco_walkers.py", line 110, in apply_action
    assert( np.isfinite(a).all() )
AssertionError

HumanoidBulletEnv-v0 | Coach | PPO

[…]
Policy training - Surrogate loss: 8.198407173156738 KL divergence: nan Entropy: nan training epoch: 3 learning_rate: 0.0001
Policy training - Surrogate loss: 7.3827972412109375 KL divergence: nan Entropy: nan training epoch: 4 learning_rate: 0.0001
Policy training - Surrogate loss: 7.29280948638916 KL divergence: nan Entropy: nan training epoch: 4 learning_rate: 0.0001
Policy training - Surrogate loss: 6.983189582824707 KL divergence: nan Entropy: nan training epoch: 5 learning_rate: 0.0001
Policy training - Surrogate loss: 6.702338695526123 KL divergence: nan Entropy: nan training epoch: 5 learning_rate: 0.0001
Policy training - Surrogate loss: 6.248958587646484 KL divergence: nan Entropy: nan training epoch: 6 learning_rate: 0.0001
Policy training - Surrogate loss: 6.6143364906311035 KL divergence: nan Entropy: nan training epoch: 6 learning_rate: 0.0001
Policy training - Surrogate loss: 6.0395827293396 KL divergence: nan Entropy: nan training epoch: 7 learning_rate: 0.0001
Policy training - Surrogate loss: 6.089345932006836 KL divergence: nan Entropy: nan training epoch: 7 learning_rate: 0.0001
Policy training - Surrogate loss: 5.8140950202941895 KL divergence: nan Entropy: nan training epoch: 8 learning_rate: 0.0001
Policy training - Surrogate loss: 5.688065528869629 KL divergence: nan Entropy: nan training epoch: 8 learning_rate: 0.0001
Policy training - Surrogate loss: 5.387387752532959 KL divergence: nan Entropy: nan training epoch: 9 learning_rate: 0.0001
Policy training - Surrogate loss: 5.554342269897461 KL divergence: nan Entropy: nan training epoch: 9 learning_rate: 0.0001
/home/user/py/env/lib/python3.6/site-packages/numpy/core/_methods.py:26: RuntimeWarning: invalid value encountered in reduce
  return umr_maximum(a, axis, None, out, keepdims)
/home/user/py/env/lib/python3.6/site-packages/numpy/core/_methods.py:29: RuntimeWarning: invalid value encountered in reduce
  return umr_minimum(a, axis, None, out, keepdims)
Traceback (most recent call last):
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/parallel_actor.py", line 160, in <module>
    agent.improve()
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/agents/agent.py", line 511, in improve
    self.act()
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/agents/agent.py", line 365, in act
    result = self.env.step(action)
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/environments/environment_wrapper.py", line 125, in step
    self._take_action(action_idx)
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/environments/gym_environment_wrapper.py", line 125, in _take_action
    self.observation, self.reward, self.done, self.info = self.env.step(action)
  File "/home/user/py/env/lib/python3.6/site-packages/gym/wrappers/time_limit.py", line 31, in step
    observation, reward, done, info = self.env.step(action)
  File "/home/user/py/env/lib/python3.6/site-packages/pybullet_envs/gym_locomotion_envs.py", line 61, in _step
    self.robot.apply_action(a)
  File "/home/user/py/env/lib/python3.6/site-packages/pybullet_envs/robot_locomotors.py", line 185, in apply_action
    assert( np.isfinite(a).all() )
AssertionError

HumanoidBulletEnv-v0 | Coach | DDPG

(note that this one contains interleaved output from multiple workers)

[…]
Training - Worker: 0 Episode: 6545 total reward: -23.926944604927854 steps: 348350 training iteration: 348249                                                                   
Training - Worker: 1 Episode: 8249 total reward: -32.229175614131975 steps: 441105 training iteration: 441004                                                                   
Testing - Worker: 2 Episode: 10064 total reward: -31.744497821379635 steps: 563254 training iteration: 0                                                                        
Training - Worker: 0 Episode: 6546 total reward: -21.75884747952223 steps: 348381 training iteration: 348280                                                                    
Testing - Worker: 2 Episode: 10065 total reward: -33.1433626033453 steps: 563277 training iteration: 0                                                                          
Training - Worker: 1 Episode: 8250 total reward: -18.0238859220568 steps: 441137 training iteration: 441036                                                                     
Testing - Worker: 2 Episode: 10066 total reward: -10.2950390630489 steps: 563315 training iteration: 0                                                                          
Training - Worker: 0 Episode: 6547 total reward: -25.60695221499191 steps: 348409 training iteration: 348308                                                                    
Training - Worker: 1 Episode: 8251 total reward: -18.451235280264513 steps: 441171 training iteration: 441070                                                                   
Traceback (most recent call last):                                                                                                                                              
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/parallel_actor.py", line 158, in <module>                                               
Traceback (most recent call last):                                                                                                                                              
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/parallel_actor.py", line 160, in <module>                                               
    agent.evaluate(sys.maxsize, keep_networks_synced=True)  # evaluate forever                                                                                                  
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/agents/agent.py", line 434, in evaluate                                                 
    agent.improve()                                                                                                                                                             
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/agents/agent.py", line 515, in improve                                                  
    self.act()                                                                                                                                                                  
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/agents/agent.py", line 365, in act                                                      
    episode_ended = self.act(phase=RunPhase.TEST)                                                                                                                               
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/agents/agent.py", line 365, in act                                                      
Traceback (most recent call last):                                                                                                                                              
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/parallel_actor.py", line 160, in <module>                                               
    result = self.env.step(action)                                                                                                                                              
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/environments/environment_wrapper.py", line 125, in step                                 
    result = self.env.step(action)
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/environments/environment_wrapper.py", line 125, in step
    agent.improve()
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/agents/agent.py", line 515, in improve
    self._take_action(action_idx)
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/environments/gym_environment_wrapper.py", line 125, in _take_action
    self._take_action(action_idx)
    self.act()
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/environments/gym_environment_wrapper.py", line 125, in _take_action
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/agents/agent.py", line 365, in act
    self.observation, self.reward, self.done, self.info = self.env.step(action)
  File "/home/user/py/env/lib/python3.6/site-packages/gym/wrappers/time_limit.py", line 31, in step
    self.observation, self.reward, self.done, self.info = self.env.step(action)
  File "/home/user/py/env/lib/python3.6/site-packages/gym/wrappers/time_limit.py", line 31, in step
    result = self.env.step(action)
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/environments/environment_wrapper.py", line 125, in step
    self._take_action(action_idx)
  File "/home/user/py/env/lib/python3.6/site-packages/rl_coach/environments/gym_environment_wrapper.py", line 125, in _take_action
    observation, reward, done, info = self.env.step(action)
  File "/home/user/py/env/lib/python3.6/site-packages/pybullet_envs/gym_locomotion_envs.py", line 61, in _step
    observation, reward, done, info = self.env.step(action)
  File "/home/user/py/env/lib/python3.6/site-packages/pybullet_envs/gym_locomotion_envs.py", line 61, in _step
    self.observation, self.reward, self.done, self.info = self.env.step(action)
  File "/home/user/py/env/lib/python3.6/site-packages/gym/wrappers/time_limit.py", line 31, in step
    observation, reward, done, info = self.env.step(action)
  File "/home/user/py/env/lib/python3.6/site-packages/pybullet_envs/gym_locomotion_envs.py", line 61, in _step
    self.robot.apply_action(a)
    self.robot.apply_action(a)
  File "/home/user/py/env/lib/python3.6/site-packages/pybullet_envs/robot_locomotors.py", line 185, in apply_action
  File "/home/user/py/env/lib/python3.6/site-packages/pybullet_envs/robot_locomotors.py", line 185, in apply_action
    self.robot.apply_action(a)
  File "/home/user/py/env/lib/python3.6/site-packages/pybullet_envs/robot_locomotors.py", line 185, in apply_action
    assert( np.isfinite(a).all() )
    assert( np.isfinite(a).all() )
AssertionError
AssertionError
    assert( np.isfinite(a).all() )
AssertionError

HumanoidBulletEnv-v0 | TensorFlow Agents | PPO

[…]
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.                                                                              
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.                                                                              
*** Error in `/home/user/py/env/bin/python3.6m': invalid fastbin entry (free): 0x00007fe514001180 ***                                                
======= Backtrace: =========                                                                                                                                                    
/lib64/libc.so.6(+0x7cfe1)[0x7feb59987fe1]                                                                                                                                      
/home/user/py/env/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so(+0x5e71ab)[0x7fe8cfb1a1ab]                               
/home/user/py/env/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN5Eigen26NonBlockingThreadPoolTemplIN10tensorflow6threa
d16EigenEnvironmentEE10WorkerLoopEi+0x23d)[0x7fe8cfb8a58d]
/home/user/py/env/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16Eigen
Environment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x32)[0x7fe8cfb89612]
/softwares/gcc/gcc-7.2/lib64/libstdc++.so.6(+0xba050)[0x7fe8cf26c050]
/lib64/libpthread.so.0(+0x7dc5)[0x7feb59cd3dc5]
/lib64/libc.so.6(clone+0x6d)[0x7feb59a0121d]
======= Memory map: ========
7fe380000000-7fe380021000 rw-p 00000000 00:00 0
7fe380021000-7fe384000000 ---p 00000000 00:00 0
7fe398000000-7fe398021000 rw-p 00000000 00:00 0
[…]
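
For what it's worth, the assertion failures above all trip on assert(np.isfinite(a).all()) in apply_action, i.e. the policy is already emitting non-finite actions by that point. A possible stopgap (it only hides the symptom of a diverging policy, it does not fix it) would be an action-sanitizing wrapper along these lines; this is a hypothetical sketch, not part of any of the frameworks used above:

```python
import gym
import numpy as np


class SanitizeActionWrapper(gym.ActionWrapper):
    """Replace NaN/inf actions with finite values and clip to the action space (sketch)."""

    def action(self, action):
        # NaN -> 0, +/-inf -> large finite values, then clip to the Box bounds
        action = np.nan_to_num(np.asarray(action, dtype=np.float64))
        return np.clip(action, self.action_space.low, self.action_space.high)


# usage: env = SanitizeActionWrapper(gym.make("HumanoidBulletEnv-v0"))
```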
benelot commented 6 years ago

I just looked into the issue and unfortunately I cannot tell you what the problem is. Currently I do not have the capacity to investigate it further. If you find out anything about it, please let me know.

The only thing I see is that you did not train on my implementations, but on the ones that are included in pybullet itself. My environments are called *PyBulletEnv-v0, so HumanoidPyBulletEnv-v0 would be the one to train on. If you have some capacity to try that as well, it would be very helpful.
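
For reference, the environments from this repo are registered by an extra import before gym.make; something along these lines should work (env IDs as listed in the repo's README, old 4-tuple gym API assumed):

```python
import gym
import pybulletgym  # registers the *PyBulletEnv-v0 and *MuJoCoEnv-v0 environments from this repo

env = gym.make("HumanoidPyBulletEnv-v0")
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
```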

bibbygoodwin commented 5 years ago

> The only thing I see is that you did not train on my implementations, but on the ones that are included in pybullet itself. My environments are called *PyBulletEnv-v0.

Hi. Could you possibly elaborate on the differences between the locomotion environments that come with pybullet itself (available via import pybullet_envs once pybullet is installed) and your environments? I believe the pybullet versions are based on the original Roboschool (rather than MuJoCo) environments, and, aside from code refactoring, they appear to have the same implementation as your *PyBulletEnv-v0 envs. Is that right?

Are there actually differences, and would you say that your implementations are more faithful to Roboschool? Thank you!

benelot commented 5 years ago

Hi Walter

I will quickly clear up the situation: I started reimplementing the MuJoCo envs some time ago, but found it to be very hard, since the underlying engine is difficult to mimic exactly. Then OpenAI built the Roboschool versions of the MuJoCo envs, but seemingly with no intent to make them similar to the original MuJoCo envs; if you look at the state representations, you will see that they differ considerably. They also forked the Bullet engine and applied some minor hacks to the code, and the original Roboschool contains some C++ code to make things faster, but with appropriate numpy implementations I could make the difference marginal. Once I found that I could make the Roboschool implementations work without their Bullet hack and C++, Erwin from pybullet got interested and accepted my contribution. I then took my implementations and tried again to rebuild the MuJoCo envs, this time with more success. Therefore, my repo contains reimplementations of both the Roboschool envs and the MuJoCo envs.

OpenAI's Roboschool envs, the implementations in pybullet, and my Roboschool reimplementations should perform exactly the same, though they have internal implementation differences. I have not checked recently, and I saw some pushes being made to the code, so I am not sure. The pybullet implementations and mine should both stay equally faithful to the original OpenAI Roboschool versions, although Roboschool's forked Bullet engine will eventually become very outdated.

My MuJoCo envs have the same state representation as the original MuJoCo envs, but trained agents cannot always be transferred between them (see the table in the repo's README).
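
A quick way to see the difference in state representation is to compare the spaces side by side; a minimal sketch, assuming the env IDs HumanoidPyBulletEnv-v0 (Roboschool-style) and HumanoidMuJoCoEnv-v0 (MuJoCo-style) as listed in the README:

```python
import gym
import pybulletgym  # registers both env families from this repo

for env_id in ("HumanoidPyBulletEnv-v0", "HumanoidMuJoCoEnv-v0"):
    env = gym.make(env_id)
    print(env_id, "obs:", env.observation_space.shape, "act:", env.action_space.shape)
    env.close()
```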

Now quickly on what I am trying to do with my repo:

- [COMPATIBILITY] I want to make my MuJoCo envs comparable to the original OpenAI MuJoCo envs so that trained agents can be transferred (freeing people from MuJoCo licenses, my ultimate goal).
- [COMPOSABILITY] I am trying to make the robot body, the actual environment, and the fitness function separable to allow easy composability (2 agents with 3 humanoids each on a football field, declared as just a list of agents, 3 instances of the humanoid body robot, and the football-field env, would be nice).

If you want to help me with any of this, or if I should review your paper, just tell me. I am currently on holiday, but I will be available again soon. Let me know if anything is unclear; I am happy to collaborate.

Cheers Ben
