The reward shaping before: in the first stage (gripper gets closer to the object) the max reward is 1; in the second stage (placing the object at the goal position) the max reward is 2; and finally a reward of 100 is given on success.
With a final reward of 100 and the training example's discount factor gamma = 0.99, the agent finds that playing around near the goal yields a higher discounted return:
100 < 2 + 100*0.99 < 2 + 2*0.99 + 100*0.99^2 < ...
So I've changed the final reward to 200 and gamma to 0.98, so that succeeding immediately is worth more than playing around near the goal (200 > 2 + 200*0.98 = 198). The result is much better: the agent can successfully place the object at the goal in 40 steps on average.
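The comparison of discounted returns can be checked numerically; a minimal sketch (not part of the training code):

```python
def discounted_return(rewards, gamma):
    """Discounted return of a reward sequence starting at t=0."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# Old shaping: final reward 100, gamma 0.99.
# Collecting the stage-2 reward (2) for extra steps before succeeding
# beats succeeding immediately:
print(discounted_return([100], 0.99))         # succeed now
print(discounted_return([2, 100], 0.99))      # loiter 1 step, then succeed
print(discounted_return([2, 2, 100], 0.99))   # loiter 2 steps, then succeed

# New shaping: final reward 200, gamma 0.98.
# Now succeeding immediately dominates loitering:
print(discounted_return([200], 0.98))         # succeed now
print(discounted_return([2, 200], 0.98))      # loiter 1 step, then succeed
```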
Other changes include using the action wrapper, which removes 2 redundant action dimensions, and increasing the learning rate from 5e-4 to 1e-3:
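The idea behind the wrapper is to let the policy act in a smaller space and pad the dropped dimensions back before stepping the env. A minimal numpy-only sketch (the actual wrapper lives in the repo; the dimension count and the indices of the redundant dimensions below are placeholders, not the real ones):

```python
import numpy as np

FULL_DIM = 9            # placeholder: full action dimension of the env
REDUNDANT_DIMS = (5, 6) # placeholder: indices of the redundant dimensions

# Indices that the reduced action actually controls.
KEPT = [i for i in range(FULL_DIM) if i not in REDUNDANT_DIMS]

def expand_action(reduced):
    """Map a reduced action back to the full action vector,
    zero-filling the dropped (redundant) dimensions."""
    full = np.zeros(FULL_DIM, dtype=np.float32)
    full[KEPT] = np.asarray(reduced, dtype=np.float32)
    return full
```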
I found that the episode length begins to increase late in training, which means it takes longer to successfully place the object at the goal.
The reason might be the one discussed in https://github.com/HorizonRobotics/SocialRobot/pull/113#discussion_r354649180