The reward shaping before: in the first stage (gripper gets closer to the object) the max reward is 1; in the second stage (placing the object at the goal position) the max reward is 2; and finally a reward of 100 is given on success.
With a final reward of 100 and the training example's discount factor gamma = 0.99, the agent finds that playing around near the goal yields a higher discounted return:
100 < 2 + 100*0.99 < 2 + 2*0.99 + 100*0.99^2 < ...
So I've changed the final reward to 200 and gamma to 0.98, so that succeeding immediately is worth more than playing around near the goal (200 > 2 + 200*0.98 = 198). The result is much better: the agent can successfully place the object at the goal in 40 steps on average.
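The comparison of discounted returns can be checked numerically; a minimal sketch (not part of the training code):

```python
def discounted_return(rewards, gamma):
    """Discounted return of a reward sequence starting at t=0."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# Old shaping: final reward 100, gamma 0.99.
# Collecting the stage-2 reward (2) for extra steps before succeeding
# beats succeeding immediately:
print(discounted_return([100], 0.99))         # succeed now
print(discounted_return([2, 100], 0.99))      # loiter 1 step, then succeed
print(discounted_return([2, 2, 100], 0.99))   # loiter 2 steps, then succeed

# New shaping: final reward 200, gamma 0.98.
# Now succeeding immediately dominates loitering:
print(discounted_return([200], 0.98))         # succeed now
print(discounted_return([2, 200], 0.98))      # loiter 1 step, then succeed
```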
Other changes include using the action wrapper, which removes 2 redundant action dimensions, and increasing the learning rate from 5e-4 to 1e-3:
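The idea behind the wrapper is to let the policy act in a smaller space and pad the dropped dimensions back before stepping the env. A minimal numpy-only sketch (the actual wrapper lives in the repo; the dimension count and the indices of the redundant dimensions below are placeholders, not the real ones):

```python
import numpy as np

FULL_DIM = 9            # placeholder: full action dimension of the env
REDUNDANT_DIMS = (5, 6) # placeholder: indices of the redundant dimensions

# Indices that the reduced action actually controls.
KEPT = [i for i in range(FULL_DIM) if i not in REDUNDANT_DIMS]

def expand_action(reduced):
    """Map a reduced action back to the full action vector,
    zero-filling the dropped (redundant) dimensions."""
    full = np.zeros(FULL_DIM, dtype=np.float32)
    full[KEPT] = np.asarray(reduced, dtype=np.float32)
    return full
```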
I found that the episode length begins to increase late in training, which means it takes longer to successfully place the object at the goal.
The reason might be the one discussed in https://github.com/HorizonRobotics/SocialRobot/pull/113#discussion_r354649180