Closed Jialn closed 4 years ago
Unless it is interrupted by the "action_log_prob had NaN values" error, training takes about 15 hours (on a 4-core i7-6700HQ laptop CPU; the GPU is not used because of an OOM problem) to reach behavior similar to the previous AC example, which takes about 3-4 days. SAC also takes several days and needs a very large replay buffer, so it was removed.
The curve of grocery_goaltask_img_ppo.gin @ 249322c
The curve of previous AC
Updated PPO to use a std that does not depend on the state. It has similar performance but does not have the numerical instability.
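For readers unfamiliar with the idea, here is a minimal sketch of a Gaussian policy whose std does not depend on the state. It is not the code in this PR and not ALF's API: the class name `StateIndependentGaussian` and its parameters are hypothetical, and it uses plain Python only. The mean would still come from the policy network per state, while the log-std is a single free parameter per action dimension.

```python
import math


class StateIndependentGaussian:
    """Diagonal Gaussian whose std is a free parameter, not a network output.

    Hypothetical illustration only; names are not taken from this PR.
    """

    def __init__(self, action_dim, init_log_std=0.0):
        # One log-std per action dimension, shared across all states.
        self.log_std = [init_log_std] * action_dim

    def log_prob(self, mean, action):
        # Sum of per-dimension Gaussian log-densities.
        total = 0.0
        for m, a, ls in zip(mean, action, self.log_std):
            z = (a - m) / math.exp(ls)
            total += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
        return total


dist = StateIndependentGaussian(action_dim=2)
# The mean is what the network would predict for the current state.
lp = dist.log_prob(mean=[0.0, 0.0], action=[0.0, 0.0])
```

Because the log-std is a single slowly changing parameter rather than a per-state network output, it cannot jump to an extreme value for one unusual observation. That is presumably why this variant avoids the NaN in `action_log_prob`, though that explanation is an assumption rather than something verified here.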
Updated the LR; a slightly lower learning rate seems more stable.
This PR is ready to be checked in now. Please approve if there are no other problems. @emailweixu
Main Changes: