clvrai / spirl

Official implementation of "Accelerating Reinforcement Learning with Learned Skill Priors", Pertsch et al., CoRL 2020

The success rate definition of Maze Navigation Env #32

Closed by CeHao1 1 year ago

CeHao1 commented 1 year ago

Hi, I am Ce Hao, and I am trying to reproduce the results of the SPiRL paper with your code.

In Figure 4 of the paper, the success rate on Maze Navigation reaches almost 1 after 1M steps.

However, the wandb logger has no variable called 'success rate', so I presume it is derived from other logged values. My guess at the definition: in each epoch (50 episodes), if at least one reward is > 1, meaning the agent reached the target at least once, the epoch counts as successful. The mean and standard deviation of this success rate are then computed over 3 seeds.

However, the real experiments look different. As Figure 5 shows for SPiRL (Ours), the agent still explores many other places rather than converging to a direct path to the goal. In my reproduction, fewer than 20% of trajectories eventually reach the target.

I want to develop a new algorithm on top of the SPiRL baseline, so could you please explain how the success rate for Maze Navigation is defined? Thanks!

Best, Ce Hao

Folly135 commented 1 year ago

Hi Ce Hao, I have run into the same problem and have not found the definition of the success rate either. Have you solved it?

Thank you very much.

kpertsch commented 1 year ago

Hey, sorry I missed this! In the maze, the agent receives a reward only when it is close to the target; see the reward definition here: https://github.com/kpertsch/d4rl/blob/master/d4rl/pointmaze/maze_model.py#L135

Note that the episode currently does not terminate when the agent reaches the goal, so the agent is incentivized to get there quickly: it can keep collecting rewards for the remainder of the episode by staying close to the goal.

In the paper, an episode counts as a success if the agent reaches the goal at some point during the episode, i.e. if the episode reward is >= 1.
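As a rough illustration of that definition, here is a minimal sketch that turns logged per-step rewards into a success rate. The function names, the list-of-lists input format, and the `threshold` parameter are assumptions made for this example; they are not taken from the SPiRL codebase or its wandb logs.

```python
def episode_success(step_rewards, threshold=1.0):
    """Return True if the summed reward of one episode reaches the threshold.

    `step_rewards` is a hypothetical list of per-step rewards for one episode.
    With a sparse reward that is only given near the goal, a total >= 1 means
    the agent was close to the goal at some point in the episode.
    """
    return sum(step_rewards) >= threshold


def success_rate(episodes, threshold=1.0):
    """Return the fraction of episodes that count as successful."""
    if not episodes:
        raise ValueError("need at least one episode")
    successes = [episode_success(ep, threshold) for ep in episodes]
    return sum(successes) / len(successes)


if __name__ == "__main__":
    # Three made-up episodes: two touch the goal, one never does.
    demo_episodes = [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 1.0]]
    print(success_rate(demo_episodes))
```

Because the episode does not terminate at the goal, the summed reward can exceed 1; the check only asks whether it reached 1 at all.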

I hope this answers your questions!

Folly135 commented 1 year ago

Is the setting for the kitchen env the same? Does "reward" directly represent "completed subgoals"?

Thank you very much!

kpertsch commented 1 year ago

Yes, that's correct!

CeHao1 commented 1 year ago

Thank you, kpertsch. I understand the definition and agree that it is reasonable.

But when I try to reproduce the SPiRL open-loop results on Maze Navigation (Figure 4 in the SPiRL paper), the policy does not always converge. With random seeds 0 and 3, the agent frequently reaches the target, yet the success rate does not get near 100% within 1.5M steps; it actually stays below 10%. With some other random seeds, the success rate exceeds 90%. So the performance is very sensitive to the random seed.
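To make this kind of seed sensitivity visible, per-seed success rates can be summarized by their mean and standard deviation. The sketch below assumes a hypothetical dictionary mapping each seed to its final success rate and uses the population standard deviation; the values in the usage example are placeholders, not measured results.

```python
from statistics import mean, pstdev


def summarize_seeds(per_seed_success):
    """Return (mean, std) of final success rates across seeds.

    `per_seed_success` is a hypothetical mapping {seed: success_rate in [0, 1]}.
    A large std relative to the mean indicates strong seed sensitivity.
    """
    values = list(per_seed_success.values())
    if not values:
        raise ValueError("need at least one seed")
    return mean(values), pstdev(values)


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not experimental results.
    placeholder = {0: 0.0, 1: 1.0}
    avg, std = summarize_seeds(placeholder)
    print(f"mean={avg:.2f} std={std:.2f}")
```

Reporting both numbers, or the individual per-seed curves, avoids hiding runs that never converge behind an average.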

I also ran the experiment in Figure 13, which uses a simpler target. There, the policy converges for all 5 seeds and the agent always reaches the target.

Thanks for the explanation. I also plan to enable fine-tuning of the low-level decoder to allow more skill adaptation and exploration. Thanks a lot!