Clarification on evaluation metric

Hi, thanks for open-sourcing this framework! I'm trying to reproduce the baseline results reported in the RoboHive paper and wanted to ask which exact metric is averaged over 3 seeds in the Franka-expert data runs (scripts here: https://github.com/facebookresearch/agenthive/tree/dev/scripts). Is it the maximum success rate within each run, averaged over the 3 seeds; the maximum over checkpoints of the success rate averaged over the 3 seeds; or something else? The paper doesn't seem to specify how a run's success rate is determined across its many checkpoints. Thanks!
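
To make the two candidates concrete, here is a minimal sketch with made-up numbers. The `success` table, the checkpoint layout, and the function names are all hypothetical illustrations, not values or code from the paper or the repository.

```python
from statistics import mean

# Hypothetical success rates: one row per seed, one column per
# evaluation checkpoint. These numbers are invented for illustration.
success = [
    [0.2, 0.8, 0.5],  # seed 0
    [0.6, 0.4, 0.7],  # seed 1
    [0.3, 0.5, 0.9],  # seed 2
]

def mean_of_max(runs):
    """Take each seed's best checkpoint, then average over seeds."""
    return mean(max(run) for run in runs)

def max_of_mean(runs):
    """Average over seeds at each checkpoint, then take the best checkpoint."""
    return max(mean(checkpoint) for checkpoint in zip(*runs))

print("mean of per-seed max:", mean_of_max(success))
print("max of per-checkpoint mean:", max_of_mean(success))
```

With these numbers the first metric is about 0.8 and the second about 0.7. The mean of per-seed maxima can never be lower than the maximum of per-checkpoint means, so which definition is used affects how the reported numbers should be compared.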
