Hi,
How is the maximum score obtained for the MuJoCo tasks? According to the wiki (https://github.com/rail-berkeley/d4rl/wiki/Dataset-Reproducibility-Guide#gym-mujocogym-bullet), the expert dataset appears to be collected with the stochastic SAC policy. In rlkit, however, SAC is evaluated with its deterministic policy, and evaluating with the stochastic policy typically gives noticeably worse performance. So I am not sure whether the reported maximum score is based on the deterministic policy or the stochastic policy.
If the reported score is based on the deterministic policy, should the expert dataset also be collected with the deterministic policy?
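To make the distinction concrete, here is a minimal sketch of the two evaluation modes for a tanh-Gaussian policy, the parameterization commonly used with SAC. The function `select_action` and the example `mean` and `log_std` values are hypothetical and only for illustration; this is not the actual rlkit or D4RL code.

```python
import numpy as np

def select_action(mean, log_std, deterministic, rng):
    """Pick an action from a tanh-Gaussian policy head (illustrative only)."""
    if deterministic:
        # Deterministic evaluation: use the Gaussian mean, no sampling noise.
        pre_tanh = mean
    else:
        # Stochastic rollout: sample from N(mean, std^2) before squashing.
        pre_tanh = mean + np.exp(log_std) * rng.standard_normal(mean.shape)
    # Squash into the bounded action range (-1, 1).
    return np.tanh(pre_tanh)

rng = np.random.default_rng(0)
mean = np.array([0.5, -1.0])      # assumed policy outputs for one state
log_std = np.array([-0.5, 0.0])

det_action = select_action(mean, log_std, deterministic=True, rng=rng)
sto_action = select_action(mean, log_std, deterministic=False, rng=rng)
```

My question is which of these two modes produced the returns behind the reported maximum score, and which one was used to collect the expert data.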
I would highly appreciate any help.