snailrowen1337 closed this issue 4 years ago
Hi, for SAC States we don't use action repeat at all, as per common practice. To account for this difference between DrQ (and other methods that use action repeat) and SAC States, we compare performance in true environment steps (see Appendix B.3):
Action repeat can be thought of as a hyperparameter of the learning algorithm, and for SAC States it happens to be equal to 1.
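To make the comparison concrete, here is a minimal sketch (not code from the repo; variable names are illustrative) of how training steps translate into true environment steps under different action repeats:

```python
# Minimal sketch: converting training (agent) steps to true environment steps
# so that methods with different action repeats can be compared fairly.

def true_env_steps(num_train_steps: int, action_repeat: int) -> int:
    # Each training step advances the environment `action_repeat` times.
    return num_train_steps * action_repeat

# DrQ with, e.g., action_repeat = 4: 125k training steps = 500k env steps.
assert true_env_steps(125_000, 4) == 500_000

# SAC States uses no action repeat (action_repeat = 1),
# so 500k training steps = 500k env steps.
assert true_env_steps(500_000, 1) == 500_000
```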
Ah, that makes perfect sense. This must be the reason for the discrepancy. I will try removing the action repeat, thanks!!
Dear Denis,
Thanks for open-sourcing this, the paper is really cool! I am trying to replicate Table 1 on the PlaNet benchmark and ran into some problems with the SAC-state baseline. I am using your implementation of SAC-state (github.com/denisyarats/pytorch_sac) but fail to reach the reported performance. Was action repeat applied to SAC-state in Table 1? For each environment, I am using frame_skip = action_repeat, where action_repeat comes from Table 2 in the paper. To use only 500,000 environment steps, I set num_train_steps = 500,000 // action_repeat. Am I missing something here? Once I figure this out, I will replicate the DrQ experiments. Thanks!!