Hi @gloaming2dawn, thanks for your interest. A large portion of the difference between LaMCTS and the other baselines on Ant/Humanoid in that figure is due to weight initialization. LaMCTS was initialized at 0 (same as ARS), while the other methods were initialized in a local region around 0, which turns out to make a huge difference in final performance, so the comparison is a bit unfair. We suspect ARS works well for the same reason (initialization at 0). For Ant/Humanoid policy optimization, initializing the linear weights to zero seems to give a large advantage.
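To make the distinction concrete, here is a minimal sketch of the two initialization schemes for a linear Ant policy (the dimensions match Ant-v2's 111-dimensional observations and 8-dimensional actions; the sampling scale is illustrative, not the exact value used in the experiments):

```python
import numpy as np

obs_dim, act_dim = 111, 8  # Ant-v2: 111-d observation, 8-d action -> 888 weights

# LaMCTS / ARS in the original figure: start the linear policy at exactly zero.
theta_zero = np.zeros((act_dim, obs_dim))

# Other baselines: start in a small local region around zero, e.g. by sampling
# each weight uniformly from [-scale, scale] (the scale here is illustrative).
scale = 0.1
theta_local = np.random.uniform(-scale, scale, size=(act_dim, obs_dim))
```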
We are fixing this issue and will post an updated figure in the paper. Thanks for your patience.
Hi @gloaming2dawn, the paper has been updated to resolve this issue: https://arxiv.org/pdf/2007.00708.pdf. Please let us know if you have any further questions. Thanks!
Hi, we recently tried to reproduce the result for the MuJoCo task Ant-v2 from your paper. However, we found that all algorithms, including LaMCTS, cannot find a reward above -1000 after 10000 iterations. (In the paper, LaMCTS reaches a reward above 1000.)
Following the paper, we use the mean and std for the linear policy from https://github.com/modestyachts/ARS and average the reward over 20 rollouts. (We use TuRBO as the local optimizer.)
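For clarity, our evaluation looks roughly like the following sketch (a simplified version under the classic gym API, not your repository's exact code; the epsilon is only there to avoid division by zero):

```python
import gym
import numpy as np

def evaluate_linear_policy(theta, mean, std, n_rollouts=20, env_name="Ant-v2"):
    """Average episode return of the linear policy a = theta @ ((s - mean) / std)."""
    env = gym.make(env_name)
    returns = []
    for _ in range(n_rollouts):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            # ARS-style state normalization with the mean/std statistics from the ARS repo
            action = theta @ ((obs - mean) / (std + 1e-8))
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return float(np.mean(returns))
```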
We actually got results similar to the paper on simpler tasks such as Swimmer, Hopper, and HalfCheetah, but for the 888-dimensional Ant task we cannot reproduce the result in your paper. Can you reproduce the Ant-v2 result with the current code? Is there anything we need to change to reproduce the results?