HumanCompatibleAI / adversarial-policies

Find best-response to a fixed policy in multi-agent RL

About the result in YouShallNotPass experiment #26

Closed: nuwuxian closed this issue 4 years ago

nuwuxian commented 5 years ago

Thanks for your nice work! I tried to reproduce your results by running `multi_train with paper`, but the results I get are far below what your paper reports for the YouShallNotPass experiments. Personally, I think this algorithm is highly stochastic and depends heavily on the random seed: I ran 4 experiments with seeds [0, 1, 2, 3], and only one seed surpassed a 0.6 win rate after 20 million timesteps. I think you should run more seeds when reporting your results.
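To make the sweep concrete, here is a rough sketch of the check I ran over the finished runs (the win-rate numbers below are placeholders for illustration, not my actual results):

```python
# Illustrative per-seed final win rates (placeholder numbers, not real
# results); in practice each value comes from evaluating the adversary
# trained with that seed after 20 million timesteps.
win_rates = {0: 0.42, 1: 0.63, 2: 0.51, 3: 0.38}

threshold = 0.6  # the win rate discussed in this issue
for seed, rate in sorted(win_rates.items()):
    status = "above" if rate > threshold else "below"
    print(f"seed {seed}: final win rate {rate:.2f} ({status} {threshold})")

n_good = sum(rate > threshold for rate in win_rates.values())
print(f"{n_good}/{len(win_rates)} seeds exceeded a {threshold} win rate")
```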

nuwuxian commented 5 years ago

Also, I did not change any hyperparameters when trying to reproduce your results.

AdamGleave commented 5 years ago

We report confidence intervals across 5 random seeds in figure 3. The results in figure 4 are from the best random seed, as stated in the caption. We ran experiments like these several times and got similar results.

Deep RL in general is very sensitive to seed, so I'm not too concerned about the variation we're seeing.

60% is on the low end of our random seeds, though, so it's concerning that's the best result you saw. It's possible some regression has crept in since we ran the experiments. I don't have the bandwidth to investigate this right now, but I will probably be rerunning experiments like this next month, so if I can replicate the issue I'll investigate. In the meantime, let me know if you find anything concrete wrong.
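For reference, the aggregation behind figure 3 is just a mean with a confidence interval over per-seed win rates. A minimal sketch of that computation (the numbers are illustrative, and the t-interval here is an assumption rather than necessarily the exact method our plotting code uses):

```python
import numpy as np
from scipy import stats

# Illustrative final win rates across 5 seeds (placeholder numbers,
# not the actual figure-3 data).
win_rates = np.array([0.55, 0.62, 0.71, 0.48, 0.66])

mean = win_rates.mean()
sem = stats.sem(win_rates)  # standard error of the mean across seeds
# 95% interval under a t-distribution with n - 1 degrees of freedom.
low, high = stats.t.interval(0.95, len(win_rates) - 1, loc=mean, scale=sem)
print(f"mean win rate {mean:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```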

nuwuxian commented 5 years ago

Thank you for your kind response. I have run the experiments several times too (multiple runs of 5 seeds each). In fact, I could get better results, and some random seeds nearly reach a 0.8 win rate. However, the results are in general worse than in the paper. Maybe this is caused by the random seed interacting with the specific machine.