Silent-Zebra / POLA


Inner agent 1 uses agent 2 as a reference #7

Closed: cooijmanstim closed this issue 1 year ago

cooijmanstim commented 1 year ago

Here (and at line 1097) agent 2 is passed in as the reference for agent 1. IIUC, this means the KL penalty pushes inner agent 1 toward agent 2, and worse, agent 2's gradient accounts for this. So agent 2 can influence agent 1 simply by adopting the policy it wants agent 1 to take on? Pardon my speculation; I'm just trying to understand the implications.

https://github.com/Silent-Zebra/POLA/blob/6b07e89317b07d91216db9d02c1f915f9313b66a/jax_files/POLA_dice_jax.py#L1050
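To make the concern concrete, here is a minimal toy sketch (tabular softmax policies and a made-up inner objective, purely illustrative, not the actual code in this repo): if the KL reference is agent 2's live parameters, agent 1's inner objective has a nonzero gradient with respect to agent 2's parameters.

```python
import jax
import jax.numpy as jnp

# Toy tabular policies: one categorical distribution per agent (illustrative only).
def log_probs(params):
    return jax.nn.log_softmax(params)

def inner_objective(inner_params_1, ref_params, beta=1.0):
    # Stand-in for the inner (DiCE-style) return term.
    p1 = jnp.exp(log_probs(inner_params_1))
    reward_term = jnp.sum(p1 * jnp.array([1.0, 0.0]))
    # Proximal penalty KL(pi_1_inner || pi_ref). If ref_params are agent 2's
    # live parameters, this term is differentiable w.r.t. agent 2 as well.
    kl = jnp.sum(p1 * (log_probs(inner_params_1) - log_probs(ref_params)))
    return reward_term - beta * kl

params_1 = jnp.array([0.5, -0.5])
params_2 = jnp.array([2.0, -2.0])

# Nonzero gradient w.r.t. agent 2's parameters, purely through the KL reference term:
print(jax.grad(inner_objective, argnums=1)(params_1, params_2))
```

Since that gradient is nonzero, agent 2's outer update "sees" how its own parameters shape agent 1's inner step, which is exactly the influence channel I was speculating about.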

Silent-Zebra commented 1 year ago

I'm not sure I understand why the KL div is mixed up. At least in the main code, I think the state history is correct, though it does look wrong in the old_kl_div path. Perhaps I should remove that part of the code to make it cleaner, since I didn't end up using it anyway (and perhaps your comment explains why I had some issues with it).

cooijmanstim commented 1 year ago

The line 1050 I referenced above passes agent 2 down into inner_steps_plus_update_otheragent1, which passes it down into inner_step_get_grad_otheragent1, which passes it into in_lookahead here:

https://github.com/Silent-Zebra/POLA/blob/6b07e89317b07d91216db9d02c1f915f9313b66a/jax_files/POLA_dice_jax.py#L566

in_lookahead then uses it to compute the (new-style) KL at

https://github.com/Silent-Zebra/POLA/blob/6b07e89317b07d91216db9d02c1f915f9313b66a/jax_files/POLA_dice_jax.py#L472

Right?

Silent-Zebra commented 1 year ago

Oh I see what you're saying and I think you're right. I will look into this, including the possible implications. Thanks for pointing it out.
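To sketch what I believe the intended behavior is (using the same kind of toy setup as your example above; hypothetical names, not the actual trainstate/RNN code): the reference for agent 1's inner KL should be agent 1's own policy from before the inner steps, held fixed so that no other agent's gradient can flow through it.

```python
import jax
import jax.numpy as jnp

def log_probs(params):
    return jax.nn.log_softmax(params)

def inner_objective(inner_params_1, old_params_1, beta=1.0):
    # Anchor agent 1 to its own pre-inner-loop policy. stop_gradient makes the
    # reference an explicit constant, so the proximal term cannot act as a
    # channel for any other agent's gradient.
    ref_log_probs = jax.lax.stop_gradient(log_probs(old_params_1))
    p1 = jnp.exp(log_probs(inner_params_1))
    reward_term = jnp.sum(p1 * jnp.array([1.0, 0.0]))  # stand-in return term
    kl = jnp.sum(p1 * (log_probs(inner_params_1) - ref_log_probs))
    return reward_term - beta * kl
```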

Silent-Zebra commented 1 year ago

To follow up: I fixed all the issues you raised, added a flag/option for forward vs reverse KL div, and cleaned up the code a bit by removing the old_kl_div stuff. I ran some preliminary tests on the coin game using the same hyperparameters as before. POLA achieves ~70% same coin (over 5 seeds), with a slight difference depending on the KL direction (forward: 68%, reverse: 72%), vs ~80% in the paper; POLA-OM achieves ~70% (over 3 seeds), which is basically the same as in the paper (though those might be lucky seeds); LOLA results are the same as in the paper (over 5 seeds). Of course I would need more seeds to be sure, but since I spent no time at all on further hyperparameter tuning, this gives me some confidence that while the numbers in the paper might change, the overall conclusions shouldn't.
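For concreteness, by forward vs reverse KL I just mean the direction of the penalty relative to the reference policy. A toy sketch (which direction the flag calls "forward" is a naming convention; this is illustrative only, not the repo's actual implementation):

```python
import jax
import jax.numpy as jnp

def kl(p_logits, q_logits):
    # KL(p || q) between two categorical distributions given by their logits.
    p = jax.nn.softmax(p_logits)
    return jnp.sum(p * (jax.nn.log_softmax(p_logits) - jax.nn.log_softmax(q_logits)))

new_logits = jnp.array([1.0, -1.0])  # policy after the update step
ref_logits = jnp.array([0.0, 0.0])   # reference policy at the start of the step

kl_new_to_ref = kl(new_logits, ref_logits)  # KL(pi_new || pi_ref), mode-seeking
kl_ref_to_new = kl(ref_logits, new_logits)  # KL(pi_ref || pi_new), mass-covering
```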

All of this of course assumes there are no further issues. Given the issues you already found, I plan to revisit my code and review it carefully for more. Unfortunately I do not have enough time to do that now, but I should have time in mid-April. After that (and if you or anyone else find any more issues), I plan to rerun the experiments, including with a further hyperparameter search, and update the results in the paper.

Thanks again for your interest in my work, for going through my code carefully, for the corrections, and for helping me improve it; I appreciate your time, effort, and communication.

Silent-Zebra commented 1 year ago

Hello - sorry for the late follow-up. I found some time in late May to review my code further, and I found and fixed an additional bug related to opponent modeling. I reran the experiments over the past month, including with additional hyperparameter tuning, and updated the readme with the new hyperparameters and new plots from the new runs (with the fixed code and new hyperparameters). IPD results are similar (slightly better for POLA), coin-game results are about the same for POLA-DiCE, and POLA-OM is now better in the coin game.

Thanks again for your time and effort in helping me improve my work!