HumanCompatibleAI / adversarial-policies

Find best-response to a fixed policy in multi-agent RL
MIT License

Add support to fine-tune policies from gym_compete #6

Closed AdamGleave closed 5 years ago

AdamGleave commented 5 years ago

Bansal et al. released policy weights and architectures in gym_compete. We already adapted the interface to load an agent and replay it (used in score_agent and to embed the victim in train), but it previously lacked support for continuing to train the loaded agent.

This PR adds that support. (Note that most changes are in our fork of gym_compete; this is just glue code.)
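Continuing training from restored weights amounts to loading fixed parameters into a policy and then resuming gradient updates on them. Here is a toy sketch of that pattern in plain Python, not the actual gym_compete glue; `LinearPolicy`, `load_pretrained`, and `finetune` are hypothetical names, and plain gradient descent on squared error stands in for resuming RL training on the restored graph:

```python
class LinearPolicy:
    """Toy stand-in for a loaded gym_compete policy: a 1-D linear map obs -> action."""
    def __init__(self, weight, bias):
        self.weight = weight
        self.bias = bias

    def act(self, obs):
        return self.weight * obs + self.bias

def load_pretrained():
    # Pretend to restore released weights (here: fixed values).
    return LinearPolicy(weight=1.0, bias=0.5)

def finetune(policy, obs, target, lr=0.1, steps=100):
    """Resume training the loaded policy with gradient descent on
    squared error; the point is that the restored parameters are
    ordinary trainable variables, not frozen constants."""
    for _ in range(steps):
        err = policy.act(obs) - target
        policy.weight -= lr * err * obs  # d(0.5*err^2)/dw
        policy.bias -= lr * err          # d(0.5*err^2)/db
    return policy

policy = load_pretrained()
before = (policy.act(2.0) - 0.0) ** 2  # squared error before fine-tuning
finetune(policy, obs=2.0, target=0.0)
after = (policy.act(2.0) - 0.0) ** 2   # squared error after fine-tuning
```

The key design point mirrored here is that loading and training are decoupled: replay (as in score_agent) only needs `act`, while fine-tuning additionally treats the restored parameters as trainable.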

codecov[bot] commented 5 years ago

Codecov Report

Merging #6 into master will decrease coverage by 0.13%. The diff coverage is 76.36%.


@@            Coverage Diff             @@
##           master       #6      +/-   ##
==========================================
- Coverage   73.99%   73.86%   -0.14%     
==========================================
  Files          29       29              
  Lines        2023     2020       -3     
==========================================
- Hits         1497     1492       -5     
- Misses        526      528       +2
Flag         Coverage Δ
#aprl        26.23% <0%> (+0.03%) ↑
#modelfree   56.43% <76.36%> (-0.17%) ↓

Impacted Files                            Coverage Δ
src/modelfree/train.py                    90.36% <47.61%> (-1.66%) ↓
src/modelfree/gym_compete_conversion.py   96.77% <94.11%> (+1.72%) ↑

Continue to review full report at Codecov.

Last update daa8bb0...b2acbb7.

AdamGleave commented 5 years ago

A quick test running experiments/modelfree/score-old-vs-new-zoo.sh at ae33117dccf5a76701ab926cc2331f65b042a65c shows no significant differences in overall win rate (testing new vs new, new vs old, old vs new, old vs old). There are some differences, though, which I think are related to random sampling in TensorFlow depending on operation seeds. (I tried to pin this down more precisely, but TensorFlow does not make reproducibility easy.)
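The "no significant differences" claim can be spot-checked with a standard two-proportion z-test on the wincounts in the logs below. A minimal pure-Python sketch (the counts are the KickAndDefend-v0 new-vs-new and old-vs-old results, 1000 episodes each; `two_proportion_z` is a hypothetical helper, not part of the repo):

```python
from math import sqrt, erf

def two_proportion_z(wins_a, n_a, wins_b, n_b):
    """Two-sided two-proportion z-test with pooled standard error."""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    p = (wins_a + wins_b) / (n_a + n_b)            # pooled win rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))   # pooled standard error
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF, Phi(z) = 0.5*(1 + erf(z/sqrt(2))).
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# KickAndDefend-v0: agent 0 won 669/1000 (new vs new) and 667/1000 (old vs old).
z, p_value = two_proportion_z(669, 1000, 667, 1000)
```

With win rates this close the test statistic is near zero and the p-value far above any conventional threshold, consistent with the differences being seeding noise rather than a behavioral change.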

*** KickAndDefend-v0 ***                                                                                                                                                                              

==> data/score-old-vs-new/parallel/a_type/zoo/b_type/zoo/env_name/multicomp_KickAndDefend-v0/stderr <==
INFO - score - Result: {'ties': 17, 'wincounts': [669, 314]}
INFO - score - Completed after 0:12:14

==> data/score-old-vs-new/parallel/a_type/zoo/b_type/zoo_old/env_name/multicomp_KickAndDefend-v0/stderr <==
INFO - score - Result: {'ties': 16, 'wincounts': [667, 317]}
INFO - score - Completed after 0:12:24

==> data/score-old-vs-new/parallel/a_type/zoo_old/b_type/zoo/env_name/multicomp_KickAndDefend-v0/stderr <==
INFO - score - Result: {'ties': 19, 'wincounts': [686, 295]}
INFO - score - Completed after 0:12:09

==> data/score-old-vs-new/parallel/a_type/zoo_old/b_type/zoo_old/env_name/multicomp_KickAndDefend-v0/stderr <==
INFO - score - Result: {'ties': 14, 'wincounts': [667, 319]}
INFO - score - Completed after 0:12:10

*** SumoHumans-v0 ***

==> data/score-old-vs-new/parallel/a_type/zoo/b_type/zoo/env_name/multicomp_SumoHumans-v0/stderr <==
INFO - score - Result: {'ties': 10, 'wincounts': [828, 162]}
INFO - score - Completed after 0:09:04

==> data/score-old-vs-new/parallel/a_type/zoo/b_type/zoo_old/env_name/multicomp_SumoHumans-v0/stderr <==
INFO - score - Result: {'ties': 15, 'wincounts': [830, 155]}
INFO - score - Completed after 0:08:58

==> data/score-old-vs-new/parallel/a_type/zoo_old/b_type/zoo/env_name/multicomp_SumoHumans-v0/stderr <==
INFO - score - Result: {'ties': 18, 'wincounts': [825, 157]}
INFO - score - Completed after 0:09:03

==> data/score-old-vs-new/parallel/a_type/zoo_old/b_type/zoo_old/env_name/multicomp_SumoHumans-v0/stderr <==
INFO - score - Result: {'ties': 20, 'wincounts': [809, 171]}
INFO - score - Completed after 0:09:07

*** RunToGoalHumans-v0 ***

==> data/score-old-vs-new/parallel/a_type/zoo/b_type/zoo/env_name/multicomp_RunToGoalHumans-v0/stderr <==
INFO - score - Result: {'ties': 269, 'wincounts': [277, 454]}
INFO - score - Completed after 0:04:11

==> data/score-old-vs-new/parallel/a_type/zoo/b_type/zoo_old/env_name/multicomp_RunToGoalHumans-v0/stderr <==
INFO - score - Result: {'ties': 274, 'wincounts': [274, 452]}
INFO - score - Completed after 0:04:08

==> data/score-old-vs-new/parallel/a_type/zoo_old/b_type/zoo/env_name/multicomp_RunToGoalHumans-v0/stderr <==
INFO - score - Result: {'ties': 248, 'wincounts': [302, 450]}
INFO - score - Completed after 0:04:06

==> data/score-old-vs-new/parallel/a_type/zoo_old/b_type/zoo_old/env_name/multicomp_RunToGoalHumans-v0/stderr <==
INFO - score - Result: {'ties': 270, 'wincounts': [280, 450]}
INFO - score - Completed after 0:04:03

*** YouShallNotPassHumans-v0 ***

==> data/score-old-vs-new/parallel/a_type/zoo/b_type/zoo/env_name/multicomp_YouShallNotPassHumans-v0/stderr <==
INFO - score - Result: {'ties': 0, 'wincounts': [497, 503]}
INFO - score - Completed after 0:04:25

==> data/score-old-vs-new/parallel/a_type/zoo/b_type/zoo_old/env_name/multicomp_YouShallNotPassHumans-v0/stderr <==
INFO - score - Result: {'ties': 0, 'wincounts': [511, 489]}
INFO - score - Completed after 0:04:21

==> data/score-old-vs-new/parallel/a_type/zoo_old/b_type/zoo/env_name/multicomp_YouShallNotPassHumans-v0/stderr <==
INFO - score - Result: {'ties': 0, 'wincounts': [476, 524]}
INFO - score - Completed after 0:04:20

==> data/score-old-vs-new/parallel/a_type/zoo_old/b_type/zoo_old/env_name/multicomp_YouShallNotPassHumans-v0/stderr <==
INFO - score - Result: {'ties': 0, 'wincounts': [474, 526]}
INFO - score - Completed after 0:04:16
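The `Result:` lines above are Python dict literals, so win rates can be extracted from the stderr logs mechanically rather than by eye. A small sketch of doing so (`parse_result` is a hypothetical helper, not part of the repo):

```python
import ast
import re

def parse_result(line):
    """Extract the {'ties': ..., 'wincounts': [...]} dict from a score log line."""
    match = re.search(r"Result: (\{.*\})", line)
    return ast.literal_eval(match.group(1)) if match else None

line = "INFO - score - Result: {'ties': 17, 'wincounts': [669, 314]}"
result = parse_result(line)
wins_a, wins_b = result['wincounts']
episodes = result['ties'] + wins_a + wins_b  # total episodes played
win_rate_a = wins_a / episodes               # agent 0's win rate
```

Using `ast.literal_eval` (rather than `eval`) keeps parsing safe, since it only accepts Python literals.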