AdamGleave commented 5 years ago

Creating PR mainly to simplify code review, but we will want to merge in eventually.

codecov[bot] commented 5 years ago

Codecov Report

Merging #23 into master will increase coverage by 0.23%. The diff coverage is 68.56%.

@@            Coverage Diff             @@
##           master      #23      +/-   ##
==========================================
+ Coverage   61.41%   61.65%   +0.23%     
==========================================
  Files          55       56       +1     
  Lines        4901     5234     +333     
==========================================
+ Hits         3010     3227     +217     
- Misses       1891     2007     +116

Impacted Files	Coverage Δ
src/aprl/policies/loader.py	`81.9% <100%> (ø)`	:arrow_up:
src/aprl/training/logger.py	`100% <100%> (ø)`	:arrow_up:
src/aprl/training/shaping_wrappers.py	`95.69% <100%> (ø)`	:arrow_up:
src/aprl/envs/gym_compete.py	`97.16% <100%> (ø)`	:arrow_up:
src/aprl/training/embedded_agents.py	`95.08% <100%> (ø)`
src/aprl/policies/wrappers.py	`97.5% <100%> (+7.5%)`	:arrow_up:
src/aprl/policies/base.py	`87.65% <100%> (+1.73%)`	:arrow_up:
src/aprl/training/lookback.py	`77.83% <100%> (ø)`	:arrow_up:
tests/test_experiments.py	`97.77% <100%> (+0.05%)`	:arrow_up:
src/aprl/configs/multi/train.py	`16.89% <16.86%> (+1.03%)`	:arrow_up:
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update d6f147c...8f7b76e. Read the comment docs.

decodyng commented 5 years ago

Still waiting to see how if the test suite passes, but I believe I've now resolved or responded to almost all comments. Exceptions to that are:

I'm reticent to merge CurryVecEnv with MultiCurryVecEnv because I worry it'll add a lot of somewhat confusing functionality to the CurryVecEnv case which isn't actually necessary to understanding how CurryVecEnv performs in the majority of cases. (I'm also holding off on figuring out a solution to the problem of needing to pass in obs X num_envs to a policy prediction until this bit gets determined)
I have a question about your preferences around implementing get_policy(), which is in a comment on that part of the code

AdamGleave commented 5 years ago

The test failures seem to be legitimate, I get a TensorFlow shape error when I run python -m modelfree.train. This needs to be fixed before merging, and I'd also like to see some simple unit test for [Multi]CurryVecEnv.

decodyng commented 5 years ago

Today I learned about some of the fun edge cases are involved in getting state management to work here. Some of these include:

When training is started, state is globally None, and even stateful policies can take in state=None as an initial condition
However, it's not possible to have some of the envs have vector state and others to have state=None in a stateful policy, which is annoying for cases where we've just switched on a given policy in a given env and don't have state for it yet; I decided to treat newly-started env/policy combos as if they were envs where that policy is not active, and fill in all zeros of the right shape which matches the initial_state value of the policies.
As an aside, it turns out that gym compete policies have the attribute initial_state and stable baselines one have the attribute . _initial_state, which is why I didn't just directly rely on that here

decodyng commented 5 years ago

Thoughts on adversary paths:

I think the fact of paths being auto-generated by highest_win_rate.py as the experiment is being run is part of what makes the current system difficult to follow and reason about. I'd advocate for having a location within the adversarial-policies directory that stores the highest win-rate policies for a given run (e.g. paper/20190429_011349). Since those seem like those won't vary for a given run with fixed/trained adversaries, it'd be nice to just have that be fixed top-level metadata somewhere.
The fact of needing to set ADVERSARY_PATHS as an environment variable also seems like something it might be nice to modify. If we had a fixed-within-data location for the adversary paths JSON to be, it seems like it'd be possible to specify as part of the named config which training run
(e.g. paper/) to use, and have that generate a path.

AdamGleave commented 5 years ago

Thoughts on adversary paths:

I think the fact of paths being auto-generated by highest_win_rate.py as the experiment is being run is part of what makes the current system difficult to follow and reason about. I'd advocate for having a location within the adversarial-policies directory that stores the highest win-rate policies for a given run (e.g. paper/20190429_011349). Since those seem like those won't vary for a given run with fixed/trained adversaries, it'd be nice to just have that be fixed top-level metadata somewhere.

I agree with this criticism. I originally envisaged highest_win_rate.py as potentially pulling in adversaries from multiple training runs, but that's not the common case. I'd be fine with storing a JSON along with each run.

We could even make it so that rather than specifying a path to the json, we specify a path containing adversary policies. And then if the json exists in that path, use it (cached). Otherwise, invoke highest_win_rate.py (as a Python function, not as a script).

The fact of needing to set ADVERSARY_PATHS as an environment variable also seems like something it might be nice to modify. If we had a fixed-within-data location for the adversary paths JSON to be, it seems like it'd be possible to specify as part of the named config which training run (e.g. paper/) to use, and have that generate a path. The environment variable is a horrible hack. It's because named configs get evaluated before CLI arguments for Sacred. So we can't access the CLI argument directly.

Having one named config for each possible ADVERSARY_PATHS would solve this, but I worry it'd lead to an explosion in configs.

HumanCompatibleAI / adversarial-policies

Finetuning victim as defence method #23

Codecov Report