cathywu opened this pull request 7 years ago
Testing done (the following interpret, run, and train):

nose2 tests.test_baselines_action
nose2 tests.test_baselines
New summary of changes:
- `BatchPolopt` and `BaseSampler` use the `baseline.action_dependent` attribute to determine whether to compute 1 or k sets of advantages; a sketch of this dispatch follows the list.
- Walker2d-v1 env. The "ish" is because the parameters were tuned for the half cheetah env.
- `test_baseline_action.py` for both action-dependent baselines.
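A minimal sketch of the 1-vs-k advantage dispatch described above. The `action_dependent` attribute is from this PR; `process_advantages`, `predict_k`, and the surrounding structure are illustrative stand-ins for the actual `BatchPolopt`/`BaseSampler` code:

```python
import numpy as np

def discount_cumsum(x, discount):
    # Reverse discounted cumulative sum (rllab ships an equivalent helper).
    out = np.zeros(len(x))
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + discount * running
        out[t] = running
    return out

def process_advantages(paths, baseline, discount=0.99):
    for path in paths:
        returns = discount_cumsum(path["rewards"], discount)
        if getattr(baseline, "action_dependent", False):
            # k sets of advantages, one per action dimension; each column
            # uses a baseline that may condition on the *other* dimensions.
            values = baseline.predict_k(path)  # hypothetical, shape (T, k)
            path["advantages"] = returns[:, None] - values
        else:
            # Single set of advantages from a state-only baseline.
            values = baseline.predict(path)  # shape (T,)
            path["advantages"] = returns - values
    return paths
```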
Main issue: There's negligible improvement over the tuned TRPO (from QProp).
[Figure: training curves. Blue is the action-dependent baseline. The run is still going, so this shows only the first 600 or so iterations.]
Issues:
1) (EDIT: RESOLVED) The computed explained variance is 0 for all but the first baseline. This is puzzling; I'm not sure why. (A sketch of the diagnostic appears after this list.)
2) The surrogate loss `surr_loss` takes different values when computed "manually" vs. via TF, as per the following debug outputs:

[debug outputs]

I'd like to understand where the discrepancy comes from. I perform the manual computation in `npo_action.py:155`; a minimal repro of the comparison is sketched after this list.
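On issue 1, for reference, a standalone version of the explained-variance diagnostic (along the lines of rllab's `special.explained_variance_1d`; this exact function is illustrative):

```python
import numpy as np

def explained_variance_1d(ypred, y):
    """1 - Var[y - ypred] / Var[y].

    1.0 means the baseline predicts the targets perfectly, 0.0 means it does
    no better than predicting the mean, and negative values mean it is worse.
    """
    vary = np.var(y)
    if np.isclose(vary, 0):
        # Degenerate case: constant targets.
        return 0.0
    return 1.0 - np.var(y - ypred) / vary
```

A value pinned at exactly 0 for all but the first baseline would be consistent with those baselines being scored against degenerate (constant) targets or predictions, though that is only a guess.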
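On issue 2, a hypothetical minimal repro of the manual-vs-TF comparison (TF1-style; the data and the exact loss form are stand-ins, not the actual computation at `npo_action.py:155`):

```python
import numpy as np
import tensorflow as tf

# Fake batch standing in for sampled data.
adv = np.random.randn(1000).astype(np.float32)
logli_new = (np.random.randn(1000) * 0.1).astype(np.float32)
logli_old = logli_new - 0.01  # slightly perturbed "old" log-likelihoods

# Manual (NumPy) surrogate loss: -E[exp(logli_new - logli_old) * A].
manual = -np.mean(np.exp(logli_new - logli_old) * adv)

# The same objective built from TF ops.
lr = tf.exp(tf.constant(logli_new) - tf.constant(logli_old))
surr_loss = -tf.reduce_mean(lr * tf.constant(adv))

with tf.Session() as sess:
    print("manual:", manual, "tf:", sess.run(surr_loss))
# Agreement up to float32 rounding is expected; a larger gap means the graph
# is computing a different quantity (different advantages, normalization, or
# log-likelihoods) than the manual path.
```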
Sanity check: I implemented the policy factorization first (without changing the baseline at all), and confirmed that training is unaffected by the factorization alone, which is good.

[Figure: the curves are about the same; the green line is the implementation supporting policy factorization.]
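For context on the factorization itself, a sketch under the assumption of a diagonal-Gaussian policy, whose joint log-likelihood splits into per-dimension terms (illustrative; not the PR's policy classes, and the surrogate form shown is just one plausible way to pair dimensions with advantages):

```python
import numpy as np

def gaussian_logli_per_dim(actions, means, log_stds):
    # Per-dimension log-likelihoods of a diagonal Gaussian, shape (N, k).
    z = (actions - means) / np.exp(log_stds)
    return -0.5 * z ** 2 - log_stds - 0.5 * np.log(2.0 * np.pi)

# Because the dimensions are independent, the joint log-likelihood is the sum
# of the per-dimension terms, so a factorized surrogate can pair dimension i's
# likelihood ratio with its own advantage column A_i:
#
#   L = -sum_i E[ exp(logli_new_i - logli_old_i) * A_i ]
#
# At the old parameters, the gradient of this sum with identical A_i reduces
# to the gradient of the usual single-ratio surrogate, which is why training
# should be unaffected by factorization alone, matching the plot above.
```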