cathywu opened this pull request 7 years ago
Testing done (the following interpret, run, and train):

nose2 tests.test_baselines_action
nose2 tests.test_baselines
New summary of changes:
- `BatchPolopt` and `BaseSampler` use the `baseline.action_dependent` attribute to determine whether to compute 1 or k sets of advantages; a sketch of this dispatch follows the list.
- Walker2d-v1 env. The "ish" is because the parameters were tuned for the half cheetah env.
- `test_baseline_action.py` for both action-dependent baselines.
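A minimal sketch of the 1-vs-k advantage dispatch described above. The `action_dependent` attribute is from this PR; `process_advantages`, `predict_k`, and the surrounding structure are illustrative stand-ins for the actual `BatchPolopt`/`BaseSampler` code:

```python
import numpy as np

def discount_cumsum(x, discount):
    # Reverse discounted cumulative sum (rllab ships an equivalent helper).
    out = np.zeros(len(x))
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + discount * running
        out[t] = running
    return out

def process_advantages(paths, baseline, discount=0.99):
    for path in paths:
        returns = discount_cumsum(path["rewards"], discount)
        if getattr(baseline, "action_dependent", False):
            # k sets of advantages, one per action dimension; each column
            # uses a baseline that may condition on the *other* dimensions.
            values = baseline.predict_k(path)  # hypothetical, shape (T, k)
            path["advantages"] = returns[:, None] - values
        else:
            # Single set of advantages from a state-only baseline.
            values = baseline.predict(path)  # shape (T,)
            path["advantages"] = returns - values
    return paths
```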
Main issue: There's negligible improvement over the tuned TRPO (from QProp).
[Figure: training curves. Blue is the action-dependent baseline. The run is still going, so this shows only the first 600 or so iterations.]
Issues:
1) (EDIT: RESOLVED) The computed explained variance is 0 for all but the first baseline. This is puzzling; I'm not sure why. (A sketch of the diagnostic appears after this list.)
2) The surrogate loss `surr_loss` takes different values when computed "manually" vs. via TF, as per the following debug outputs:

[debug outputs]

I'd like to understand where the discrepancy comes from. I perform the manual computation in `npo_action.py:155`; a minimal repro of the comparison is sketched after this list.
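On issue 1, for reference, a standalone version of the explained-variance diagnostic (along the lines of rllab's `special.explained_variance_1d`; this exact function is illustrative):

```python
import numpy as np

def explained_variance_1d(ypred, y):
    """1 - Var[y - ypred] / Var[y].

    1.0 means the baseline predicts the targets perfectly, 0.0 means it does
    no better than predicting the mean, and negative values mean it is worse.
    """
    vary = np.var(y)
    if np.isclose(vary, 0):
        # Degenerate case: constant targets.
        return 0.0
    return 1.0 - np.var(y - ypred) / vary
```

A value pinned at exactly 0 for all but the first baseline would be consistent with those baselines being scored against degenerate (constant) targets or predictions, though that is only a guess.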
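On issue 2, a hypothetical minimal repro of the manual-vs-TF comparison (TF1-style; the data and the exact loss form are stand-ins, not the actual computation at `npo_action.py:155`):

```python
import numpy as np
import tensorflow as tf

# Fake batch standing in for sampled data.
adv = np.random.randn(1000).astype(np.float32)
logli_new = (np.random.randn(1000) * 0.1).astype(np.float32)
logli_old = logli_new - 0.01  # slightly perturbed "old" log-likelihoods

# Manual (NumPy) surrogate loss: -E[exp(logli_new - logli_old) * A].
manual = -np.mean(np.exp(logli_new - logli_old) * adv)

# The same objective built from TF ops.
lr = tf.exp(tf.constant(logli_new) - tf.constant(logli_old))
surr_loss = -tf.reduce_mean(lr * tf.constant(adv))

with tf.Session() as sess:
    print("manual:", manual, "tf:", sess.run(surr_loss))
# Agreement up to float32 rounding is expected; a larger gap means the graph
# is computing a different quantity (different advantages, normalization, or
# log-likelihoods) than the manual path.
```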
Sanity check: I implemented the policy factorization first (without changing the baseline at all), and confirmed that training is unaffected by the factorization alone, which is good.

[Figure: the curves are about the same; the green line is the implementation supporting policy factorization.]
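For context on the factorization itself, a sketch under the assumption of a diagonal-Gaussian policy, whose joint log-likelihood splits into per-dimension terms (illustrative; not the PR's policy classes, and the surrogate form shown is just one plausible way to pair dimensions with advantages):

```python
import numpy as np

def gaussian_logli_per_dim(actions, means, log_stds):
    # Per-dimension log-likelihoods of a diagonal Gaussian, shape (N, k).
    z = (actions - means) / np.exp(log_stds)
    return -0.5 * z ** 2 - log_stds - 0.5 * np.log(2.0 * np.pi)

# Because the dimensions are independent, the joint log-likelihood is the sum
# of the per-dimension terms, so a factorized surrogate can pair dimension i's
# likelihood ratio with its own advantage column A_i:
#
#   L = -sum_i E[ exp(logli_new_i - logli_old_i) * A_i ]
#
# At the old parameters, the gradient of this sum with identical A_i reduces
# to the gradient of the usual single-ratio surrogate, which is why training
# should be unaffected by factorization alone, matching the plot above.
```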