cathywu / rllab

rllab is a framework for developing and evaluating reinforcement learning algorithms, fully compatible with OpenAI Gym.

Action-dependent baseline #1

Open cathywu opened 7 years ago

cathywu commented 7 years ago

Testing done: The following interprets, runs, and trains.

python3 examples/walker_tf_comparison.py

Main issue: There's negligible improvement over the tuned TRPO (from QProp).

[Figure: trpo qprop vs action-dependent baseline]

Blue has the action-dependent baseline. It's still running, so this is just showing the first 600 or so iterations.

Issues: 1) (EDIT: RESOLVED) The computed explained variance is 0 for every baseline except the first, and I'm not sure why.

2017-04-16 21:15:13.146890 PDT | AverageReturn               0.162546
2017-04-16 21:15:13.149222 PDT | ExplainedVariance[0]        0.257389
2017-04-16 21:15:13.149519 PDT | ExplainedVariance[1]        0
2017-04-16 21:15:13.149814 PDT | ExplainedVariance[2]        0
2017-04-16 21:15:13.150105 PDT | ExplainedVariance[3]        0
2017-04-16 21:15:13.150382 PDT | ExplainedVariance[4]        0
2017-04-16 21:15:13.150664 PDT | ExplainedVariance[5]        0
2017-04-16 21:15:13.150933 PDT | NumTrajs                  248

2) The surrogate loss surr_loss has different values when computed "manually" versus through TensorFlow, as the following debug output shows:

CATHYWU surr_loss 1.33298e-08
cathywu loss -2.42878e-08
cathywu mean loss_vec -2.42878e-08

I'd like to understand where the discrepancy comes from. I perform the manual computation in npo_action.py:155.
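One hypothesis worth ruling out, offered as a sketch rather than a diagnosis: if the advantages are centered to zero mean and the loss is evaluated at the current policy parameters, the likelihood ratio is 1 and the surrogate loss reduces to the negative mean advantage, which is zero up to floating-point rounding. Values on the order of 1e-8 in single precision, possibly with differing signs, would then be rounding noise rather than a real disagreement. The function `surrogate_loss` and the synthetic data below are illustrative assumptions and are not taken from npo_action.py.

```python
import numpy as np

def surrogate_loss(logp_new, logp_old, advantages):
    """TRPO-style surrogate: negative mean of likelihood ratio times advantage."""
    ratio = np.exp(logp_new - logp_old)
    return -np.mean(ratio * advantages)

rng = np.random.default_rng(0)
n = 5000
logp = rng.standard_normal(n)   # hypothetical per-sample log-probabilities
adv = rng.standard_normal(n)
adv = adv - adv.mean()          # assumption: advantages are centered

# Same policy on both sides, so the ratio is exactly 1 everywhere.
loss64 = surrogate_loss(logp, logp, adv)
loss32 = surrogate_loss(
    logp.astype(np.float32), logp.astype(np.float32), adv.astype(np.float32)
)

print("float64 loss:", loss64)
print("float32 loss:", loss32)
```

If both numbers are tiny compared with the typical magnitude of an advantage, a check with uncentered advantages or a perturbed policy would be a more informative comparison between the manual and TensorFlow computations.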

Sanity check: I implemented the policy factorization first (without changing the baseline at all), and confirmed that the training is unaffected by the policy factorization, which is good.

[Figure: trpo qprop with and without policy factorization]

They are about the same; specifically, the green line refers to the implementation supporting policy factorization.
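The sanity check above is what one would expect if the policy is a diagonal Gaussian: its joint log-density is the sum of independent per-dimension log-densities, so factorizing it changes the bookkeeping but not the likelihood. The sketch below illustrates that identity under this assumption. The function names are hypothetical and do not come from this branch.

```python
import numpy as np

def logp_per_dim(actions, means, log_stds):
    """Log-density of each action dimension under its own 1-D Gaussian."""
    z = (actions - means) / np.exp(log_stds)
    return -0.5 * z ** 2 - log_stds - 0.5 * np.log(2 * np.pi)

def logp_joint(actions, means, log_stds):
    """Log-density of the full action under a diagonal Gaussian."""
    k = actions.shape[-1]
    z = (actions - means) / np.exp(log_stds)
    return (-0.5 * np.sum(z ** 2, axis=-1)
            - np.sum(log_stds, axis=-1)
            - 0.5 * k * np.log(2 * np.pi))

rng = np.random.default_rng(0)
batch, action_dim = 4, 6        # illustrative sizes only
actions = rng.standard_normal((batch, action_dim))
means = rng.standard_normal((batch, action_dim))
log_stds = 0.1 * rng.standard_normal((batch, action_dim))

factored = logp_per_dim(actions, means, log_stds)   # shape (batch, action_dim)
joint = logp_joint(actions, means, log_stds)        # shape (batch,)

# Summing the factors over action dimensions recovers the joint log-density.
print(np.allclose(factored.sum(axis=-1), joint))
```

Keeping the per-dimension terms separate is what allows each factor to be paired with its own baseline later, while the summed value used by the unmodified objective stays the same.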

cathywu commented 7 years ago

Testing done:

nose2 tests.test_baselines_action
nose2 tests.test_baselines

cathywu commented 7 years ago

New summary of changes:

cathywu commented 7 years ago

New summary of changes:

Testing done:

nose2 tests.test_baselines
nose2 tests.test_baselines_action