Simple envs for sanity checking

cathywu / rllab

rllab is a framework for developing and evaluating reinforcement learning algorithms, fully compatible with OpenAI Gym.

Other

1 stars 0 forks source link

Simple envs for sanity checking #9

Open cathywu opened 7 years ago

cathywu commented 7 years ago

No state env: there is no state, and the goal is to minimize the action vector.
Multiaction point env: there is 1 agent, the overall action is the sum over k actions.
Multiagent point env with collisions: same multiagent point env as before but terminates when two point masses get sufficiently close.

Experiment snapshot from commit 43f7576.

python3 examples/cluster_multiagent_point_comparison.py

cathywu commented 7 years ago

Preliminary results [exp cluster-multiagent-v10]

Summary: the action-dependent baseline does not improve/regress on training (the training curves look identical across all experiments so far), even though in the case of NoStateEnv with k=6, its explained variance is much larger.

NoStateEnv, with k=6, average return: 2017-04-26-nostateenv-k6

NoStateEnv, with k=50, average return: 2017-04-26-nostateenv-k50

MultiactionPointEnv, with k=6, average return: 2017-04-26-multiactionpointenv-k6

NoStateEnv, with k=6, explained variance: 2017-04-26-nostateenv-k6-explainedvariance Furthermore, as a bonus, this somewhat confirms that I didn't accidentally just run with the same (non action dependent baseline) for all the experiments.

cathywu commented 7 years ago

New env: OneStepNoStateEnv (commit efb7e17)

OneStepNoStateEnv, with k=6, average return (the two curves are exactly overlapping): 2017-04-26-onestepnostateenv-k6

OneStepNoStateEnv, with k=6, explained variance (huhhh?): 2017-04-26-onestepnostateenv-k6-explainedvariance

cathywu commented 7 years ago

Possible reasons action-dependent (AD) baseline not effective

Envs are too easy.
Envs are too low dimensional. [try k=200]
We're taking so many samples per batch (on such simple problems), such that variance is sufficiently reduced enough to not need a AD baseline? [try batch sizes of 100, 500, 1000, 5000 -- NOTE: these are currently running]

cathywu commented 7 years ago

OneStepNoStateEnv, with k=6, batch size=100, average return (the two curves are exactly overlapping): 2017-04-26-onestepnostateenv-k6-batch100

OneStepNoStateEnv, with k=6, batch size=100, explained variance (huhhhhhh): 2017-04-26-onestepnostateenv-k6-batch100-explainedvariance

The high explained variance (values of 1) of the GaussianMLPBaseline results from low variance among the return values, after a good policy has been learned, e.g.:

>> np.var(returns)
8.584473133996777e-09

The baseline is then overfitting to the only value it sees.

cathywu commented 7 years ago

Summary: ZeroBaseline (with whitening) does better! [exp cluster-multiagent-v11]

OneStepNoStateEnv, with k=6: 2017-04-28-zerobaselinerocksonestepenv

OneStepNoStateEnv, with k=50: 2017-04-28-zerobaselinerocksonestepenv-k50

OneStepNoStateEnv, with k=200: 2017-04-28-zerobaselinerocksonestepenv-k200

Hypothesis: poor fits of the NN baselines (because the linear feature baselines below seem to appear to match the performance of the whitened ZeroBaseline here).

Recommendation: Ignore the NN baseline results in these runs.

From Rocky: It shouldn't be a fitting problem. The center adv option should undo everything the baseline does, since it recenters the advantage estimates.

OneStepNoStateEnv, with k=200, holdout loss for the NN baseline (vf): 2017-04-29-onestepnostateenv-k200-holdoutloss-vf

OneStepNoStateEnv, with k=200, holdout loss for the NN baseline (vf0): 2017-04-29-onestepnostateenv-k200-holdoutloss-vf0

cathywu commented 7 years ago

Summary [exp cluster-multiagent-v12]

Experiment: OneStepNoStateEnv
ZeroBaseline (without whitening) does worse as expected.
In low dimensions, linear baselines basically match performance of the whitened ZeroBaseline (above).
In higher dimensions, the action dependent linear baseline does indeed perform better than the linear baseline.

I'm now running experiments with high dimensions to see a more pronounced effect (k=1000, 2000). EDIT (2017-05-02): Added figures for k=500, 1000, 2000.

OneStepNoStateEnv, with k=6, no whitening (center_adv=False): 2017-04-29-onestepnostateenv-nowhitening-k6 OneStepNoStateEnv, with k=50, no whitening (center_adv=False): OneStepNoStateEnv, with k=200, no whitening (center_adv=False) (First positive result!): OneStepNoStateEnv, with k=500, no whitening (center_adv=False) (Positive result!): 2017-05-02-onestepnostateenv-nowhitening-k500 OneStepNoStateEnv, with k=1000, no whitening (center_adv=False) (Less conclusive.): 2017-05-19-onestepnostateenv-nowhitening-k1000 OneStepNoStateEnv, with k=2000, no whitening (center_adv=False) (Hitting scaling issues.):

cathywu commented 7 years ago

Summary [exp cluster-multiagent-v13]

Setting: Done condition for the MultiagentPointEnv and MultiactionPointEnv is reaching the origin.
MultiactionPointEnv: linear baseline does better than action-dependent baseline in higher dimensions. (We would have expected the opposite. Why might there be regression?)
MultiagentPointEnv: action dependent baseline does significantly better than linear baseline in higher dimensions. (Second positive result!)
MultiagentPointEnv: Does the action dependent baseline improve sample efficiency? I.e. can we get away with smaller batch sizes? The last 3 figures show that this experiment is totally inconclusive on this problem. (Recommendation: run more trials of this experiment.)

MultiactionPointEnv, k=6, no whitening, done=reach origin: 2017-05-02-multiactionpointenv-doneisreachorigin-k6 MultiactionPointEnv, k=1000, no whitening, done=reach origin:

MultiagentPointEnv, k=6, no whitening, done=reach origin: 2017-05-02-multiagentpointenv-doneisreachorigin-k6 MultiagentPointEnv, k=50, no whitening, done=reach origin: MultiagentPointEnv, k=200, no whitening, done=reach origin: MultiagentPointEnv, k=500, no whitening, done=reach origin:

MultiagentPointEnv, k=200, no whitening, done=reach origin, batch size=100: 2017-05-02-multiagentpointenv-doneisreachorigin-k200-batch_size100 MultiagentPointEnv, k=200, no whitening, done=reach origin, batch size=500: MultiagentPointEnv, k=200, no whitening, done=reach origin, batch size=1000: