Closed nickuncaged1201 closed 3 years ago
Can confirm we're seeing the same issue. @nickuncaged1201 please report back if you figure out any settings that actually learn... Thanks.
Hi, you need to train for more than 50k steps. Try at least a few million steps. In case it still doesn't train, report back and I'll reopen the issue.
As of writing this, I have trained it for 5 million steps. The training return right now is about -20; -19 is the highest I have seen so far. The essential training settings are still the same, with only minor changes to log and eval frequency. Is this considered an improvement with more steps?
Here is what mine looks like after 2.9M steps. My returns are consistent with what you're reporting @nickuncaged1201 :
Not a bad strategy actually if it can start connecting. Will update if/when I get to 5M+
@danijar If you wouldn't mind advising: my team and I have now trained two separate models to 8M+ steps with the default settings on Pong and are still seeing no improvement in game score. Judging from the chart in Appendix F of the paper, it appears that by 8M steps we should be close to the region of rapid improvement on Pong. Is this the expected behavior? I realize we're still at only 4% of the 200M frames reported in the paper, but Appendix F makes it look like we should already be seeing results on Pong by this point. Would appreciate your input. Thank you. A shot of a few of the graphs and eval videos is attached (at the current timestep the agent has again begun holding the "down" button).
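For anyone comparing learning curves at this stage, the returns can also be read straight out of the JSONL metrics that the training script writes to the logdir, rather than eyeballing TensorBoard. Below is a minimal sketch, assuming the logdir contains a `metrics.jsonl` file with `step` and `train_return` fields (as in the released dreamerv2 logger); adjust the file name and keys if your logs differ.

```python
# Minimal sketch: plot episode return over environment steps from the
# JSONL metrics file in the logdir. Assumes the file is named metrics.jsonl
# and contains 'step' and 'train_return' entries; rename the key if your
# log uses a different field.
import json
import pathlib

import matplotlib.pyplot as plt

logdir = pathlib.Path('logdir/atari_pong')  # adjust to your --logdir
steps, returns = [], []
for line in (logdir / 'metrics.jsonl').read_text().splitlines():
    record = json.loads(line)
    if 'train_return' in record:
        steps.append(record['step'])
        returns.append(record['train_return'])

plt.plot(steps, returns)
plt.xlabel('env steps')
plt.ylabel('train return')
plt.savefig('pong_return.png')
```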
Discussion continuing here: https://github.com/danijar/dreamerv2/issues/8
Thanks for the updated release. I just downloaded the code and made a fresh environment as detailed in the README. I tried training with everything set to default by simply running `python dreamerv2/train.py --logdir ./logdir/atari_pong --configs defaults atari --task atari_pong`. After 50k steps, the return doesn't seem to increase at all. A random policy on Atari Pong should score around -20, and that is all I am getting so far. Any suggestion on why this is the case?
Here is the configs.yaml in case you need it. The only change I made to the code is the `steps` values on lines 8 and 77, which I reduced to 1e7. Even at this smaller number of steps, I would expect to see some improvement in return.
```yaml
defaults:

  # Train Script
  logdir: /dev/null
  seed: 0
  task: dmc_walker_walk
  num_envs: 1
  steps: 1e7
  eval_every: 1e5
  action_repeat: 1
  time_limit: 0
  prefill: 10000
  image_size: [64, 64]
  grayscale: False
  replay_size: 2e6
  dataset: {batch: 50, length: 50, oversample_ends: True}
  train_gifs: False
  precision: 16
  jit: True

  # Agent
  log_every: 1e4
  train_every: 5
  train_steps: 1
  pretrain: 0
  clip_rewards: identity
  expl_noise: 0.0
  expl_behavior: greedy
  expl_until: 0
  eval_noise: 0.0
  eval_state_mean: False

  # World Model
  pred_discount: True
  grad_heads: [image, reward, discount]
  rssm: {hidden: 400, deter: 400, stoch: 32, discrete: 32, act: elu, std_act: sigmoid2, min_std: 0.1}
  encoder: {depth: 48, act: elu, kernels: [4, 4, 4, 4], keys: [image]}
  decoder: {depth: 48, act: elu, kernels: [5, 5, 6, 6]}
  reward_head: {layers: 4, units: 400, act: elu, dist: mse}
  discount_head: {layers: 4, units: 400, act: elu, dist: binary}
  loss_scales: {kl: 1, reward: 1, discount: 1}
  kl: {free: 0.0, forward: False, balance: 0.8, free_avg: True}
  model_opt: {opt: adam, lr: 3e-4, eps: 1e-5, clip: 100, wd: 1e-6}

  # Actor Critic
  actor: {layers: 4, units: 400, act: elu, dist: trunc_normal, min_std: 0.1}
  critic: {layers: 4, units: 400, act: elu, dist: mse}
  actor_opt: {opt: adam, lr: 1e-4, eps: 1e-5, clip: 100, wd: 1e-6}
  critic_opt: {opt: adam, lr: 1e-4, eps: 1e-5, clip: 100, wd: 1e-6}
  discount: 0.99
  discount_lambda: 0.95
  imag_horizon: 15
  actor_grad: both
  actor_grad_mix: '0.1'
  actor_ent: '1e-4'
  slow_target: True
  slow_target_update: 100
  slow_target_fraction: 1

  # Exploration
  expl_extr_scale: 0.0
  expl_intr_scale: 1.0
  expl_opt: {opt: adam, lr: 3e-4, eps: 1e-5, clip: 100, wd: 1e-6}
  expl_head: {layers: 4, units: 400, act: elu, dist: mse}
  disag_target: stoch
  disag_log: True
  disag_models: 10
  disag_offset: 1
  disag_action_cond: True
  expl_model_loss: kl

atari:

  task: atari_pong
  time_limit: 108000  # 30 minutes of game play.
  action_repeat: 4
  steps: 1e7
  eval_every: 1e5
  log_every: 1e5
  prefill: 200000
  grayscale: True
  train_every: 16
  clip_rewards: tanh
  rssm: {hidden: 600, deter: 600, stoch: 32, discrete: 32}
  actor.dist: onehot
  model_opt.lr: 2e-4
  actor_opt.lr: 4e-5
  critic_opt.lr: 1e-4
  actor_ent: 1e-3
  discount: 0.999
  actor_grad: reinforce
  actor_grad_mix: 0
  loss_scales.kl: 0.1
  loss_scales.discount: 5.0
  .*.wd$: 1e-6

dmc:

  task: dmc_walker_walk
  time_limit: 1000
  action_repeat: 2
  eval_every: 1e4
  log_every: 1e4
  prefill: 5000
  train_every: 5
  pretrain: 100
  pred_discount: False
  grad_heads: [image, reward]
  rssm: {hidden: 200, deter: 200}
  model_opt.lr: 3e-4
  actor_opt.lr: 8e-5
  critic_opt.lr: 8e-5
  actor_ent: 1e-4
  discount: 0.99
  actor_grad: dynamics
  kl.free: 1.0
  dataset.oversample_ends: False

debug:

  jit: False
  time_limit: 100
  eval_every: 300
  log_every: 300
  prefill: 100
  pretrain: 1
  train_steps: 1
  dataset.batch: 10
  dataset.length: 10
```
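On the `--configs defaults atari` behavior: the train script layers the listed sections in order, with later sections overriding earlier keys, and dotted keys like `model_opt.lr` updating nested values. Below is a rough sanity-check sketch of the merged values, assuming PyYAML is installed and the file sits at `dreamerv2/configs.yaml`; note that this flat merge leaves the dotted keys unresolved, unlike the project's own config code.

```python
# Rough sanity check (not the project's own config code): load configs.yaml,
# layer the 'defaults' and 'atari' sections the way --configs defaults atari
# does, and print a few effective values. Dotted keys such as 'model_opt.lr'
# are kept as flat strings here, whereas train.py folds them into the
# nested dicts.
import pathlib

import yaml  # pip install pyyaml

configs = yaml.safe_load(pathlib.Path('dreamerv2/configs.yaml').read_text())
merged = dict(configs['defaults'])
merged.update(configs['atari'])  # later sections override earlier keys

for key in ('steps', 'prefill', 'train_every', 'actor_grad', 'discount'):
    print(key, '=', merged[key])
```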