fikricanozgur opened this issue 2 years ago
Hello, I would appreciate it if you could comment on this question when you find the time. If you do not support this kind of inquiry, kindly let me know and I will understand. Have a nice day.
Hi, sorry for the delay, I've been busy these days. I'll answer your question as soon as possible.
I am running several trainings to see if I can reproduce your curves. I want to make sure it's not an implementation error or a bug (in either panda-gym or SB3). If it's clear on this side, then my guess is that, for some runs, the agent is not exploring its environment enough. Maybe increasing the action noise would give better results.
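For reference, here's a minimal sketch of what increasing the action noise could look like with SB3's TQC on this task (the sigma value is illustrative, not a tuned hyperparameter):

```python
import gym
import numpy as np
import panda_gym  # noqa: F401 -- registers PandaPush-v1
from sb3_contrib import TQC
from stable_baselines3.common.noise import NormalActionNoise

env = gym.make("PandaPush-v1")
n_actions = env.action_space.shape[0]
# sigma=0.2 is a guess; larger values mean more exploration
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.2 * np.ones(n_actions))

model = TQC("MultiInputPolicy", env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=200_000)
```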
PandaPush-v1 TQC: [plot of success-rate curves]
I have run 6 trainings on PandaPush-v1 with TQC, and I did not get such variability in the results. I advise you to double-check that the hyperparameters of your two trainings are identical.
Alright, thanks for looking into it. I will check the hyperparameters again and see if I can reproduce your results.
> using the corresponding hyper-parameters in this repository
I also suspect that you did not use the hyperparameters advised in the zoo, because even your best run on Push is well below the curves I got. (At 200k timesteps: about 60% versus 20%)
Yes, that is rather surprising to me as well. I am working on a project and using SB3 for training; it might be that I changed something in the code or in the hyperparameters and then forgot about it. I will repeat the trainings on a fresh installation and report the results back here.
PandaPush-v1 TQC: [plot of success-rate curves]
@qgallouedec could you share the runs and the command line you use? (so it's easier for @fikricanozgur to reproduce runs) maybe would be good to move the runs to openrlbenchmark (ask @vwxyzjn for access), we have a SB3 workspace here: https://wandb.ai/openrlbenchmark/sb3
I remember having some failures in the past (though not often), so I think it might depend on the random seed too.
> @qgallouedec could you share the runs and the command line you use?
python train.py --algo tqc --env PandaPush-v1
(I also added --track --wandb-project-name panda -n 500000, but it shouldn't affect the results.)
> maybe would be good to move the runs to openrlbenchmark (ask @vwxyzjn for access), we have a SB3 workspace here: https://wandb.ai/openrlbenchmark/sb3
I will :+1:
Some updates: the 7th run seems to replicate your problem.
For some seeds, it seems that we are reaching a kind of divergence situation. Here are the losses: [plot of loss curves]
> For some seeds, it seems that we are reaching a kind of divergence situation.
I've experienced that with SAC (and its derivatives) in the past; there are different solutions:
- using expln instead of exp for the standard deviation (the solution used with gSDE) to avoid explosion

EDIT: maybe doing L2 regularization on the weights would help, for instance using the "weight_decay" parameter of AdamW (Adam's implementation of weight decay is not completely right, if I recall).
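A minimal sketch of what those two mitigations could look like together with sb3-contrib's TQC (values are illustrative, not the zoo's tuned hyperparameters):

```python
import torch as th
import panda_gym  # noqa: F401 -- registers PandaPush-v1
from sb3_contrib import TQC

model = TQC(
    "MultiInputPolicy",
    "PandaPush-v1",
    use_sde=True,  # gSDE exploration
    policy_kwargs=dict(
        use_expln=True,  # expln instead of exp for the std, to avoid explosion
        optimizer_class=th.optim.AdamW,  # decoupled weight decay (L2 regularization)
        optimizer_kwargs=dict(weight_decay=0.01),  # PyTorch AdamW default
    ),
)
model.learn(total_timesteps=500_000)
```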
I also ran some tests and you can see them here.
In summary, I trained 8 agents on PandaPush-v1 with TQC: 4 of them with the default hyperparameters and the remaining 4 with a slight change, namely offline sampling of HER transitions instead of online sampling. I also trained a couple more agents with and without gSDE.
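For anyone wanting to reproduce that variation, a sketch of the offline-sampling change with the SB3 1.6.x API (the online_sampling flag was removed in later SB3 releases; the other kwargs are illustrative):

```python
import panda_gym  # noqa: F401 -- registers PandaPush-v1
from sb3_contrib import TQC
from stable_baselines3 import HerReplayBuffer

model = TQC(
    "MultiInputPolicy",
    "PandaPush-v1",
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
        online_sampling=False,  # sample HER transitions offline instead of online
    ),
)
```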
Results show that:
L2 regularization on the weights (using AdamW) combined with gSDE seems to solve the divergent behavior observed previously. Thanks @araffin.
Hello, good to hear =) Could you share the hyperparameters used? (My guess is that L2 regularization should be enough; gSDE might not be needed, especially as we are not using it as intended.)
Hi,
I used the TQC algorithm with the default hyperparameters from the repo and turned gSDE on. I changed the optimizer to AdamW, keeping its default weight decay of 0.01. Another change I made was to lower the distance_threshold of the Push task to 5 mm (it is 5 cm by default). I also made some algorithmic changes as part of the project I am working on; I am not sure how much they contributed.
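For context, panda-gym v1 does not expose distance_threshold through gym.make, so tightening it means modifying the task object directly; a sketch, with the attribute path being my assumption about panda-gym's internals:

```python
import gym
import panda_gym  # noqa: F401 -- registers PandaPush-v1

env = gym.make("PandaPush-v1")
# the Push task keeps its success threshold on the task object (0.05 m by default)
env.unwrapped.task.distance_threshold = 0.005  # 5 mm
```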
@qgallouedec after some quick trials, I think we should change the default optimizer to AdamW.
For the figure below, I used:
python train.py --algo tqc --env PandaPush-v1 -P --seed 2
vs
python train.py --algo tqc --env PandaPush-v1 -P --seed 2 -params policy_kwargs:'dict(net_arch=[512, 512, 512], n_critics=2, optimizer_class=th.optim.AdamW)'
so the only difference is the optimizer; the seeds for runs 2-3 and 4-5 are the same (seed=1915415480 and seed=2). Runs 2 and 4 use Adam, runs 3 and 5 use AdamW:
We would need more seeds to be sure (I couldn't reproduce the explosion using the W&B seed of your run, though), but if the symptom is loss explosion, then AdamW with weight decay should definitely be a good solution.
Thanks for the suggestion. I’ll evaluate more precisely the impact of this new optimiser on the results (it is a good first use case of openrlbenchmark!). If it is confirmed, we will change the optimiser.
Hey @qgallouedec fwiw, you can tag the experiments with version control information, then use openrlbenchmark to filter the experiments by that tag. For example, you could tag the current sb3 experiments with WANDB_TAGS=sb3-1.6.2, then tag the experiments with the new optimiser with pr-322. Finally, you can do:
python -m openrlbenchmark.rlops \
    --filters '?we=openrlbenchmark&wpn=sb3&ceik=env&cen=algo&metric=rollout/ep_rew_mean' \
        'tqc?tag=sb3-1.6.2' \
        'tqc?tag=pr-322' \
    --env-ids HalfCheetahBulletEnv-v0 \
    --ncols 1 \
    --ncols-legend 2 \
    --output-filename compare.png \
    --report
We have a feature to tag experiments automatically here (https://github.com/vwxyzjn/cleanrl/blob/b558b2b48326d8bb8da7fa853914576ae4610f53/cleanrl_utils/benchmark.py#L38-L62)
See https://cleanrl-d978juk2k-vwxyzjn.vercel.app/advanced/rlops/ for more detail on the workflow.
Great, thanks!! I think it would be nice to add something similar in rl-zoo3.
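For reference, a minimal sketch of what such an autotag helper could look like in rl-zoo3, modeled on the CleanRL version linked above (the function name and exact behavior are illustrative):

```python
import subprocess

def autotag() -> str:
    """Return a git-derived tag like 'v1.6.2-12-g1a2b3c4', or '' if git fails."""
    try:
        return subprocess.check_output(
            ["git", "describe", "--tags"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return ""

# e.g. pass tags=[autotag()] to wandb.init() when tracking is enabled
```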
❓ Question
Hello,
Training the Panda tasks (push, slide, pick & place) with TQC using the corresponding hyper-parameters in this repository sometimes does not converge to any meaningful behaviour, and the success rate stays very low; see the plots below. I wonder if this is normal, and whether there are other parameters I can modify to make the trainings more consistent?
Note that I am using v1.6.2.
Thanks!