DLR-RM / rl-baselines3-zoo

A training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.
https://rl-baselines3-zoo.readthedocs.io
MIT License

[Question] Inconsistent training of Panda manipulation tasks #322

Open fikricanozgur opened 1 year ago

fikricanozgur commented 1 year ago

❓ Question

Hello,

Training the Panda tasks (push, slide, pick&place) with TQC using the corresponding hyperparameters in this repository sometimes does not converge to any meaningful behaviour, and the success rate stays very low (see the plots below). I wonder if this is normal, and if there are other parameters I can modify to make the trainings more consistent?

Note that I am using v1.6.2.

Thanks!

[W&B charts: success-rate curves for the pick&place and slide runs, 11/29/2022]


fikricanozgur commented 1 year ago

Hello, I would appreciate it if you could comment on this question when you find the time. If you do not support this kind of inquiry, then kindly let me know and I will understand. Have a nice day.

qgallouedec commented 1 year ago

Hi, sorry for the delay, I've been busy these days. I'll answer your question as soon as possible.

qgallouedec commented 1 year ago

I am running several trainings to see if I can reproduce your curves. I want to make sure it's not an implementation error or a bug (in either panda-gym or SB3). If things look fine on that side, then my guess is that for some runs the agent is not exploring its environment enough. Maybe increasing the action noise would give better results.
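
For illustration, here is a minimal sketch (not the zoo's actual configuration) of adding Gaussian action noise to TQC on this task; it assumes panda-gym and sb3-contrib are installed, and the sigma of 0.2 is an arbitrary value chosen only for the example:

import gym
import numpy as np
import panda_gym  # noqa: F401, registers the Panda environments
from sb3_contrib import TQC
from stable_baselines3.common.noise import NormalActionNoise

env = gym.make("PandaPush-v1")
n_actions = env.action_space.shape[0]
# Gaussian noise added to the actions during training to encourage exploration
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.2 * np.ones(n_actions))

model = TQC("MultiInputPolicy", env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=50_000)

In the zoo itself, the same effect can usually be obtained through the noise_type and noise_std entries of the hyperparameter file.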

qgallouedec commented 1 year ago

PandaPush-v1 TQC: [W&B chart: PandaPush-v1 TQC success-rate curves]

I have run 6 trainings on PandaPush-v1 with TQC, and I did not get such variability in the results. I advise you to double-check that the hyperparameters of your two trainings are identical.

fikricanozgur commented 1 year ago

Alright, thanks for looking into it. I will check the hyperparameters again and see if I can reproduce your results.

qgallouedec commented 1 year ago

> using the corresponding hyper-parameters in this repository

I also suspect that you did not use the hyperparameters advised in the zoo, because even your best run on Push is well below the curves I got. (At 200k timesteps: about 60% versus 20%)

fikricanozgur commented 1 year ago

Yes, that is rather surprising to me as well. I am working on a project and using SB3 for trainings; it might be that I changed something in the code or in the hyperparameters and then forgot about it. I will repeat the trainings on a fresh installation and report the results back here.

araffin commented 1 year ago

> PandaPush-v1 TQC:

@qgallouedec could you share the runs and the command line you used? (so it's easier for @fikricanozgur to reproduce the runs) Maybe it would also be good to move the runs to openrlbenchmark (ask @vwxyzjn for access); we have an SB3 workspace here: https://wandb.ai/openrlbenchmark/sb3

I remember having some failures in the past (but not often), so I think it might depend on the random seed too.

qgallouedec commented 1 year ago

> @qgallouedec could you share the runs and the command line you use?

python train.py --algo tqc --env PandaPush-v1

(I also added --track --wandb-project-name panda -n 500000 but it shouldn't affect the results.)

> maybe would be good to move the runs to openrlbenchmark (ask @vwxyzjn for access), we have a SB3 workspace here: https://wandb.ai/openrlbenchmark/sb3

I will :+1:

qgallouedec commented 1 year ago

Some updates: the 7th run seems to replicate your problem.

[W&B chart: success-rate curves for the PandaPush-v1 TQC runs, 12/8/2022]

For some seeds, it seems that we are reaching a kind of divergence situation. Here are the losses:

[Screenshot: loss curves, 2022-12-08]

araffin commented 1 year ago

> For some seeds, it seems that we are reaching a kind of divergence situation.

I've experienced that with SAC (and its derivatives) in the past; there are different solutions:

EDIT: maybe L2 regularization on the weights would help, e.g. using the "weight_decay" parameter of AdamW (Adam's implementation of weight decay is not completely right, if I recall correctly).
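
For concreteness, a minimal sketch of how one might pass AdamW to TQC via SB3's policy_kwargs; the weight_decay value of 0.01 is simply PyTorch's default for AdamW, not a tuned setting:

import panda_gym  # noqa: F401, registers the Panda environments
import torch as th
from sb3_contrib import TQC

# Swap Adam for AdamW, which applies decoupled weight decay to the network weights
policy_kwargs = dict(
    optimizer_class=th.optim.AdamW,
    optimizer_kwargs=dict(weight_decay=0.01),
)
model = TQC("MultiInputPolicy", "PandaPush-v1", policy_kwargs=policy_kwargs)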

fikricanozgur commented 1 year ago

I also ran some tests and you can see them here.

In summary, I trained 8 agents on PandaPush-v1 with TQC: 4 of them with the default hyperparameters, and the remaining 4 with a slight change, namely offline sampling of HER transitions instead of online sampling. I also trained a couple more agents with and without gSDE.
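
For reference, a minimal sketch of what the offline-sampling variant could look like with SB3 1.6.x (where HerReplayBuffer still has the online_sampling flag); the goal-selection values below are the usual defaults and are shown only as an illustration:

import gym
import panda_gym  # noqa: F401, registers the Panda environments
from sb3_contrib import TQC
from stable_baselines3 import HerReplayBuffer

env = gym.make("PandaPush-v1")
model = TQC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",
        online_sampling=False,  # offline sampling of HER transitions (flag removed in SB3 2.x)
    ),
)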

Results show that:

fikricanozgur commented 1 year ago

L2 regularization on the weights (using AdamW) with gSDE seems to solve the divergent behavior observed previously. Thanks @araffin.

[W&B chart: success-rate curves with AdamW + gSDE]

araffin commented 1 year ago

Hello, good to hear =) Could you share the hyperparameters you used? (my guess is that L2 regularization alone should be enough; gSDE might not be needed, especially as we are not using it as intended)

fikricanozgur commented 1 year ago

Hi,

I used the TQC algorithm with the default hyperparameters in the repo and turned gSDE on. I changed the optimizer to AdamW with its default weight decay of 0.01. Another change I made was to lower the distance_threshold of the Push task to 5 mm (it is 5 cm by default). I also had some algorithmic changes as part of the project I am working on; I am not sure how much they contributed.
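
For completeness, a rough sketch of that setup (the attribute path used to change distance_threshold is an assumption about panda-gym's internals and may need adjusting; the project-specific algorithmic changes mentioned above are of course not included):

import gym
import panda_gym  # noqa: F401, registers the Panda environments
import torch as th
from sb3_contrib import TQC

env = gym.make("PandaPush-v1")
# Assumption: panda-gym exposes the task object on the unwrapped env
env.unwrapped.task.distance_threshold = 0.005  # 5 mm instead of the default 5 cm

model = TQC(
    "MultiInputPolicy",
    env,
    use_sde=True,  # gSDE enabled
    policy_kwargs=dict(optimizer_class=th.optim.AdamW),  # AdamW's default weight_decay is 0.01
)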

araffin commented 1 year ago

@qgallouedec after some quick trials, I think we should change the default optimizer to AdamW. For the figure below, I used:

python train.py --algo tqc --env PandaPush-v1 -P --seed 2

vs

python train.py --algo tqc --env PandaPush-v1 -P --seed 2 -params policy_kwargs:'dict(net_arch=[512, 512, 512], n_critics=2, optimizer_class=th.optim.AdamW)'

so the only difference is the optimizer; the seeds for runs 2-3 and 4-5 are the same (seed=1915415480 and seed=2). Runs 2 and 4 are with Adam, runs 3 and 5 are with AdamW: [W&B chart: training success rate, Adam vs AdamW]

We would need more seeds to be sure (I couldn't reproduce the explosion using the W&B seed of your run, though), but if the symptom is a loss explosion, then AdamW with weight decay should definitely be a good solution.

qgallouedec commented 1 year ago

Thanks for the suggestion. I’ll evaluate more precisely the impact of this new optimiser on the results (it is a good first use case of openrlbenchmark!). If it is confirmed, we will change the optimiser.

vwxyzjn commented 1 year ago

Hey @qgallouedec, fwiw you can tag the experiments with version-control information, then use openrlbenchmark to filter the experiments by the new tag. For example, you could tag the current SB3 experiments with WANDB_TAGS=sb3-1.6.2, then tag the experiments with the new optimiser with pr-322. Finally, you can do

python -m openrlbenchmark.rlops \
    --filters '?we=openrlbenchmark&wpn=sb3&ceik=env&cen=algo&metric=rollout/ep_rew_mean' \
        'tqc?tag=sb3-1.6.2' \
        'tqc?tag=pr-322' \
    --env-ids HalfCheetahBulletEnv-v0 \
    --ncols 1 \
    --ncols-legend 2 \
    --output-filename compare.png \
    --report

We have a feature to tag experiments automatically here (https://github.com/vwxyzjn/cleanrl/blob/b558b2b48326d8bb8da7fa853914576ae4610f53/cleanrl_utils/benchmark.py#L38-L62)

See https://cleanrl-d978juk2k-vwxyzjn.vercel.app/advanced/rlops/ for more detail on the workflow.
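
For rl-zoo3, a minimal sketch (not the cleanrl implementation linked above) of deriving W&B tags from version-control information before launching a tracked run; the project name "panda" and the "pr-322" tag are taken from the examples in this thread and are placeholders:

import subprocess
import wandb

def git_describe() -> str:
    # Closest git tag, falling back to the short commit SHA
    try:
        return subprocess.check_output(["git", "describe", "--tags", "--always"], text=True).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"

wandb.init(project="panda", tags=[f"rl-zoo3-{git_describe()}", "pr-322"])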

qgallouedec commented 1 year ago

Great, thanks!! I think it would be nice to add something similar in rl-zoo3.