HumanCompatibleAI / imitation

Clean PyTorch implementations of imitation and reward learning algorithms
https://imitation.readthedocs.io/
MIT License

Ensure all tutorials work as expected #763

Closed · ernestum closed this issue 10 months ago

ernestum commented 1 year ago

Some of the tutorials contain hyperparameters that are not well optimized. Also, in some cases we say things like "increase this value to x to get actually good results". We should verify that those claims are actually true.

jas-ho commented 1 year ago

@ernestum note that https://github.com/HumanCompatibleAI/imitation/pull/766 replaces the training of expert policies in the tutorials with downloading pretrained policies from HF. So some (most?) instances of "increase this value to x to get actually good results" will be gone with these changes.
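For reference, downloading a pretrained expert from the Hub looks roughly like the sketch below. This is one way to pull such a policy, not necessarily how #766 does it, and the repo id and filename are assumptions for illustration.

```python
# Sketch only: repo_id and filename are assumed, not taken from the tutorials.
from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO

# Download the checkpoint file from the Hugging Face Hub to a local path.
checkpoint_path = load_from_hub(
    repo_id="HumanCompatibleAI/ppo-seals-CartPole-v0",  # assumed model repo
    filename="ppo-seals-CartPole-v0.zip",                # assumed checkpoint name
)

# Load the pretrained expert policy into an SB3 PPO object.
expert = PPO.load(checkpoint_path)
```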

jas-ho commented 1 year ago

Also potentially relevant for this issue: performance problems with GAIL and AIRL as used in the tutorials, see https://github.com/HumanCompatibleAI/imitation/pull/766#discussion_r1285616315

ernestum commented 1 year ago

Thanks! I will re-check those when #766 is merged!

michalzajac-ml commented 1 year ago

Hey, I am looking into this issue and just wanted to confirm the criteria for solving it. My plan would be, for each tutorial:

  1. If it's possible to quickly (say, < 2 min) reach good performance (close to the expert), then just do it.
  2. Otherwise, include "quick" and "slow" versions, so that the slow one reaches expert performance (but still runs in, say, < 1 h); see the sketch after this list.
  3. If reaching expert performance in (2) is not easily possible, still include "quick" and "slow" versions and report how good the results are.
  4. Perform hyperparameter searches to optimize both the "quick" and "slow" versions.
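
Concretely, for (2) I am imagining a simple toggle at the top of each tutorial, something like the sketch below (names and numbers are illustrative only, not tuned values):

```python
# Sketch of a "quick"/"slow" switch for a tutorial cell; values are placeholders.
FAST = True  # set to False to reproduce near-expert results

if FAST:
    total_timesteps = 50_000       # should finish in a couple of minutes
    total_comparisons = 500
else:
    total_timesteps = 2_000_000    # closer to the budget needed for good results
    total_comparisons = 5_000
```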

Does that sound right, @ernestum @AdamGleave?

AdamGleave commented 1 year ago

Closed in error -- still need to fix the preference comparison tutorials; working on this in #771

lukasberglund commented 1 year ago

> 5a_train_preference_comparisons_with_cnn (crashes due to missing Atari)

I can run this notebook fine without any issues related to missing Atari. Maybe the crash occurs when installing packages with `pip install -e .` instead of `pip install -e ".[dev]"`, since the former doesn't include Atari?

lukasberglund commented 1 year ago

I was playing around with 5a_train_preference_comparisons_with_cnn.ipynb and it seems like `evaluate_policy` and `rollout.rollout` generate different returns, even when both are given the same environment. As shown in the image below, `rollout.rollout` produces larger returns more often.

I'm not sure whether this is intentional or a bug. I can investigate further or open a GitHub issue, but I wanted to check with people first in case I'm missing something.

[image: distributions of per-episode returns from `evaluate_policy` vs. `rollout.rollout`]

Edited to add: will check to make sure the same reward function is being used.
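
For concreteness, a minimal sketch of the kind of comparison I mean is below, assuming `agent` is the trained SB3 learner and `venv` the evaluation environment from the notebook. One thing to note: `evaluate_policy` acts deterministically by default, while `rollout.rollout` samples actions stochastically, which alone could shift the two distributions.

```python
import numpy as np
from stable_baselines3.common.evaluation import evaluate_policy

from imitation.data import rollout

rng = np.random.default_rng(0)

# Per-episode returns according to SB3's evaluate_policy (deterministic by default).
eval_returns, _ = evaluate_policy(
    agent, venv, n_eval_episodes=20, return_episode_rewards=True
)

# Per-episode returns according to imitation's rollout.rollout
# (actions sampled stochastically by default).
trajectories = rollout.rollout(
    agent,
    venv,
    rollout.make_sample_until(min_episodes=20),
    rng=rng,
    unwrap=False,  # use the rewards exactly as emitted by `venv`
)
rollout_returns = [traj.rews.sum() for traj in trajectories]

print("evaluate_policy:", np.mean(eval_returns), "+/-", np.std(eval_returns))
print("rollout.rollout:", np.mean(rollout_returns), "+/-", np.std(rollout_returns))
```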

lukasberglund commented 1 year ago

Another thing regarding 5a_train_preference_comparisons_with_cnn.ipynb: with the default hyperparameters it gets an average reward of around 0.75-1.5. This PPO policy on Hugging Face gets around 2.1.

I've tried tweaking the hyperparameters (increasing the length of both reward-model training and learner training) but wasn't able to noticeably increase performance. If it would be useful, I could do a hyperparameter sweep and update the hyperparameters.
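
If that would help, the sweep could be as simple as the sketch below (Optuna-based). Here `train_and_evaluate` is a hypothetical helper that would run the tutorial's training with the given settings and return the learner's mean evaluation reward, and the search ranges are guesses rather than tuned values.

```python
# Rough sketch of a small hyperparameter sweep; not from the repo.
import optuna


def objective(trial: optuna.Trial) -> float:
    # Search ranges are illustrative guesses, not tuned values.
    total_comparisons = trial.suggest_int("total_comparisons", 2_000, 20_000, log=True)
    total_timesteps = trial.suggest_int("total_timesteps", 100_000, 2_000_000, log=True)
    learner_lr = trial.suggest_float("learner_lr", 1e-5, 1e-3, log=True)

    # Hypothetical helper: runs the 5a tutorial's training loop with these
    # settings and returns the learner's mean evaluation reward.
    return train_and_evaluate(
        total_comparisons=total_comparisons,
        total_timesteps=total_timesteps,
        learner_lr=learner_lr,
    )


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```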