ernestum closed this issue 10 months ago.
@ernestum note that https://github.com/HumanCompatibleAI/imitation/pull/766 replaces the training of expert policies in the tutorials with downloads of pretrained policies from HF, so some (most?) instances of "increase this value to `x` to get actually good results" will be gone with those changes.
Also potentially relevant for this issue: performance problems with GAIL and AIRL as used in the tutorials, see https://github.com/HumanCompatibleAI/imitation/pull/766#discussion_r1285616315
Thanks! I will re-check those when #766 is merged!
Hey, I am looking into this issue and just wanted to confirm the criteria for solving it. So my plan would be, for each tutorial:
Does that sound right, @ernestum @AdamGleave?
Closed in error -- we still need to fix the preference comparison tutorials; working on it in #771.
`5a_train_preference_comparisons_with_cnn` (crashes due to missing Atari)
I can run this notebook fine without any issues related to missing Atari. Maybe this occurs when installing packages with `pip install -e .` instead of `pip install -e ".[dev]"`, since the former doesn't include the Atari dependencies?
I was playing around with `5a_train_preference_comparisons_with_cnn.ipynb` and it seems like `evaluate_policy` and `rollout.rollout` generate different returns, even when both are given the same environment. As shown in the image below, `rollout.rollout` yields larger returns more often.
I'm not sure if this is intentional or a bug. I can investigate this more or create a GitHub issue, but I wanted to check with people first in case I'm missing something.
ETA: Will check to make sure the same reward function is being used.
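To make the comparison concrete, here is a minimal sketch of how the two return estimates could be computed side by side. It is not taken from the tutorial: the environment (CartPole instead of the tutorial's Atari env), the briefly-trained PPO learner, and the episode counts are all stand-ins. One thing worth double-checking is the determinism settings: `evaluate_policy` uses deterministic actions by default, whereas (if I'm reading the defaults right) `rollout.rollout` samples stochastically unless a `deterministic_policy` flag is passed, which by itself could shift the return distributions.

```python
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper
from imitation.util.util import make_vec_env

rng = np.random.default_rng(0)

# Stand-in environment and policy (the tutorial uses an Atari env and a CNN policy).
venv = make_vec_env(
    "CartPole-v1",
    n_envs=4,
    rng=rng,
    post_wrappers=[lambda env, _: RolloutInfoWrapper(env)],
)
learner = PPO("MlpPolicy", venv, verbose=0).learn(10_000)
n_episodes = 50

# Returns as measured by SB3's evaluate_policy (deterministic actions by default).
sb3_returns, _ = evaluate_policy(
    learner, venv, n_eval_episodes=n_episodes, return_episode_rewards=True
)

# Returns as measured by imitation's rollout.rollout (stochastic by default).
trajectories = rollout.rollout(
    learner,
    venv,
    rollout.make_sample_until(min_timesteps=None, min_episodes=n_episodes),
    rng=rng,
)
rollout_returns = [float(traj.rews.sum()) for traj in trajectories]

print("evaluate_policy mean return:", np.mean(sb3_returns))
print("rollout.rollout mean return:", np.mean(rollout_returns))
```

If the gap persists after aligning the determinism settings, that would point to something else, e.g. a reward wrapper being applied in only one of the two code paths.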
Another thing regarding `5a_train_preference_comparisons_with_cnn.ipynb`: with the default hyperparameters it gets an average reward of around 0.75-1.5, while this PPO policy on Hugging Face gets around 2.1.
I've tried messing with the hyperparameters (increasing the length of reward model learning and of learner learning) but wasn't able to noticeably increase performance. If it would be useful, I could do a hyperparameter sweep and update the hyperparameters.
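For benchmarking, one option is to load that pretrained policy with `huggingface_sb3` and push it through the same evaluation as the learner. A rough sketch (the `repo_id`, `filename`, and `venv` below are placeholders, not the actual model or environment from the tutorial, and the env preprocessing has to match what the pretrained policy expects):

```python
from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Hypothetical repo_id/filename -- substitute the actual Hugging Face model
# linked above. `venv` is assumed to be the tutorial's evaluation environment,
# wrapped with the same Atari preprocessing the pretrained policy was trained on.
checkpoint = load_from_hub(
    repo_id="org-name/ppo-SomeAtariEnvNoFrameskip-v4",
    filename="ppo-SomeAtariEnvNoFrameskip-v4.zip",
)
pretrained = PPO.load(checkpoint)

mean_reward, std_reward = evaluate_policy(pretrained, venv, n_eval_episodes=10)
print(f"pretrained PPO: {mean_reward:.2f} +/- {std_reward:.2f}")
```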
Some of the tutorials contain hyperparameters that are not quite optimized. Also, in some cases we say "increase this value to `x` to get actually good results". We should verify that those claims are actually true.
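For the "increase this value" claims, one cheap way to verify them is to run the relevant tutorial cell at both the default and the suggested value and compare evaluated returns. A sketch, where `train_tutorial_policy`, the `n_epochs` keyword, and the two values are placeholders for whatever the specific tutorial exposes:

```python
from stable_baselines3.common.evaluation import evaluate_policy

def compare_settings(train_tutorial_policy, venv, values=(1, 100), n_eval_episodes=20):
    """Train the tutorial's policy at each setting and report evaluated returns.

    `train_tutorial_policy` stands in for whichever training cell the claim
    refers to; `values` are the default and the suggested "good" value.
    """
    results = {}
    for value in values:
        policy = train_tutorial_policy(n_epochs=value)  # placeholder kwarg name
        mean, std = evaluate_policy(policy, venv, n_eval_episodes=n_eval_episodes)
        results[value] = (mean, std)
        print(f"n_epochs={value}: return {mean:.2f} +/- {std:.2f}")
    return results
```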