HumanCompatibleAI / imitation

Clean PyTorch implementations of imitation and reward learning algorithms
https://imitation.readthedocs.io/
MIT License

GAIL with human demonstrations #583

Closed · Soha18 closed this issue 1 year ago

Soha18 commented 1 year ago

Hello,

I'm working on training a GAIL agent with a dataset of human expert demonstrations (provided by the robomimic repository: the Lift task in the robosuite simulator). I modified this dataset slightly so that all episodes have the same length, by keeping only the last 50 steps of each one. Unfortunately, I have failed to get good results so far, even with different hyperparameter settings, different discriminator setups, and both PPO and SAC as the generator. Training with RL-generated rollouts for the same environment works better: the robot finishes the task successfully several times. With the human demos, however, the robot at best gets stuck while reaching for the object. Could you please advise on what might cause this and how to solve it?
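
For reference, the preprocessing looks roughly like this (a minimal sketch, not my exact code: the single "object" observation key and the flat field names are simplifications of robomimic's HDF5 layout):

```python
# Truncate each robomimic demo to its last 50 steps and wrap it as an
# imitation trajectory. Field names follow robomimic's HDF5 layout;
# using a single flat "object" observation key is a simplification.
import h5py
import numpy as np
from imitation.data.types import TrajectoryWithRew

HORIZON = 50

def load_demos(path):
    trajectories = []
    with h5py.File(path, "r") as f:
        for demo in f["data"]:
            g = f["data"][demo]
            acts = np.asarray(g["actions"][-HORIZON:])
            obs = np.asarray(g["obs/object"][-HORIZON:])
            next_obs = np.asarray(g["next_obs/object"][-HORIZON:])
            rews = np.asarray(g["rewards"][-HORIZON:], dtype=float)
            # imitation expects T+1 observations for T actions.
            all_obs = np.concatenate([obs, next_obs[-1:]], axis=0)
            trajectories.append(
                TrajectoryWithRew(
                    obs=all_obs, acts=acts, rews=rews,
                    infos=None, terminal=True,
                )
            )
    return trajectories
```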

Thanks in advance Soha

AdamGleave commented 1 year ago

It's unfortunately pretty common to see algorithms like GAIL, BC, etc. work less well with real human data than with synthetic demos.

This is more a research question than an issue with imitation per se, so there's a limit to how much we can help here. If you find another GAIL implementation that works better, though, we'd be interested to see it and troubleshoot ours. Otherwise, this may well be an issue with GAIL in general rather than anything specific to our implementation.

That said, my general inclination for tackling problems like this would be:

  1. Tune RL hyperparameters on this environment with a ground-truth reward to make sure you've got that optimized first.
  2. Tune GAIL hyperparameters on the human demos, keeping the RL hyperparameters fixed (or, if you have plenty of compute, you could tune them as well, using the values found in 1 to set a reasonable search range). We've seen a lot of variance between seeds, so you may want to run multiple seeds for each hyperparameter setting you sample and take an average or a lower confidence bound (see the sketch after this list).
  3. Try simpler environments if they exist. Can you get GAIL working at all on human demos? If not, do other algorithms like BC work? What does the learned policy look like -- is it doing anything sensible when you render rollouts? What do the losses look like -- is it the discriminator that is failing to learn, or the policy?
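
A minimal sketch of the loop in step 2 (not a tuned or official config: the environment name, hyperparameter values, timestep budget, and number of seeds are all placeholders, and `demos` is your list of imitation trajectories):

```python
# Sketch of step 2: fix the RL hyperparameters found in step 1, then
# evaluate one GAIL configuration across several seeds and average the
# returns. "Pendulum-v1" stands in for your environment.
import numpy as np
from imitation.algorithms.adversarial.gail import GAIL
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.util.networks import RunningNorm
from imitation.util.util import make_vec_env
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def run_seed(seed, demos):
    venv = make_vec_env("Pendulum-v1", n_envs=8, rng=np.random.default_rng(seed))
    learner = PPO("MlpPolicy", venv, seed=seed, verbose=0)  # fixed RL HPs from step 1
    reward_net = BasicRewardNet(
        venv.observation_space,
        venv.action_space,
        normalize_input_layer=RunningNorm,
    )
    trainer = GAIL(
        demonstrations=demos,
        demo_batch_size=512,
        gen_replay_buffer_capacity=2048,
        n_disc_updates_per_round=4,
        venv=venv,
        gen_algo=learner,
        reward_net=reward_net,
    )
    trainer.train(total_timesteps=200_000)
    mean_ret, _ = evaluate_policy(learner, venv, n_eval_episodes=20)
    return mean_ret

# demos: list of imitation trajectories (e.g. from a loader like the one above)
returns = [run_seed(s, demos) for s in range(5)]
print(np.mean(returns), np.std(returns))  # average, or use a lower confidence bound
```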

Hope this helps!

Soha18 commented 1 year ago

Thanks for your quick response.

About the GAIL implementation: I will try the variable-horizon option (allow_variable_horizon=True), as suggested in https://arxiv.org/abs/2106.00672 (section 5), when working with human demos. But how should I set demo_batch_size, which is the number of samples in each demonstration and which in this case varies across demos?

Thanks again and have a nice evening

AdamGleave commented 1 year ago

demo_batch_size is just used for training (it's how many expert samples are drawn per discriminator update); it doesn't have to equal the length of the human demos.
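
For example (a quick sketch; `trajectories` is your list of variable-length demos):

```python
# Expert demos are flattened into a single pool of transitions; GAIL's
# demo_batch_size is how many of these are sampled per discriminator
# update, independent of individual episode lengths.
from imitation.data import rollout

transitions = rollout.flatten_trajectories(trajectories)
print(len(transitions))  # total expert transitions across all demos
# Any demo_batch_size up to this total is fine, e.g. 512.
```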

With variable horizons, you're going to run into the issue that the sign of the discriminator output becomes an important hyperparameter, as discussed in our docs. You might be able to mitigate this somewhat by learning a constant to add to the discriminator output when it is used for RL training, so that the system effectively has to learn whether the human demos are trying to finish or prolong the episode.
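
As a rough sketch of that mitigation (simplified: here the offset is a fixed hyperparameter you tune rather than a learned constant, and it assumes imitation's RewardFn interface and RewardVecEnvWrapper):

```python
# Shift the learned reward by a constant at RL time only, leaving the
# discriminator itself untouched. Treating the offset as a tunable
# hyperparameter is a simplification of learning the constant.
import numpy as np
from imitation.rewards.reward_wrapper import RewardVecEnvWrapper

def shifted(base_reward_fn, offset):
    def reward_fn(obs, acts, next_obs, dones):
        return base_reward_fn(obs, acts, next_obs, dones) + offset
    return reward_fn

# e.g. wrap the generator's environment:
# venv = RewardVecEnvWrapper(venv, shifted(reward_net.predict, offset=1.0))
```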

AdamGleave commented 1 year ago

This isn't really an imitation-specific issue and it's been inactive for a while, so I'm closing it. Feel free to open another issue if you find any specific bugs in our implementation!