HumanCompatibleAI / imitation

Clean PyTorch implementations of imitation and reward learning algorithms
https://imitation.readthedocs.io/
MIT License

Continue to train PPO after GAIL #691

Closed: Liuzy0908 closed this issue 1 year ago

Liuzy0908 commented 1 year ago

Problem

Hi, I'm excited to use this amazing project.

I have an idea about GAIL-PPO. GAIL has a generator network and a discriminator network, while PPO has an actor network and a critic network. So GAIL's generator network can be used as a pre-trained network for PPO's actor network.

However, once GAIL's generator (i.e. PPO's actor) is well trained, PPO's critic network is still in a randomly initialized state.

So how do I continue training PPO's actor and critic networks when I have just finished training GAIL's generator and discriminator? (i.e. is there some way to fine-tune PPO's critic network before continuing to train its actor network?)

Yours sincerely

Solution

Possible alternative solutions

Find some way to fine-tune PPO's critic network before continuing to train its actor network?
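
For example, a rough and untested sketch of what I have in mind. I am assuming SB3's default MlpPolicy, where the policy and value branches are separate, and that the GAIL trainer exposes the wrapped PPO learner as gen_algo; gail_trainer, venv and the timestep numbers are just placeholders:

# After gail_trainer.train(...) has finished:
learner = gail_trainer.gen_algo        # the SB3 PPO whose actor GAIL trained
learner.set_env(venv)                  # continue on the environment's true reward

# Freeze the actor-specific parameters so only the value branch receives
# gradients, letting the critic catch up before the actor moves again.
policy = learner.policy
actor_params = list(policy.mlp_extractor.policy_net.parameters()) + list(policy.action_net.parameters())
if hasattr(policy, "log_std"):
    actor_params.append(policy.log_std)  # continuous-action policies also learn a log-std
for p in actor_params:
    p.requires_grad = False
learner.learn(total_timesteps=50_000)    # critic warm-up phase

# Unfreeze and resume normal PPO training of both actor and critic.
for p in actor_params:
    p.requires_grad = True
learner.learn(total_timesteps=500_000, reset_num_timesteps=False)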

ernestum commented 1 year ago

I think this is better asked in the SB3 repo, since it is about using PPO rather than any of the imitation algorithms. However, I don't think this will work out of the box with SB3 either; you will probably have to tweak it a bit.

Liuzy0908 commented 1 year ago

Thank you for your response.

Do you have any experience with making simple modifications to the PPO actor's loss in imitation or SB3? Something like the following:

actor_loss_from_disc = -disc(exp_states, actor(exp_states)).mean()      # actor loss 1: negative discriminator score
actor_loss_from_critic = -critic(exp_states, actor(exp_states)).mean()  # actor loss 2: negative Q-value

# PPO actor loss: weighted sum of the two terms
PPO_actor_loss = weight * actor_loss_from_disc + (1 - weight) * actor_loss_from_critic
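
As a self-contained illustration of the weighted loss above (purely a sketch, not imitation or SB3 API; the actor, disc and critic modules, the batch and the weight are all placeholders, and note this is a deterministic-actor, DDPG-style update rather than PPO's clipped surrogate objective):

import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
disc = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))

optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)
weight = 0.5

exp_states = torch.randn(32, obs_dim)          # batch of (expert) states
actions = actor(exp_states)                    # actor's actions on those states
sa = torch.cat([exp_states, actions], dim=-1)  # state-action pairs

loss_from_disc = -disc(sa).mean()      # push actions the discriminator rates as expert-like
loss_from_critic = -critic(sa).mean()  # push actions with high estimated value

actor_loss = weight * loss_from_disc + (1 - weight) * loss_from_critic
optimizer.zero_grad()
actor_loss.backward()
optimizer.step()
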
ernestum commented 1 year ago

Hi @Liuzy0908. I would like to kindly ask you to move this question to the SB3 repository. We don't have the capacity to help with this here right now.