DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] Modify actor's loss for GAIL-PPO #1394

Closed Liuzy0908 closed 1 year ago

Liuzy0908 commented 1 year ago

❓ Question

Hi, I'm excited to use this amazing project.

I'm implementing GAIL-PPO. GAIL has a generator network and a discriminator network, while PPO has an actor network and a critic network, so GAIL's generator can be used as a pre-trained network for the PPO actor.

However, when GAIL's generator (i.e. PPO's actor) is well trained, PPO's critic network is still randomly initialized. Therefore, it would be harmful to use a randomly initialized critic network to train PPO's actor network.
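For concreteness, this is roughly what I mean by using the generator as a pre-trained actor (just a sketch: `pretrained_generator` and its `backbone`/`head` attributes are placeholders for my GAIL network, assumed to match the default `MlpPolicy` actor architecture):

```python
import torch as th
from stable_baselines3 import PPO

model = PPO("MlpPolicy", "Pendulum-v1", verbose=1)
pretrained_generator = ...  # placeholder: the GAIL generator trained beforehand

# Copy the generator weights into the actor part of the policy.
# With the default MlpPolicy, the actor is mlp_extractor.policy_net + action_net;
# the critic (mlp_extractor.value_net and value_net) stays randomly initialized.
with th.no_grad():
    model.policy.mlp_extractor.policy_net.load_state_dict(pretrained_generator.backbone.state_dict())
    model.policy.action_net.load_state_dict(pretrained_generator.head.state_dict())
```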

So my idea is to take a weighted sum of the loss from GAIL's discriminator and the loss from PPO's critic, and use this weighted sum as the loss for PPO's actor.

However, the loss computation in stable-baselines3 is not exposed, so how should I implement this modified loss?

Yours sincerely


araffin commented 1 year ago

Hello, you should probably have a look at https://github.com/HumanCompatibleAI/imitation. It also looks like the best option would be for you to fork SB3.
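For reference, the basic GAIL + SB3 setup in imitation looks roughly like this (a sketch based on the imitation documentation; exact module paths and arguments may differ between versions, and the expert `rollouts` have to be collected beforehand):

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

from imitation.algorithms.adversarial.gail import GAIL
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.util.networks import RunningNorm

venv = DummyVecEnv([lambda: gym.make("CartPole-v1")] * 8)
learner = PPO("MlpPolicy", venv, verbose=0)  # the SB3 PPO learner acts as the GAIL generator

reward_net = BasicRewardNet(
    venv.observation_space, venv.action_space, normalize_input_layer=RunningNorm
)
rollouts = ...  # placeholder: expert demonstrations (imitation Trajectory objects)

gail_trainer = GAIL(
    demonstrations=rollouts,
    demo_batch_size=1024,
    gen_replay_buffer_capacity=2048,
    n_disc_updates_per_round=4,
    venv=venv,
    gen_algo=learner,
    reward_net=reward_net,
)
gail_trainer.train(200_000)
```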

Liuzy0908 commented 1 year ago

Thank you for your timely response.

I had looked at imitation/GAIL before opening this issue, but as with the problem described above, I could not find a way to weight and combine the losses there.

Is there any way that I can simply modify the loss for PPO's actor network in SB3?

For example:

```python
# Actor loss 1: negative discriminator score for the actor's actions on expert states.
actor_loss_from_disc = -disc(exp_states, actor(exp_states)).mean()
# Actor loss 2: negative critic (Q) value for the actor's actions on expert states.
actor_loss_from_critic = -critic(exp_states, actor(exp_states)).mean()

# PPO actor loss: weighted sum of the two terms.
ppo_actor_loss = weight * actor_loss_from_disc + (1 - weight) * actor_loss_from_critic
```
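One way I can imagine wiring this into SB3 without patching the library is to subclass PPO. Below is a rough sketch (`GailPPO`, `discriminator`, and `disc_weight` are my own placeholder names, not SB3 API): it runs the normal PPO update and then takes an extra REINFORCE-style actor step weighted by the discriminator score, i.e. it alternates the two objectives rather than mixing them into a single loss:

```python
import torch as th
from gymnasium import spaces
from stable_baselines3 import PPO


class GailPPO(PPO):
    """PPO with an extra discriminator-driven actor step (sketch only).

    `discriminator` is a user-provided torch module mapping (obs, action) -> score;
    it is not part of SB3.
    """

    def __init__(self, *args, discriminator=None, disc_weight=0.5, **kwargs):
        super().__init__(*args, **kwargs)
        self.discriminator = discriminator
        self.disc_weight = disc_weight

    def train(self) -> None:
        # Standard PPO update: clipped surrogate actor loss + value loss.
        super().train()
        if self.discriminator is None:
            return
        # One extra actor step over the whole rollout buffer.
        for rollout_data in self.rollout_buffer.get(batch_size=None):
            actions = rollout_data.actions
            if isinstance(self.action_space, spaces.Discrete):
                actions = actions.long().flatten()
            _, log_prob, _ = self.policy.evaluate_actions(rollout_data.observations, actions)
            # Use the (detached) discriminator score as a per-sample weight so the
            # gradient flows through log_prob (policy-gradient surrogate).
            with th.no_grad():
                disc_scores = self.discriminator(rollout_data.observations, rollout_data.actions)
            actor_loss_from_disc = -(disc_scores.flatten() * log_prob).mean()
            loss = self.disc_weight * actor_loss_from_disc
            self.policy.optimizer.zero_grad()
            loss.backward()
            self.policy.optimizer.step()
```

To get the exact single weighted loss from the snippet above, I suppose the cleanest route is still the suggested fork: copy `PPO.train()` from `stable_baselines3/ppo/ppo.py` and change the line where the policy, entropy, and value losses are combined.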