cvlab-stonybrook / Scanpath_Prediction

Predicting Goal-directed Human Attention Using Inverse Reinforcement Learning (CVPR2020)
MIT License
97 stars 22 forks

Choice of loss function #15

Closed Doch88 closed 3 years ago

Doch88 commented 3 years ago

@ouyangzhibo I have a theoretical question about the choice of loss function for the generative part of the framework. Why did you use a standard minimax loss rather than a Wasserstein loss (like the GAIL in this paper)? Have you considered using this loss to improve training performance?

ouyangzhibo commented 3 years ago

Thanks for the question! I actually tried the WGAN loss with GAIL, but I failed to make it converge for some reason. You could definitely try it; it may require some tuning of the hyperparameters.
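For reference, the two discriminator objectives being compared in this thread look roughly like this (a minimal PyTorch sketch; the function and argument names are illustrative, not taken from this repo):

```python
import torch
import torch.nn.functional as F

def gail_discriminator_loss(expert_logits, policy_logits):
    """Standard minimax (cross-entropy) GAIL objective: push the
    discriminator's output toward 1 on expert samples and 0 on policy
    samples."""
    expert_loss = F.binary_cross_entropy_with_logits(
        expert_logits, torch.ones_like(expert_logits))
    policy_loss = F.binary_cross_entropy_with_logits(
        policy_logits, torch.zeros_like(policy_logits))
    return expert_loss + policy_loss

def wgan_critic_loss(expert_scores, policy_scores):
    """Wasserstein critic objective: maximize the score gap between
    expert and policy samples (written here as a loss to minimize)."""
    return policy_scores.mean() - expert_scores.mean()
```

The Wasserstein version drops the sigmoid/cross-entropy, so its gradients do not saturate when the critic separates the two distributions well, which is the usual motivation for trying it, but it requires a Lipschitz constraint (clipping or gradient penalty) to be meaningful.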

Doch88 commented 3 years ago

Yeah, I noticed that with WGAN-GP the losses explode, and I don't know why. Using a standard WGAN with weight clipping seems to work well for now.
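Weight clipping in the standard WGAN sense can be sketched as below (illustrative PyTorch; the tiny critic is a stand-in for the repo's discriminator, and the 0.01 clip range is the original WGAN paper's default, not necessarily what was used here):

```python
import torch
import torch.nn as nn

# Stand-in critic; the real discriminator in the repo is convolutional.
critic = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))

def clip_critic_weights(model, clip_value=0.01):
    """Clamp every parameter into [-clip_value, clip_value] after each
    critic update, crudely enforcing the Lipschitz constraint that the
    Wasserstein objective requires."""
    with torch.no_grad():
        for p in model.parameters():
            p.clamp_(-clip_value, clip_value)

clip_critic_weights(critic)
```

The clipping is applied after every critic optimizer step; too small a clip range starves the critic of capacity, too large a range breaks the Lipschitz constraint, which is why WGAN-GP was proposed as a replacement in the first place.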

ouyangzhibo commented 3 years ago

That's great! How does WGAN loss perform in the GAIL framework? Is it better than standard GAN loss?

Doch88 commented 3 years ago

I'm working in a different domain with a different dataset, so I haven't yet done a full training run on COCO-Search18 with the WGAN loss (one is running as I write this comment). However, with three small changes, it now seems to converge even with the WGAN loss with gradient penalty. Here's what I did:

  1. In the function compute_grad_pen() in gail.py, before using the gradient to compute the penalty, I flatten it with grad = grad.view(mixup_states.size(0), -1). This way the L2 norm is taken once per batch sample rather than per dimension of the input states, as before. I also changed the mixup part of the gradient-penalty calculation, because it originally did not work.
  2. Following what is done here, I added layer normalization after each Conv2D layer (except the last one) in the discriminator, and BatchNorm2D in the actor and critic networks.
  3. I disabled the normalization of advantages.
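Change (1), the per-sample flattening inside the gradient penalty, can be sketched roughly as follows (a hedged PyTorch sketch, not the repo's actual code: `discriminator`, the state tensors, and the lambda default are illustrative stand-ins):

```python
import torch
import torch.nn as nn

def compute_grad_pen(discriminator, expert_states, policy_states,
                     lambda_gp=100.0):
    """WGAN-GP penalty on states interpolated ("mixed up") between
    expert and policy samples."""
    batch = expert_states.size(0)
    # One interpolation coefficient per sample, broadcast over all
    # remaining state dimensions.
    alpha = torch.rand(batch, *([1] * (expert_states.dim() - 1)))
    mixup_states = alpha * expert_states + (1 - alpha) * policy_states
    mixup_states.requires_grad_(True)

    disc_out = discriminator(mixup_states)
    grad = torch.autograd.grad(
        outputs=disc_out,
        inputs=mixup_states,
        grad_outputs=torch.ones_like(disc_out),
        create_graph=True)[0]

    # The change described above: flatten to one gradient vector per
    # sample, so the L2 norm is per sample, not per input dimension.
    grad = grad.view(batch, -1)
    return lambda_gp * ((grad.norm(2, dim=1) - 1) ** 2).mean()
```

Without the `view`, `grad.norm(2, dim=1)` on a multi-dimensional state tensor would reduce over only one axis, penalizing per-dimension norms instead of the per-sample gradient norm that the WGAN-GP formulation calls for.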

As I said, I haven't done a full training run on COCO-Search18, but I have trained for a few epochs (~15) and these are the validation results so far: [image]. Compared with the run using the original loss it doesn't seem much better yet, but I haven't tried tuning the parameters to get better results. I'm using the same parameters as the JSON in the repository, except for the gradient-penalty lambda, which I set to 100. With 10 or 5 the gradient explodes; other values might give better results.