face-analysis / emonet

Official implementation of the paper "Estimation of continuous valence and arousal levels from faces in naturalistic conditions" by Antoine Toisoul, Jean Kossaifi, Adrian Bulat, Georgios Tzimiropoulos and Maja Pantic, Nature Machine Intelligence, 2021.
https://www.nature.com/articles/s42256-020-00280-0

knowledge distillation loss #3

Closed: segalinc closed this issue 3 years ago

segalinc commented 3 years ago

Hi, do you apply the KL divergence loss to both valence/arousal and the expression categories? Can you provide more details about it? For instance, what do you pass to the loss? Do you create distributions from the predictions, draw a sample, and pass that to the loss? Digitize the predictions before passing them to the loss? Or pass the predictions as they are? Did you also use a temperature parameter? Thank you

antoinetlc commented 3 years ago

Hello,

Thank you for your questions. I used distillation only for the categorical emotions, not for the valence and arousal values. I replaced the cross entropy loss for categorical emotions with a KL divergence term between the teacher and student predictions (after applying a softmax to both, in order to get probability distributions). The part of the loss that deals with valence and arousal stayed the same (see paper). I did not use a temperature parameter.
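For concreteness, a minimal PyTorch sketch of such a loss might look like the following (function and tensor names are illustrative, not taken from the released training code):

```python
import torch.nn.functional as F

def emotion_distillation_loss(student_logits, teacher_logits):
    """KL divergence between teacher and student categorical-emotion predictions."""
    # A softmax (no temperature) turns both sets of logits into probability distributions.
    # F.kl_div expects log-probabilities as its first argument and probabilities as its second.
    student_log_probs = F.log_softmax(student_logits, dim=1)
    teacher_probs = F.softmax(teacher_logits, dim=1).detach()  # no gradient through the teacher
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```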

From my experience, this worked great with the AffectNet dataset. However, it might be data dependent and there are other ways of doing the distillation, such as:

  • Keep the cross entropy between the student prediction and the label coming from the dataset, and add a KL divergence term between the student and teacher predictions as a regularization (in this case, add a coefficient in front of the KL divergence term to control the amount of regularization you want).
  • Add distillation for valence and arousal as well. In this case, I would use an L2 loss between the valence and arousal predictions of the teacher and student instead of the KL divergence, as this is a regression problem (a KL divergence with negative values for regression will break). Again, this distillation term could either replace the original loss with the label coming from the dataset or come on top of it as a regularization.

It is hard to know what will work better, as it is data dependent. I would recommend trying the options on a validation set and selecting the model that gives you the best accuracy there.
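A rough PyTorch sketch of the regularization variant described above, assuming hypothetical tensor names and weighting coefficients (kd_weight, va_weight) that are not part of the released code:

```python
import torch.nn.functional as F

def combined_loss(student_logits, teacher_logits, labels,
                  student_va, teacher_va, kd_weight=0.5, va_weight=1.0):
    # Supervised term: cross entropy between student predictions and dataset labels.
    ce = F.cross_entropy(student_logits, labels)

    # Regularization term: KL divergence between student and teacher emotion distributions.
    kd = F.kl_div(F.log_softmax(student_logits, dim=1),
                  F.softmax(teacher_logits, dim=1).detach(),
                  reduction="batchmean")

    # Valence/arousal distillation: an L2 (MSE) loss, since this is a regression problem.
    va = F.mse_loss(student_va, teacher_va.detach())

    return ce + kd_weight * kd + va_weight * va
```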

Hope it helps!

segalinc commented 3 years ago

Hi Antoine,

Thank you for the detailed reply, everything is much clearer now. In fact, it wasn't clear to me whether you applied it to valence/arousal, since those are regression outputs and a KL divergence there sounded odd.

Really appreciated

Cristina


nlml commented 1 year ago

Hi there. Not sure if you'll see this, but this aspect of your paper is still unclear to me.

You say you use knowledge distillation, but what are you distilling from? Did you train a teacher network first on AffectNet (i.e. no other datasets), and then use that network's predictions for the knowledge distillation loss component when training the student?

Thanks! Liam