fschmid56 / EfficientAT

This repository aims to provide efficient CNNs for Audio Tagging. We provide AudioSet pre-trained models ready for downstream training and extraction of audio embeddings.

Why do you use nn.BCEWithLogitsLoss for distillation loss? #18

Closed · SteveTanggithub closed this issue 1 year ago

SteveTanggithub commented 1 year ago

Why do you use nn.BCEWithLogitsLoss for the distillation loss, while other KD works use F.kl_div? Is it a special setting for the audio classification task?

fschmid56 commented 1 year ago

Thanks for the question.

We are tackling a multi-label task on AudioSet, meaning we predict an individual probability for each of the 527 classes, i.e., we solve 527 binary tasks. It is therefore convenient to use BCEWithLogitsLoss, since it assumes a binary task and combines the Sigmoid activation and BCELoss in a single class, which improves numerical stability.
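A minimal sketch of this setup (tensor shapes and variable names are illustrative, not the repository's exact training code): BCEWithLogitsLoss accepts soft targets in [0, 1], so the teacher's per-class sigmoid probabilities can serve directly as distillation targets.

```python
import torch
import torch.nn as nn

# Illustrative shapes: a batch of 8 clips, 527 AudioSet classes.
batch_size, n_classes = 8, 527
student_logits = torch.randn(batch_size, n_classes)
teacher_logits = torch.randn(batch_size, n_classes)

# The teacher's per-class probabilities act as soft binary targets.
teacher_probs = torch.sigmoid(teacher_logits)

# BCEWithLogitsLoss applies the sigmoid to the student logits internally,
# which is more numerically stable than Sigmoid followed by BCELoss.
distill_criterion = nn.BCEWithLogitsLoss()
distillation_loss = distill_criterion(student_logits, teacher_probs)
```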

Now, why cross-entropy (CE) instead of KL divergence?

The two are related, since CE = KL-div. + label entropy. Because the teacher's probabilities are fixed during student training, the label entropy term is a constant, so minimizing CE is equivalent to minimizing the KL divergence (the gradients with respect to the student logits are identical). See for example here for further details: https://stats.stackexchange.com/questions/357963/what-is-the-difference-between-cross-entropy-and-kl-divergence
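To make the relation concrete, here is a small sanity check (the values are arbitrary) showing that for a single binary class, the BCE between a fixed teacher probability p and a student prediction q equals KL(p || q) plus the teacher's binary entropy H(p), which does not depend on the student:

```python
import torch

# Toy check of CE = KL-div. + label entropy for one binary class.
p = torch.tensor(0.3)                 # fixed teacher probability
q = torch.sigmoid(torch.tensor(0.8))  # student probability from its logit

# Binary cross-entropy between teacher target p and student prediction q.
bce = -(p * torch.log(q) + (1 - p) * torch.log(1 - q))

# Binary KL divergence KL(p || q).
kl = p * torch.log(p / q) + (1 - p) * torch.log((1 - p) / (1 - q))

# Binary entropy of the teacher target: constant w.r.t. the student.
entropy = -(p * torch.log(p) + (1 - p) * torch.log(1 - p))

assert torch.allclose(bce, kl + entropy)  # CE = KL-div. + label entropy
```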