Closed: SteveTanggithub closed this issue 1 year ago
Why do you use nn.BCEWithLogitsLoss for the distillation loss, while other KD works use F.kl_div? Is this a special setting for the audio classification task?

Thanks for the question.
We are tackling a multi-label task on AudioSet, meaning that we predict an individual probability for each of the 527 classes, i.e. we solve 527 binary tasks. It is therefore convenient to use BCEWithLogitsLoss, as it assumes a binary task and combines the Sigmoid activation and BCELoss in a single class, which improves numerical stability.
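For concreteness, here is a minimal sketch (not the exact training code from this repo) of using BCEWithLogitsLoss as a distillation loss over 527 independent binary targets; the tensor names and shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Illustrative shapes: a batch of 4 clips, 527 AudioSet classes.
batch_size, num_classes = 4, 527

student_logits = torch.randn(batch_size, num_classes)  # raw student outputs
teacher_logits = torch.randn(batch_size, num_classes)  # raw teacher outputs

# BCEWithLogitsLoss fuses Sigmoid + BCELoss for numerical stability and
# accepts soft targets in [0, 1], so the teacher's per-class sigmoid
# probabilities can serve directly as distillation targets.
criterion = nn.BCEWithLogitsLoss()
soft_targets = torch.sigmoid(teacher_logits)  # independent per-class probabilities
distill_loss = criterion(student_logits, soft_targets)
print(distill_loss)
```

Note that, unlike CrossEntropyLoss, BCEWithLogitsLoss does not require the targets to sum to 1 across classes, which is exactly what a multi-label setup needs.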
Now, why CE instead of KL divergence?
The two are directly related, since cross-entropy equals KL divergence plus the entropy of the target distribution: CE(p, q) = KL(p || q) + H(p). Because the target p (here, the teacher's output) is fixed, H(p) is constant with respect to the student's predictions, so minimizing CE and minimizing KL divergence yield the same gradients. See for example here for further details: https://stats.stackexchange.com/questions/357963/what-is-the-difference-between-cross-entropy-and-kl-divergence
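As a quick numerical check (a standalone sketch, not code from this repo), the identity CE(p, q) = KL(p || q) + H(p) can be verified directly with F.kl_div:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
p = torch.softmax(torch.randn(5), dim=0)  # fixed target distribution (e.g. teacher)
q = torch.softmax(torch.randn(5), dim=0)  # model's predicted distribution

ce = -(p * q.log()).sum()                   # cross-entropy H(p, q)
kl = F.kl_div(q.log(), p, reduction='sum')  # KL(p || q); input must be log-probs
h = -(p * p.log()).sum()                    # entropy H(p)

print(ce.item(), (kl + h).item())  # equal up to floating-point error
```

Since H(p) does not depend on q, the two losses differ only by a constant and produce identical gradients for the student.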