Closed: SteveTanggithub closed this issue 1 year ago
Why do you use nn.BCEWithLogitsLoss for the distillation loss, while other KD works use F.kl_div? Is this a special setting for the audio classification task?

Thanks for the question.
We are tackling a multi-label task on AudioSet, meaning that we predict an individual probability for each of the 527 classes, i.e. we solve 527 binary tasks. It is therefore convenient to use BCEWithLogitsLoss, as it assumes a binary task and combines the Sigmoid activation and BCELoss in a single class, which improves numerical stability.
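For concreteness, here is a minimal sketch (not the exact training code from this repo) of using BCEWithLogitsLoss as a distillation loss over 527 independent binary targets; the tensor names and shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Illustrative shapes: a batch of 4 clips, 527 AudioSet classes.
batch_size, num_classes = 4, 527

student_logits = torch.randn(batch_size, num_classes)  # raw student outputs
teacher_logits = torch.randn(batch_size, num_classes)  # raw teacher outputs

# BCEWithLogitsLoss fuses Sigmoid + BCELoss for numerical stability and
# accepts soft targets in [0, 1], so the teacher's per-class sigmoid
# probabilities can serve directly as distillation targets.
criterion = nn.BCEWithLogitsLoss()
soft_targets = torch.sigmoid(teacher_logits)  # independent per-class probabilities
distill_loss = criterion(student_logits, soft_targets)
print(distill_loss)
```

Note that, unlike CrossEntropyLoss, BCEWithLogitsLoss does not require the targets to sum to 1 across classes, which is exactly what a multi-label setup needs.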
Now, why CE instead of KL divergence?
The two are directly related, since cross-entropy equals KL divergence plus the entropy of the target distribution: CE(p, q) = KL(p || q) + H(p). Because the target p (here, the teacher's output) is fixed, H(p) is constant with respect to the student's predictions, so minimizing CE and minimizing KL divergence yield the same gradients. See for example here for further details: https://stats.stackexchange.com/questions/357963/what-is-the-difference-between-cross-entropy-and-kl-divergence
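As a quick numerical check (a standalone sketch, not code from this repo), the identity CE(p, q) = KL(p || q) + H(p) can be verified directly with F.kl_div:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
p = torch.softmax(torch.randn(5), dim=0)  # fixed target distribution (e.g. teacher)
q = torch.softmax(torch.randn(5), dim=0)  # model's predicted distribution

ce = -(p * q.log()).sum()                   # cross-entropy H(p, q)
kl = F.kl_div(q.log(), p, reduction='sum')  # KL(p || q); input must be log-probs
h = -(p * p.log()).sum()                    # entropy H(p)

print(ce.item(), (kl + h).item())  # equal up to floating-point error
```

Since H(p) does not depend on q, the two losses differ only by a constant and produce identical gradients for the student.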