Following the expression of the CE or BCE loss, the loss tends to zero if the pseudo label is zero for all classes. This means the sample is effectively ignored by the base model rather than used to further optimize it.
To solve the bias over-estimation problem, we present a simple solution at https://ieeexplore.ieee.org/abstract/document/10027464/
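To make that concrete, here is a minimal PyTorch sketch (hypothetical shapes and names, not code from this repository) showing that with an all-zero pseudo label the soft-label CE loss is zero and contributes no gradient:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 10, requires_grad=True)  # base model scores f(X), hypothetical shape
pseudo = torch.zeros(1, 10)                      # pseudo label -grad L(H_m): zero for all classes

# soft-label CE: -sum_j p_j * log softmax(f(X))_j
loss = -(pseudo * F.log_softmax(logits, dim=-1)).sum()
loss.backward()

print(loss.item())                     # 0.0 -> the loss vanishes
print(logits.grad.abs().sum().item())  # 0.0 -> no gradient flows; the sample is ignored
```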
Yes, I understand that for the CE loss $\mathcal{L}(f(X),-\nabla \mathcal{L}(\mathcal{H}_m))=-\sum_j (-\nabla \mathcal{L}(\mathcal{H}_m))_j \log(\sigma(f(X)_j))$, the negative gradient tends to zero and the loss will be zero.
But for the BCE loss $\mathcal{L}(f(X),-\nabla \mathcal{L}(\mathcal{H}_m))=-\sum_j \big[(-\nabla \mathcal{L}(\mathcal{H}_m))_j \log(\sigma(f(X)_j)) + \big(1-(-\nabla \mathcal{L}(\mathcal{H}_m))_j\big)\log\big(1-\sigma(f(X)_j)\big)\big]$, the first term tends to zero, but the second term does not.
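A small numerical sketch of this difference (assumed shapes and values, not the repository's code): with an all-zero pseudo label the CE loss collapses to zero, while the BCE loss keeps a nonzero second term that still pushes every class score down.

```python
import torch

logits = torch.randn(1, 10)  # f(X), hypothetical scores
p = torch.zeros(1, 10)       # pseudo label -grad L(H_m), zero for all classes
sig = torch.sigmoid(logits)

# CE with the soft pseudo label: every term is multiplied by p_j, so it vanishes
ce = -(p * torch.log_softmax(logits, dim=-1)).sum()

# BCE: the first term vanishes, but (1 - p_j) * log(1 - sigma(f(X)_j)) does not
bce = -(p * torch.log(sig) + (1 - p) * torch.log(1 - sig)).sum()

print(ce.item())   # 0.0
print(bce.item())  # > 0: still penalizes every class, pushing sigma(f(X)_j) toward 0
```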
Under BCE loss, this remaining second term can still encourage the model to lower the biased estimate, but it introduces extra bias (bias over-estimation). That is why the relaxed form (issue #5) works better than the actual gradient under BCE loss.
OK, thanks for your reply! I didn't understand how the negative gradients worked at the time, which is why I went back and read GGE; now that I've figured it out, it's a really neat way of addressing the bias problem.
The negative gradient of the biased models' loss $-\nabla \mathcal{L}(\mathcal{H}_m)$ is taken as the pseudo label for the base model. Since $-\nabla \mathcal{L}(\mathcal{H}_m) = y_i-\sigma(\mathcal{H}_m)$ can be relatively small when a sample is easy for the biased models to fit, how can the base model $f(X;\theta)$ pay more attention to samples that are hard to solve for the biased classifiers $\mathcal{H}_m$?
In my view, if the negative gradient of the biased model tends to zero for every class $i$, the pseudo supervision for the base model becomes a zero vector, which forces the model to classify the sample into an empty class.
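A toy illustration of the quantity in question (hypothetical values, not taken from the paper): the pseudo label $-\nabla \mathcal{L}(\mathcal{H}_m) = y - \sigma(\mathcal{H}_m)$ shrinks toward zero on samples the biased model already fits well, and stays large on samples it fails on.

```python
import torch

y = torch.tensor([0., 0., 1., 0.])                   # one-hot ground truth y_i

easy_sigma = torch.tensor([0.01, 0.02, 0.95, 0.02])  # biased model already fits this sample
hard_sigma = torch.tensor([0.70, 0.10, 0.05, 0.15])  # biased model fails on this sample

print(y - easy_sigma)  # ~[-0.01, -0.02, 0.05, -0.02]: near-zero pseudo label for f(X)
print(y - hard_sigma)  # [-0.70, -0.10, 0.95, -0.15]: large pseudo label, base model focuses here
```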