Following the expression of the CE or BCE loss, the loss tends to zero if the pseudo label is zero for all classes. This means the sample is effectively ignored by the base model rather than used to further optimize it.
To solve the bias over-estimation problem, we present a simple solution at https://ieeexplore.ieee.org/abstract/document/10027464/
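To make that concrete, here is a minimal PyTorch sketch (hypothetical shapes and names, not code from this repository) showing that with an all-zero pseudo label the soft-label CE loss is zero and contributes no gradient:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 10, requires_grad=True)  # base model scores f(X), hypothetical shape
pseudo = torch.zeros(1, 10)                      # pseudo label -grad L(H_m): zero for all classes

# soft-label CE: -sum_j p_j * log softmax(f(X))_j
loss = -(pseudo * F.log_softmax(logits, dim=-1)).sum()
loss.backward()

print(loss.item())                     # 0.0 -> the loss vanishes
print(logits.grad.abs().sum().item())  # 0.0 -> no gradient flows; the sample is ignored
```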
Yes, I understand that for the CE loss $\mathcal{L}(f(X),-\nabla \mathcal{L}(\mathcal{H}_m))=-\sum_j (-\nabla \mathcal{L}(\mathcal{H}_m))_j \log(\sigma(f(X)_j))$, the negative gradient tends to zero and the loss will be zero.
But for the BCE loss $\mathcal{L}(f(X),-\nabla \mathcal{L}(\mathcal{H}_m))=-\sum_j \big[(-\nabla \mathcal{L}(\mathcal{H}_m))_j \log(\sigma(f(X)_j)) + \big(1-(-\nabla \mathcal{L}(\mathcal{H}_m))_j\big)\log\big(1-\sigma(f(X)_j)\big)\big]$, the first term tends to zero, but the second term does not.
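A small numerical sketch of this difference (assumed shapes and values, not the repository's code): with an all-zero pseudo label the CE loss collapses to zero, while the BCE loss keeps a nonzero second term that still pushes every class score down.

```python
import torch

logits = torch.randn(1, 10)  # f(X), hypothetical scores
p = torch.zeros(1, 10)       # pseudo label -grad L(H_m), zero for all classes
sig = torch.sigmoid(logits)

# CE with the soft pseudo label: every term is multiplied by p_j, so it vanishes
ce = -(p * torch.log_softmax(logits, dim=-1)).sum()

# BCE: the first term vanishes, but (1 - p_j) * log(1 - sigma(f(X)_j)) does not
bce = -(p * torch.log(sig) + (1 - p) * torch.log(1 - sig)).sum()

print(ce.item())   # 0.0
print(bce.item())  # > 0: still penalizes every class, pushing sigma(f(X)_j) toward 0
```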
Under BCE loss, this remaining second term can still encourage the model to lower the biased estimate, but it introduces extra bias (bias over-estimation). That is why the relaxed form (issue #5) works better than the actual gradient under BCE loss.
OK, thanks for your reply! I didn't understand how the negative gradients worked at the time, which is why I went back and read GGE; now that I've figured it out, it's a really neat way of addressing the bias problem.
The negative gradient of the biased models' loss $-\nabla \mathcal{L}(\mathcal{H}_m)$ is taken as the pseudo label for the base model. Since $-\nabla \mathcal{L}(\mathcal{H}_m) = y_i-\sigma(\mathcal{H}_m)$ can be relatively small when a sample is easy for the biased models to fit, how can the base model $f(X;\theta)$ pay more attention to samples that are hard to solve for the biased classifiers $\mathcal{H}_m$?
In my view, if the negative gradient of the biased model tends to zero for every class $i$, the pseudo supervision for the base model becomes a zero vector, which forces the model to classify the sample into an empty class.
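A toy illustration of the quantity in question (hypothetical values, not taken from the paper): the pseudo label $-\nabla \mathcal{L}(\mathcal{H}_m) = y - \sigma(\mathcal{H}_m)$ shrinks toward zero on samples the biased model already fits well, and stays large on samples it fails on.

```python
import torch

y = torch.tensor([0., 0., 1., 0.])                   # one-hot ground truth y_i

easy_sigma = torch.tensor([0.01, 0.02, 0.95, 0.02])  # biased model already fits this sample
hard_sigma = torch.tensor([0.70, 0.10, 0.05, 0.15])  # biased model fails on this sample

print(y - easy_sigma)  # ~[-0.01, -0.02, 0.05, -0.02]: near-zero pseudo label for f(X)
print(y - hard_sigma)  # [-0.70, -0.10, 0.95, -0.15]: large pseudo label, base model focuses here
```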