In theory, as long as the pseudo label has a negative correlation with the bias model prediction, it is able to mine the hard examples.
The wrong gradient in the paper is actually an approximation of $\nabla \mathcal{H}_i$. That's why it still works well.
What is the reasoning behind the statement "In theory, as long as the pseudo label has a negative correlation with the bias model prediction, it is able to mine the hard examples."?
Sorry for the wrong derivation of the negative gradient for Sigmoid+BCE loss. The correct negative gradient is
$$ \nabla \mathcal{H}_i= y_i - \sigma(\mathcal{H}_i) $$
In theory, as long as the pseudo label has a negative correlation with the bias model prediction, it is able to mine the hard examples. The wrong gradient in the paper is actually an approximation of $\nabla \mathcal{H}_i$. That's why it still works well.
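As a quick sanity check, the sketch below compares the closed form $y_i - \sigma(\mathcal{H}_i)$ against a finite-difference estimate of the negative gradient of the Sigmoid+BCE loss. It also shows that, for a positive label, the resulting pseudo label shrinks as the bias model's logit grows, which is the negative correlation mentioned above. The function names and the sample logits are illustrative assumptions, not taken from the paper or its code.

```python
import math

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def bce_with_logit(h, y):
    # Binary cross-entropy applied to sigmoid(h), for one logit h and label y.
    p = sigmoid(h)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def negative_gradient(h, y):
    # Closed form of -dL/dh for Sigmoid+BCE: y - sigmoid(h).
    return y - sigmoid(h)

def numeric_negative_gradient(h, y, eps=1e-6):
    # Central finite difference of the loss, negated.
    return -(bce_with_logit(h + eps, y) - bce_with_logit(h - eps, y)) / (2 * eps)

# Hypothetical bias-model logits, from "confidently wrong" to "confidently right".
logits = [-3.0, -1.0, 0.0, 1.0, 3.0]

for y in (0.0, 1.0):
    for h in logits:
        assert abs(negative_gradient(h, y) - numeric_negative_gradient(h, y)) < 1e-5

# Pseudo labels for a positive example (y = 1) at each bias logit.
pseudo_labels = [negative_gradient(h, 1.0) for h in logits]
for h, t in zip(logits, pseudo_labels):
    print(f"bias logit {h:+.1f} -> pseudo label {t:.3f}")
```

Under these assumptions, a positive example that the bias model scores low (a hard example) receives a pseudo label close to 1, while one it already scores high receives a pseudo label close to 0.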