facebookresearch / unbiased-teacher-v2

PyTorch code for the CVPR 2022 paper "Unbiased Teacher v2: Semi-supervised Object Detection for Anchor-free and Anchor-based Detectors"

Pseudo-loss normalization #7

Open · pscollins opened this issue 1 year ago

pscollins commented 1 year ago

I've been trying to train this model from scratch (without ImageNet pretrained weights) on a custom dataset, using the FCOS backbone. I ran a very long burn-in (100k steps) to compensate for the missing pretraining and saw good gains in AP (~25%), but the AP would quickly collapse to ~0 or NaN once the semi-supervised portion of training began.

After some debugging, I found that the pseudo-losses were much larger than the plain (supervised) losses: in particular, teacher_better_student_pseudo was ~1000x larger than the other losses, and pseudo_loss_fcos_cls was ~10x larger. Since these losses only kick in when the number of positive predictions is > 0, this seems to exert huge pressure on the teacher to never make any positive predictions, which it stops doing after a couple of thousand steps.
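
For anyone trying to reproduce this, here is roughly the diagnostic I used (a minimal sketch, assuming a detectron2-style dict of scalar loss tensors in the training step; the assumption that pseudo-loss keys contain the substring "pseudo" is just based on the key names in my logs):

```python
import torch

def log_loss_scales(loss_dict, step, every=100):
    # Print each loss term relative to the largest supervised term, so a
    # scale mismatch between supervised and pseudo losses is easy to spot.
    if step % every != 0:
        return
    scales = {k: float(v.detach()) for k, v in loss_dict.items() if torch.is_tensor(v)}
    # Use the largest non-pseudo (supervised) term as the reference scale.
    base = max((v for k, v in scales.items() if "pseudo" not in k), default=1.0)
    for k, v in sorted(scales.items(), key=lambda kv: -kv[1]):
        print(f"step {step}: {k} = {v:.4f} ({v / max(base, 1e-8):.0f}x supervised)")
```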

I found that completely disabling teacher_better_student_pseudo and reducing SEMISUPNET.UNSUP_LOSS_WEIGHT to 0.01 prevented these issues and allowed the model to benefit from the unlabeled data. I wanted to flag it here since it took me a while to figure out, and it seems likely to affect other people working with this model.
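
Concretely, the workaround amounted to something like this (a sketch against the repo's detectron2-style config and training step, not a tested patch; `cfg` and `loss_dict` are the usual objects, and the loss key name is the one from my logs):

```python
# Dial down the weight applied to all unsupervised (pseudo) losses.
cfg.SEMISUPNET.UNSUP_LOSS_WEIGHT = 0.01

# I didn't find a config switch for the teacher_better_student_pseudo term,
# so I dropped it where the loss dict is assembled, before losses are summed.
# The key name below may differ in your build; check your own logs.
loss_dict.pop("teacher_better_student_pseudo", None)
losses = sum(loss_dict.values())
```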

Did you encounter any of these sorts of issues? Are there any other workarounds you would recommend? Is this potentially a side effect of a low-quality backbone, since I didn't use the ImageNet weights?

Should there be a term somewhere to normalize the pseudo-losses by the number of positive pseudo-labels, or something like that? In particular, teacher_better_student_pseudo seems like it could benefit from being normalized, since it scales with both the batch size and the number of positive pseudo-labels.
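
Something along these lines is what I have in mind; a minimal sketch of the idea only (the function and argument names are made up, not from this repo):

```python
import torch

def normalize_pseudo_loss(per_location_loss: torch.Tensor,
                          pos_mask: torch.Tensor) -> torch.Tensor:
    # per_location_loss: unreduced loss over all locations in the batch.
    # pos_mask: boolean mask marking positive pseudo-labels.
    num_pos = pos_mask.sum().clamp(min=1)  # avoid division by zero
    return per_location_loss[pos_mask].sum() / num_pos
```

That way the term stays roughly constant in scale whether the teacher emits 5 positives or 500, instead of growing with both the batch size and the positive count.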