I tried to train your ViT implementation and several other backbones (ConvNeXt, MaxViT, NFNet, CoAtNet, etc.) with the ArcFace loss, but neither the loss nor the accuracy converges. The loss either stagnates at around 20 or collapses to NaN (with the default learning rate of 0.1 and the SGD optimiser). The same backbones converge properly when trained with the CosFace loss.
The ResNet backbones, however, perform well when trained with ArcFace.
Any insights into why these two losses behave so differently even though they are intuitively very similar, and how to get these backbones to converge with ArcFace?
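For context, here is a minimal NumPy sketch (not the repo's exact code) of how I understand the two margins differ. CosFace subtracts the margin in cosine space and is always well-defined, while ArcFace adds the margin in angle space via `arccos`, whose domain is [-1, 1]; a cosine value nudged slightly outside that range by floating-point error yields NaN unless it is clamped first. The function names and the `clip` flag are my own illustration:

```python
import numpy as np

def cosface_logit(cos_theta, m=0.35):
    # CosFace: additive margin in cosine space, defined for any input
    return cos_theta - m

def arcface_logit(cos_theta, m=0.5, clip=True):
    # ArcFace: additive margin in angle space, cos(theta + m)
    if clip:
        # Guard the arccos domain against floating-point drift
        cos_theta = np.clip(cos_theta, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return np.cos(theta + m)

# A cosine nudged past 1.0 by numerical error breaks the unclamped version
bad = np.array([1.0 + 1e-7])
print(arcface_logit(bad, clip=False))  # NaN without the clamp
print(arcface_logit(bad))              # finite with the clamp
```

Is this kind of numerical blow-up (or the non-monotonicity of cos(theta + m) for large angles) the likely culprit here, or is it something specific to how the non-ResNet backbones scale their features?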
Any help would be highly appreciated @anxiangsir