deepinsight / insightface

State-of-the-art 2D and 3D Face Analysis Project
https://insightface.ai

ArcFace loss inputs not clear #2057

Open · Arksyd96 opened this issue 2 years ago

Arksyd96 commented 2 years ago

Hello! I know this might be a stupid question, but I'm really struggling to understand.

I have an implementation of a ViT and I want to train it with ArcFace loss on a small dataset, just to test it and nothing more. I took your implementation of the ArcFace loss, except that I don't understand what the inputs (logits, labels) are.

I guess it's the image embedding inferred by the ViT and its class (one-hot encoded), but this does not work! I could download the code and run it to understand, but this implementation takes a lot of time to run and the data are huge.

Thanks in advance!

jacqueline-weng commented 2 years ago

'logits' are the probabilities of the current image belonging to the different classes (one identity is one class). You may check the ArcFace paper for details: https://arxiv.org/abs/1801.07698
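For reference, the loss in that paper adds an angular margin $m$ to the target-class angle before the softmax, where $s$ is a scale factor and $\theta_j$ is the angle between the embedding and the weight vector of class $j$:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}$$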

Arksyd96 commented 2 years ago

So you mean that the output of the model should be a softmax and not an embedding? I fixed the output of my ViT at 512, thinking that what I need in the end is an embedding, right? But if I give ArcFace an embedding of size 512 and logits over 80,000 classes, will it work?

Arksyd96 commented 2 years ago

You say logits are the probabilities of the different classes, but according to this line in the ArcFace loss function, target_logit = logits[index, labels[index].view(-1)], logits are a 2-or-more-dimensional tensor. Makes no sense to me!

jacqueline-weng commented 2 years ago

> So you mean that the output of the model should be a softmax and not an embedding? I fixed the output of my ViT at 512, thinking that what I need in the end is an embedding, right? But if I give ArcFace an embedding of size 512 and logits over 80,000 classes, will it work?

The output of the backbone (ViT or R50) is an embedding (a feature vector), with which you can compute the distance between two vectors to measure the similarity of two faces. The logits, the output of the Partial FC module in the code, are computed from the embedding and indicate how likely the image is to belong to each class. The instruction target_logit = logits[index, labels[index].view(-1)] filters out the invalid labels and their corresponding images in the batch, if I remember correctly. So 'logits' clearly has a shape of (batch_size, num_classes_on_this_card), which is consistent with what I implied.
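Roughly, the flow looks like this (a sketch for illustration, not the project's exact code; the sizes and the random weight matrix are made up):

import torch
import torch.nn.functional as F

batch_size, embed_dim, num_classes = 4, 512, 10

embeddings = torch.randn(batch_size, embed_dim)   # backbone output, one vector per image
weight = torch.randn(num_classes, embed_dim)      # classifier weights (what Partial FC shards)

# cosine logits: normalize both sides, then a plain linear map
logits = F.linear(F.normalize(embeddings), F.normalize(weight))  # (batch_size, num_classes)

labels = torch.tensor([3, 7, 0, 3])               # one class index per image, not one-hot

# the line you quoted: keep samples with a valid label, then pick each
# sample's logit at its own (target) class
index = torch.where(labels != -1)[0]
target_logit = logits[index, labels[index].view(-1)]
print(target_logit.shape)                         # torch.Size([4])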

Arksyd96 commented 2 years ago

OK, so if I understood, logits are class probabilities: let's suppose 1000 classes, for example, so logits.shape = (batch_size, 1000)? And labels is the embedding? Why did they call it labels? And still, even when I try what you're saying, I get errors.

This simple code, for example, doesn't work. The logits are not probabilities here, they are a one-hot encoding, but that should still behave the same way as probabilities:

import torch
from losses import ArcFace  # the repo's loss

batch_size = 2
num_classes = 10
embed_dim = 512

# (my attempt) one-hot class vectors passed as "logits"
logits = torch.randint(0, num_classes, (batch_size,))
logits = torch.nn.functional.one_hot(logits, num_classes)

# (my attempt) embeddings passed as "labels"
labels = torch.randn(batch_size, embed_dim)

loss = ArcFace(s=64.0, margin=0.5)
loss_value = loss(logits, labels)  # this is the call that fails
loss_value

By the way, I don't use Partial_FC. I'm looking for the plain loss for training on a single GPU. All I want is to get the code running for a project.

jacqueline-weng commented 2 years ago
import torch
from losses import ArcFace

batch_size = 2
num_classes = 10

# logits: one score per class for each image, shape (batch_size, num_classes)
logits = torch.randn((batch_size, num_classes))
# labels: the correct class index of each image, NOT an embedding
labels = torch.tensor([[5], [6]], dtype=torch.long)

ArcFaceLoss = ArcFace()
loss = ArcFaceLoss(logits, labels)
print(loss)
  1. Partial FC is only a classification method. Even if you run on a single card, you still need a classifier to turn your embeddings into classification results (logits); see the sketch just below this list.
  2. After classification, you only have logits and labels. In this case you have 2 images in a batch, so you have 2 labels (class 5 for the first image and class 6 for the second). I don't know why you think the label has shape (batch_size, embedding_dimension); labels have no relationship with embeddings.
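A minimal single-GPU version could look like this. It is only a sketch: the stand-in backbone is for illustration, and it assumes the ArcFace module from the repo's losses.py, which (in that implementation) returns margin-adjusted logits rather than a scalar, so cross-entropy is applied on top:

import torch
import torch.nn as nn
import torch.nn.functional as F
from losses import ArcFace  # the repo's margin module

num_classes, embed_dim = 10, 512
backbone = nn.Linear(3 * 112 * 112, embed_dim)  # stand-in for your ViT

# the classifier that Partial FC implements in a sharded way: a plain weight
# matrix mapping embeddings to per-class cosine scores
class_weight = nn.Parameter(torch.empty(num_classes, embed_dim))
nn.init.normal_(class_weight, std=0.01)

arcface = ArcFace(s=64.0, margin=0.5)

images = torch.randn(2, 3 * 112 * 112)
labels = torch.tensor([5, 6], dtype=torch.long)

embeddings = backbone(images)                                          # (2, 512)
logits = F.linear(F.normalize(embeddings), F.normalize(class_weight))  # (2, 10)

margin_logits = arcface(logits, labels)        # margin applied at each target class
loss = F.cross_entropy(margin_logits, labels)  # skip this step if your ArcFace
                                               # version already returns a scalar loss
loss.backward()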
Arksyd96 commented 2 years ago

First of all, I want to thank you for the time you're giving to answer me, especially since the timezone is different between us, haha.

I thought Partial FC (which, honestly, I did not read carefully) was a strategy to make training faster; I didn't think of it as a component of the network. So if I want to remove Partial FC, I need to add an extra MLP head on top of their backbone, I guess? I mean, I'm not interested in using Partial FC.

"I don't know why you think label is of shape (batchsize, embedding_dimension)": because generally, in this kind of application, we want a final representation of an image as an embedding, which we store in a database. Then all we do is compute some distance (like cosine) between two embeddings to see whether they are similar. And since the output of the ViT backbone they coded is a 512-feature vector, it strengthened that idea in my head and confused me, I guess.
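For example, verification in our setting reduces to something like this (a minimal sketch; the threshold is a placeholder):

import torch
import torch.nn.functional as F

emb_a = torch.randn(512)  # stored embedding of face A
emb_b = torch.randn(512)  # embedding of the query face B

# cosine similarity in [-1, 1]; "same person" above a tuned threshold
similarity = F.cosine_similarity(emb_a, emb_b, dim=0)
is_same = similarity > 0.3  # 0.3 is just a placeholder, tune on a validation set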

Thank you, now I see things more clearly, but I have one last thing to ask, haha:

jacqueline-weng commented 2 years ago

First of all, answering questions helps me better understand the whole thing, so I'm willing to share and discuss to learn more. Still, I think you should spend time reading the code and the paper to understand the function of each component; it will surely save you time.

  1. Partial FC is only a classification method, and you can replace it with any classifier.
  2. I understand the 'label' concept in your application. But since this discussion is in the context of this project, label simply refers to the correct identity of a face image.
  3. ViT is only one of the backbones; backbones can be changed easily via the config files. Even though you directly use embeddings as output, you still need to compare similarity (cosine distance) between embeddings, which I think is just another form of 'logits'.
  4. For technical issues, you'd better open a new issue with the error message details, your working environment, etc.
Arksyd96 commented 2 years ago

Thank you, and yes, I agree that it's better to read the paper. The thing is I don't have much time, haha; I had a deadline, so I started trying to implement the model directly. And as I said, I thought ArcFace was a classic distance metric like cosine, but it seems it's not exactly the same.

My goal is to compare ViT and ResNet50 in terms of Equal Error Rate (EER) and inference speed. We have an EER of 1.7% with a ResNet50+ArcFace, and we're trying to implement a ViT to compare the backbones; I honestly don't think ViT is better at this kind of task. What do you think? I thought going with a fine-tuned ResNet to reduce the loss a bit more would be better than a ViT (the first one was trained from scratch).

It would be great if you had an idea about this; it could help me finish this project. The authors forgot to share the ViT weights 🐌
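For context, EER is the operating point where the false accept rate equals the false reject rate; given similarity scores for genuine and impostor pairs, we estimate it roughly like this (a sketch with made-up scores, using scikit-learn's roc_curve):

import numpy as np
from sklearn.metrics import roc_curve

# 1 = genuine pair (same identity), 0 = impostor pair; scores = cosine similarities
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores = np.array([0.82, 0.75, 0.44, 0.31, 0.52, 0.12, 0.68, 0.40])

fpr, tpr, thresholds = roc_curve(y_true, scores)
fnr = 1 - tpr
idx = np.nanargmin(np.abs(fnr - fpr))  # point where FAR ≈ FRR
eer = (fpr[idx] + fnr[idx]) / 2
print(f"EER ≈ {eer:.3f} at threshold {thresholds[idx]:.2f}")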

jacqueline-weng commented 2 years ago

The performance of ViT and R50 is affected by many variables.

  1. First, to make both trainings converge, you need a proper, often large, batch size, which may not be easy to obtain if you don't have good machines.
  2. Also, ViT stands for a series of networks. To my knowledge, ViT-T has only about half the parameters of R50; you may need to measure the FLOPs and the size of each network to be fair (see the sketch after this list).
  3. Performance also depends on the size of the training dataset; ViT claims better accuracy on larger datasets.
  4. ArcFace is among the losses most difficult to converge, due to its angular-margin nature.
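For point 2, a quick way to compare model sizes (a sketch; resnet50 here is just a stand-in for whichever backbone you load, and packages like thop or ptflops can additionally report FLOPs):

import torch
from torchvision.models import resnet50

model = resnet50()  # swap in your ViT or R50 backbone

num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")  # resnet50 is about 25.6M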