Did anyone try to use ArcLoss without alignment?

borisgribkov commented 5 years ago

Hi all, I hope someone has tried this. I have trained ResNet-18 with Softmax, CenterLoss and finally ArcFace. VGG2 was used as training data, and no alignment step was done because we use another face detection approach. VGG2 test results are the following: 68% for Softmax, about 89% for CenterLoss (more than a 20% boost) and surprisingly only 70% for ArcFace. I guess this is caused by the missing alignment step, and for some reason this step is extremely important for ArcFace and not so important for CenterLoss. But it's just an assumption. Does anyone have experience in training with non-aligned images? Thank you!

nttstar commented 5 years ago

How did you measure the accuracy? Is it closed-set classification top-1 accuracy?

borisgribkov commented 5 years ago

Yes, that is right, top-1. I used the VGG2 test set (500 persons, about 300 images per person). I have also calculated histograms of the distances between persons' centroids (an averaged vector for each person) and of the distances within each person's cluster, and I see that the histogram intersection is much lower for CenterLoss, for example. That is why I think that modern angular loss functions (like ArcFace, CosFace etc.) may be suitable for aligned data only. Also, I have never seen anyone use them without alignment.

borisgribkov commented 5 years ago

I guess I have found the solution: the training process is very sensitive to the M parameter. Will provide the plot next week.

Not-IITian commented 5 years ago

It would be very helpful to see the sensitivity to the M parameter if you have such a plot. Thanks.

borisgribkov commented 5 years ago

I'm going to close this issue for several reasons. The most important one: some conclusions were made after ArcFace training in Caffe, but now I see that the MxNet and Caffe versions are completely different; moreover, as I see now, it's not possible to reproduce the MxNet results with the Caffe version.

@Not-IITian I have this plot for Caffe, but I guess this is useless.

Regarding MxNet ArcLoss usage for training with not-aligned images, I'm checking it now, but convergence is much harder than for aligned faces.

borisgribkov commented 5 years ago

Finally, I have tried ResNet-50 with ArcFace for VGG2 (not aligned) training. First, it can't be trained from scratch, only starting from Softmax weights. Second, the final result depends dramatically on M; the final accuracy can be worse or better than for a typical Softmax. M = 0.3 is the optimum; in this case the histogram intersection (distances between class centroids vs. distances within centroids, on the VGG2 test set) is half that of the Softmax version. Next, I want to combine ArcFace with a Spatial Transformer for alignment. @nttstar Thanks for the code provided, it's very useful.

nttstar commented 5 years ago

@borisgribkov Yes for vgg2, m=0.3 is the best.
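
For context on the M (margin) parameter discussed above: ArcFace adds an angular margin m to the angle between an embedding and its ground-truth class weight before the softmax. The following is a minimal pure-Python sketch of that idea for a single sample, not the MxNet or Caffe implementation from this thread. The function name and the scale `s = 64.0` are assumptions for illustration, and the handling of `theta + m > pi` that real implementations need is omitted.

```python
import math

def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def arcface_logits(embedding, class_weights, label, m=0.3, s=64.0):
    """Logits for one sample with an additive angular margin on the true class.

    embedding: list of floats; class_weights: one weight vector per class;
    label: index of the ground-truth class; m: margin in radians;
    s: feature scale (assumed value, not taken from the thread).
    """
    e = _normalize(embedding)
    logits = []
    for k, w in enumerate(class_weights):
        w = _normalize(w)
        # Cosine of the angle between the embedding and the class weight.
        cos_theta = max(-1.0, min(1.0, sum(a * b for a, b in zip(e, w))))
        theta = math.acos(cos_theta)
        if k == label:
            theta += m  # penalize only the ground-truth class
        logits.append(s * math.cos(theta))
    return logits
```

A larger m lowers the ground-truth logit more, which makes the classification task harder during training; this is one plausible reason why convergence is sensitive to m.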

shiyuanyin commented 5 years ago

@borisgribkov Hello, your test is very useful. I want to ask about the distances between class centroids and the distances within centroids for different datasets and M values. Thank you for your help. "M = 0.3 is the optimum, in this case histograms (distances between class centroids and distances within centroids, this is for VGG2 test set) intersection is twice lower than for SoftMax version."

borisgribkov commented 5 years ago

@shiyuanyin Ok, no problem, will be happy to help! And what is your question?

shiyuanyin commented 5 years ago

@borisgribkov
Thank you for your response. My question is how to calculate the distances (distances between class centroids and distances within centroids, for the VGG2 test set). Do you have code for reference?

shiyuanyin commented 5 years ago

Hello, thank you for your reply. I would like to ask how to calculate and visualize the inter-class and intra-class distances for different datasets and different values of the loss margin m. Is there any code I can refer to?

borisgribkov commented 5 years ago

For example, say you have 10 different persons and 10 images for each person, 100 images in total. You calculate an embedding vector for each image (100 vectors), then average the vectors for each person; as a result you have 10 averaged vectors (centroids). Now you can calculate the cosine distances between these centroids. Then calculate the cosine distances between the 10 vectors belonging to the same person. There is no need to compare vectors of different persons here, just compare vectors within each person's directory. Finally you can plot two histograms: distances between centroids and distances within centroids. Their intersection reflects your model quality; ideally it should be 0.
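
The procedure described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, "distances within centroids" is read as pairwise distances between vectors of the same person, and the intersection is computed as the sum of bin-wise minima of two normalized histograms (the thread does not say exactly how it was computed).

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def centroid(vectors):
    # Element-wise mean of all embedding vectors of one person.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def distance_sets(embeddings_by_person):
    """embeddings_by_person: dict person_id -> list of embedding vectors."""
    centroids = [centroid(v) for v in embeddings_by_person.values()]
    # Distances between every pair of person centroids.
    between = [cosine_distance(a, b) for a, b in combinations(centroids, 2)]
    # Distances between every pair of vectors of the same person.
    within = [cosine_distance(a, b)
              for vecs in embeddings_by_person.values()
              for a, b in combinations(vecs, 2)]
    return between, within

def histogram(values, bins=50, lo=0.0, hi=2.0):
    # Normalized histogram; cosine distance lies in [0, 2].
    counts = [0] * bins
    for v in values:
        idx = min(bins - 1, max(0, int((v - lo) / (hi - lo) * bins)))
        counts[idx] += 1
    return [c / len(values) for c in counts]

def histogram_intersection(between, within, bins=50):
    hb = histogram(between, bins)
    hw = histogram(within, bins)
    return sum(min(b, w) for b, w in zip(hb, hw))
```

With this definition the intersection is 0 when the two distributions do not overlap at all and 1 when they coincide, so lower is better.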

shiyuanyin commented 5 years ago

@borisgribkov Thank you very much for your help, I'll do it myself.

youthM commented 5 years ago

Hi, have you managed to combine ArcFace with a Spatial Transformer for alignment? @borisgribkov

borisgribkov commented 5 years ago

@youthM Yes, we finally implemented it, but in Caffe. The training process is quite tricky, but the ST layer improves accuracy significantly. For Megaface it demonstrates ~10% better accuracy (compared with no alignment at all) for ResNet-18 (as far as I remember we reached 91% at 1E-6). We also made some tests with the VGG2 test set and observed that ST layer alignment works better than the traditional 5 key points. A bit strange, but we didn't find any difference between Affine and Projective transforms; moreover, I would say Affine is better. A different result has been published here: https://arxiv.org/abs/1701.07174

youthM commented 5 years ago

Great work. I also use Caffe, and do you mind releasing your caffemodel? I'm interested in it. @borisgribkov

borisgribkov commented 5 years ago

Not sure about model release, sorry. But you could use https://github.com/xialuxi/arcface-caffe for ArcFace and https://github.com/mikuhatsune/e2e_face for the ST layer; you can also find the net proto file there. The training process contains three steps: (1) a softmax model, which provides the initial weights for ArcFace, (2) an ArcFace model, which provides the initial weights for ST training, and (3) final training with the ST layer; roughly speaking this is ST net training, because the initial weights for the main network are taken from (2). Please note two things: use non-aligned data (VGG2 or MSceleb), and read the paper above carefully about the ST training process; you need to reduce the learning rate for the ST part of the network.

youthM commented 5 years ago

Thank you very much for your reply, it's helpful for me. I have some questions. What does the softmax model refer to? Should VGG2 or MSceleb be cleaned? What's the difference between initial training and final training for the ST layer? @borisgribkov

borisgribkov commented 5 years ago

1) You need initial weights for ArcFace training (2) to converge (see step 1); you can use SoftmaxWithLoss or something else, we used CenterLoss for example. 2) We didn't clean VGG2 but used a cleaned version of MSCeleb. 3) I couldn't get ArcFace with ST to converge from a random weight initialization; that is why I suggest using a pretrained ArcFace model as initial weights for ST training. The ST part will be initialized with random weights.

youthM commented 5 years ago

Hi, does using non-aligned data mean that no affine transform is applied? Is cropping needed?

borisgribkov commented 5 years ago

@youthM Sorry, I don't understand your question. For example, take VGG2 as is, don't apply any transforms to the images, and do steps (1) - (3) as mentioned above. The ST layer will make the affine transforms for you; no other actions are needed. Cropping is needed, yes: the initial size of VGG2 images is 256, and you need to crop to 224. Hope it helps.

youthM commented 5 years ago

thanks, which version of mtcnn do you use to crop the images?

borisgribkov commented 5 years ago

I didn't use mtcnn, I just took a 224 center crop from the 256 image. Read the VGG2 paper for more details.
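
A 224 center crop from a 256 image can be sketched as below. This is a minimal illustration on an image stored as a list of pixel rows; the helper names are hypothetical, and a real pipeline would more likely use the cropping option of its data loader.

```python
def center_crop_box(height, width, crop=224):
    """Return (top, left, bottom, right) of a centered crop window."""
    if crop > height or crop > width:
        raise ValueError("crop size exceeds image size")
    top = (height - crop) // 2
    left = (width - crop) // 2
    return top, left, top + crop, left + crop

def center_crop(image, crop=224):
    """image: list of rows, each row a list of pixels."""
    top, left, bottom, right = center_crop_box(len(image), len(image[0]), crop)
    return [row[left:right] for row in image[top:bottom]]
```

For a 256 x 256 input the crop window starts 16 pixels in from the top and left edges.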

WW2401 commented 5 years ago

Hi, when I train ST, I met some problems, the loss just dropped from 44.2862 to 24.0851 after 43300 Iterations, and the accuracy of training was always 0. Do you know why? Can you give me some help? Thanks a lot. @borisgribkov

I1023 14:11:26.380461 44078 solver.cpp:228] Iteration 42700, loss = 24.1595
I1023 14:11:26.380646 44078 solver.cpp:244]     Train net output #0: accuracy = 0
I1023 14:11:26.380658 44078 solver.cpp:244]     Train net output #1: accuracy-t = 0
I1023 14:11:26.380666 44078 solver.cpp:244]     Train net output #2: softmax_loss = 23.9899 (* 1 = 23.9899 loss)
I1023 14:11:26.508663 44078 sgd_solver.cpp:106] Iteration 42700, lr = 0.001
I1023 14:14:43.423959 44078 solver.cpp:228] Iteration 42800, loss = 24.1739
I1023 14:14:43.424156 44078 solver.cpp:244]     Train net output #0: accuracy = 0
I1023 14:14:43.424167 44078 solver.cpp:244]     Train net output #1: accuracy-t = 0
I1023 14:14:43.424176 44078 solver.cpp:244]     Train net output #2: softmax_loss = 24.1554 (* 1 = 24.1554 loss)
I1023 14:14:43.554633 44078 sgd_solver.cpp:106] Iteration 42800, lr = 0.001
I1023 14:18:00.082248 44078 solver.cpp:228] Iteration 42900, loss = 24.1898
I1023 14:18:00.083068 44078 solver.cpp:244]     Train net output #0: accuracy = 0
I1023 14:18:00.083094 44078 solver.cpp:244]     Train net output #1: accuracy-t = 0
I1023 14:18:00.083108 44078 solver.cpp:244]     Train net output #2: softmax_loss = 24.4256 (* 1 = 24.4256 loss)
I1023 14:18:00.211680 44078 sgd_solver.cpp:106] Iteration 42900, lr = 0.001
I1023 14:21:16.995638 44078 solver.cpp:228] Iteration 43000, loss = 24.193
I1023 14:21:16.995828 44078 solver.cpp:244]     Train net output #0: accuracy = 0
I1023 14:21:16.995841 44078 solver.cpp:244]     Train net output #1: accuracy-t = 0
I1023 14:21:16.995849 44078 solver.cpp:244]     Train net output #2: softmax_loss = 24.2949 (* 1 = 24.2949 loss)
I1023 14:21:17.127319 44078 sgd_solver.cpp:106] Iteration 43000, lr = 0.001
I1023 14:24:34.829201 44078 solver.cpp:228] Iteration 43100, loss = 24.1037
I1023 14:24:34.829411 44078 solver.cpp:244]     Train net output #0: accuracy = 0
I1023 14:24:34.829422 44078 solver.cpp:244]     Train net output #1: accuracy-t = 0
I1023 14:24:34.829432 44078 solver.cpp:244]     Train net output #2: softmax_loss = 24.0128 (* 1 = 24.0128 loss)
I1023 14:24:34.958714 44078 sgd_solver.cpp:106] Iteration 43100, lr = 0.001
I1023 14:27:52.216305 44078 solver.cpp:228] Iteration 43200, loss = 24.0942
I1023 14:27:52.216483 44078 solver.cpp:244]     Train net output #0: accuracy = 0
I1023 14:27:52.216495 44078 solver.cpp:244]     Train net output #1: accuracy-t = 0.03125
I1023 14:27:52.216508 44078 solver.cpp:244]     Train net output #2: softmax_loss = 23.8307 (* 1 = 23.8307 loss)
I1023 14:27:52.343477 44078 sgd_solver.cpp:106] Iteration 43200, lr = 0.001
I1023 14:31:09.334142 44078 solver.cpp:228] Iteration 43300, loss = 24.0851
I1023 14:31:09.334396 44078 solver.cpp:244]     Train net output #0: accuracy = 0
I1023 14:31:09.334409 44078 solver.cpp:244]     Train net output #1: accuracy-t = 0
I1023 14:31:09.334419 44078 solver.cpp:244]     Train net output #2: softmax_loss = 24.245 (* 1 = 24.245 loss)

borisgribkov commented 5 years ago

@WW2401 Good to hear about your progress, your situation is quite common. First of all, did you successfully train the Caffe ArcFace model without the ST layer? You need to use it as initial weights for ST training, because the ST model can't be trained from scratch.

Second, in your case there is no convergence. Usually it's a matter of balance between the learning rate for the main network and for the ST branch; for the ST branch the lr should be much lower. As I see, lr = 0.001; try 0.01 instead, but use lr_mult: 0.001 or even 0.0001 for every trainable layer in the ST branch.

WW2401 commented 5 years ago

I successfully trained the Caffe ArcFace model without the ST layer and it converged. The initial lr was 0.01 when I trained ST.

borisgribkov commented 5 years ago

Use it as initial weights, and please note the different LR for the main network and the ST part. If that doesn't help, try a different LR and a different lr_mult; these parameters depend on the dataset you use and on the neural net structure.

WW2401 commented 5 years ago

Yes, I used the ArcFace model without the ST layer as initial weights, and I set a different lr_mult (lr_mult=0.1 for ST) for the main network and the ST part when training. The following is the loss of the ArcFace model without the ST layer. @borisgribkov [image: arc10_trainloss]

borisgribkov commented 5 years ago

@WW2401 sounds good, use lr_mult 0.001 or even lower, 0.0001

WW2401 commented 5 years ago

Thanks a lot.

WW2401 commented 5 years ago

@borisgribkov I tried lr_mult 0.001 and even 0.0001, but there is still no convergence. Until the run ended, the loss stayed around 24 (23.7368 after 30000 iterations, 65000 iterations in total). The training accuracy was still 0. The loss is shown below; maybe the first stepvalue should be smaller? I set the first stepvalue to 35000. [image: st_trainloss]

borisgribkov commented 5 years ago

@WW2401 What dataset did you use for training? MS Celeb? I checked my parameters: 0.02 base learning rate and 0.00005 lr_mult for the ST branch. For VGG2 I used lr_mult = 0.0001 with the same base lr. So, try a lower base_lr and lr_mult. As far as I remember, the convergence behavior can be similar to your graph, but after the spike at 7500 iterations you should see a smooth decrease of the loss value (in your case above there is no decrease). Second point: I didn't try large networks like ResNet-50 and so on; I tried MobileNet, ResNet-18 and something between ResNet-18 and 34. For larger nets you may need other params.
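
To make the numbers above concrete: in Caffe the learning rate actually applied to a layer's parameters is the solver's current learning rate multiplied by that layer's lr_mult. A small sketch of the arithmetic, assuming the main network layers use lr_mult = 1 (this is not stated in the thread) and labelling the first setting only by its values because the dataset is not named explicitly:

```python
def effective_lr(base_lr, lr_mult):
    """Learning rate applied to a layer: solver lr times the layer's lr_mult."""
    return base_lr * lr_mult

# (base_lr, lr_mult for the ST branch) as quoted in the comment above.
settings = {
    "first setting (dataset not stated explicitly)": (0.02, 0.00005),
    "VGG2": (0.02, 0.0001),
}

for name, (base_lr, st_lr_mult) in settings.items():
    main_lr = effective_lr(base_lr, 1.0)       # assumed lr_mult = 1
    st_lr = effective_lr(base_lr, st_lr_mult)
    print(f"{name}: main network lr = {main_lr:g}, ST branch lr = {st_lr:g}")
```

In both cases the ST branch is updated with a step on the order of 1e-6, four orders of magnitude smaller than the main network.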

WW2401 commented 5 years ago

I used CASIA-WebFace and Resnet-36 for training. base_lr was set to 0.01. (I also tried 0.001, but in that case I didn't use a lower lr_mult. With base_lr 0.01, I tried lr_mult 0.001 and 0.0001.)

borisgribkov commented 5 years ago

Unfortunately I didn't try CASIA, but the lr parameters depend on the dataset you use. For MS Celeb you should use a lower lr_mult than for VGG2; I would expect 0.0001 to be enough for CASIA because it's more or less comparable with VGG2. I checked the training parameters for ResNet-34 with ST: I also used base_lr 0.1 with 0.000001 lr_mult. At the moment I can suggest trying a lower base_lr, like 0.001, and a low lr_mult like 0.0001 or even lower...

UPD: another idea, try a smaller network like ResNet-18, convergence is better.

WW2401 commented 5 years ago

When I used a lower base_lr and a lower lr_mult, the loss dropped very, very slowly; it seems to stay around 38 and even shows an upward trend. How long did it take you to train ST? And did you train ST as the author mentioned, with the learning rate decaying by 0.7 every 10000 iterations (https://arxiv.org/abs/1701.07174)? I tried to train ST that way, and the following is part of the loss. It dropped even more slowly, and I don't know if it can converge well.
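
The schedule mentioned above (decay by 0.7 every 10000 iterations) corresponds to a step policy, lr = base_lr * gamma^floor(iter / stepsize). A minimal sketch; the base_lr of 0.01 in the usage lines is only an example value:

```python
def step_lr(base_lr, iteration, gamma=0.7, stepsize=10000):
    """Step decay: multiply the learning rate by gamma every `stepsize` iterations."""
    return base_lr * gamma ** (iteration // stepsize)

# Example values of the schedule at a few iterations.
for it in (0, 10000, 25000, 50000):
    print(it, step_lr(0.01, it))
```

After 50000 iterations the rate is base_lr * 0.7^5, roughly 17% of the starting value, so this schedule decays fairly gently compared with dropping by a factor of 10 at fixed step values.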

borisgribkov commented 5 years ago

@WW2401 Any news? Training ResNet-18 with ST takes about 3 days with batch 64 and about 350K iterations on a server with 4 Titan Xp GPUs, but when it works it converges quite quickly. Sorry, I have no idea how to help because you use another net and dataset; I can only say that a balance between lr and lr_mult is needed. As I said before, maybe try a smaller network.

WW2401 commented 5 years ago

I tried to train the net with the two steps, 1. Trained softmax model with ST 2. Used softmax model as an initial weights to train arcface model with ST. It converged. But I'm in trouble with making matcaffe, it failed to make mattest. T_T

golunovas commented 4 years ago

@borisgribkov Have you maybe tried to pretrain arcface model on aligned images and then train ST?

borisgribkov commented 4 years ago

@golunovas Do you mean using aligned images (only!) to train arcface with and without the ST layer? No, I didn't try this and can only guess what will happen. Training the ST layer with aligned images may increase accuracy a little, this needs checking. It's a bit strange because you have to first align the face with 5 points and then do the same with the ST layer.

golunovas commented 4 years ago

@borisgribkov not exactly. I mean the following pipeline:

1. Train arcface using aligned images.
2. Add ST layer and freeze/decrease lr for trained arcface model and continue training on unaligned images.

Though, adding ST layer for aligned images seems to be interesting as well because it might help with imperfect alignment

borisgribkov commented 4 years ago

I agree with the last thought. Regarding the pipeline you suggested, accuracy is mostly determined by the last training stage, in your case ST with non-aligned faces. So, you can use an aligned ArcFace model as initial weights, but convergence will be slower.

borisgribkov commented 4 years ago

@golunovas I would expect slower convergence in this case, but accuracy is mostly determined by the last training, in our case it's ST with non-aligned faces. So, finally, it should be more or less the same. Regarding your last thought, yes, I agree

golunovas commented 4 years ago

@borisgribkov Thank you. Basically, my idea was based on the assumption that the 5-pts alignment is more or less optimal and that it will help the ST layer learn faster. Btw, have you checked what images look like after the ST layer? It would be interesting to see how different they are from the 5-pts aligned ones.

borisgribkov commented 4 years ago

@golunovas Yes, of course I checked it. Sometimes the output looks like the result of 5-point alignment, sometimes not, for example it is a bit rotated. As far as I remember, the ST layer always brings the face area closer.

golunovas commented 4 years ago

@borisgribkov did ST layer give differently looking outputs for different images or you compared different training attempts?

borisgribkov commented 4 years ago

@golunovas The ST network calculates an affine (or projective) transform matrix from the input image, so, yes, the ST layer gives different outputs for different images.
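
To illustrate what applying such a per-image affine matrix means, here is a minimal sketch of the sampling step of a spatial transformer. It assumes a 2x3 matrix `theta` that maps normalized output coordinates in [-1, 1] to normalized input coordinates, and it uses nearest-neighbour sampling for brevity (ST layers normally use bilinear sampling so that gradients can flow). In a real network `theta` would be predicted by the localization branch; the function name and the list-of-rows image format are assumptions for illustration.

```python
def affine_sample(image, theta, out_h, out_w):
    """Warp `image` (list of pixel rows) with a 2x3 affine matrix `theta`."""
    in_h, in_w = len(image), len(image[0])
    out = []
    for i in range(out_h):
        # Normalized output y coordinate in [-1, 1].
        y = -1.0 + 2.0 * i / (out_h - 1) if out_h > 1 else 0.0
        row = []
        for j in range(out_w):
            x = -1.0 + 2.0 * j / (out_w - 1) if out_w > 1 else 0.0
            # Where in the input this output pixel is read from.
            src_x = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            src_y = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            # Back to pixel indices, clamped to the image border.
            px = int(round((src_x + 1.0) * (in_w - 1) / 2.0))
            py = int(round((src_y + 1.0) * (in_h - 1) / 2.0))
            px = min(in_w - 1, max(0, px))
            py = min(in_h - 1, max(0, py))
            row.append(image[py][px])
        out.append(row)
    return out
```

With the identity matrix the image is returned unchanged, while a matrix such as `[[0.5, 0, 0], [0, 0.5, 0]]` reads only from the central region, which matches the observation in the thread that the ST layer tends to bring the face area closer.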

golunovas commented 4 years ago

@borisgribkov ok, got it. thank you.

WW2401 commented 4 years ago

Regarding the pipeline suggested above (train arcface on aligned images, then add the ST layer, freeze or decrease the lr for the trained arcface model, and continue training on unaligned images): have you validated the idea? Did you find anything?

borisgribkov commented 4 years ago

Hi @WW2401 No, I didn't try the ST layer with aligned images. There was no need for us to do so, because we use unaligned data only.

WW2401 commented 4 years ago

Thank you. And have you tried to compare the results between arcface with ST layer (unaligned images) and arcface without ST layer (aligned images)? How about the result?

deepinsight / insightface

Did anyone try to use ArcLoss without alignment? #510