davidsandberg / facenet

Face recognition using Tensorflow
MIT License

Optimal hyperparameter selection for triplet loss training #1012

Open varun-parthasarathy opened 5 years ago

varun-parthasarathy commented 5 years ago

Hi @davidsandberg! Thanks for your work on this repo!

When training with triplet loss using the recommended hyperparameters in the wiki, what kind of results were obtained? It'd be great if I could take a look at those results.

I'm also curious about what kind of performance others have gotten using triplet loss on VGGFace2. I can't seem to find the optimal hyperparameters that come even remotely close to the classifier model.

douxiaotian commented 5 years ago

same question! I find the accuracy and val_rate are far lower than what the paper indicated.

varun-parthasarathy commented 5 years ago

@douxiaotian what kind of results did you get? I'm currently trying out a cyclical learning rate with SGD, but it'll take a while to finish training. I'm planning to try out some other optimizers (like AdamW) instead once it's done. It might give better results that way.

douxiaotian commented 5 years ago

I got accuracy around 80% and a val rate lower than 10%. I think the problem might be lack of data; I'm not sure how Google got such a huge amount of data.

varun-parthasarathy commented 5 years ago

Yeah, I think so too. I'm currently downloading the Deepglint dataset (Cleaned MS-Celeb + Asian Celeb; ~7 million images, ~180,000 identities) - my previous experiment with SGD failed miserably. I'll try to train using a CPU cluster; maybe I'll be able to increase the batch size that way.

kifaw commented 5 years ago

Hi, I recently tried to train the model with the triplet loss script, using only the CASIA-WebFace dataset (the cleaned version) for training, but validation on LFW is very low: around 10%~18% accuracy after more than 350 epochs. I used the same alignment for both datasets as described in the wiki, and I tried two different optimizers, RMSPROP and ADAM; both give low validation on LFW, while I get around 80%~88% accuracy on the training set. Does anyone have an idea how to solve this problem? Is it possible to fine-tune from the pretrained model that was trained with softmax loss?

xlphs commented 5 years ago

I used these params to fine-tune and they worked great; in fact the model overfitted and I had to go back and pick earlier epochs.

--keep_probability 0.6 \
--weight_decay 5e-4 \
--people_per_batch 720 \
--images_per_person 5 \
--batch_size 210 \
--learning_rate 0.002 \
--learning_rate_decay_factor 0.98 \
--learning_rate_decay_epochs 10 \
--optimizer MOM \
--embedding_size 512
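
(For reference, a full fine-tuning invocation would look roughly like the following. The paths are placeholders rather than the ones I actually used, and the extra flags are the standard train_tripletloss.py arguments, so adapt them to your own setup.)

python src/train_tripletloss.py \
    --logs_base_dir ~/logs/facenet \
    --models_base_dir ~/models/facenet \
    --data_dir ~/datasets/faces_aligned \
    --lfw_dir ~/datasets/lfw_aligned \
    --lfw_pairs data/pairs.txt \
    --pretrained_model ~/models/facenet/softmax_base/model.ckpt \
    --keep_probability 0.6 \
    --weight_decay 5e-4 \
    --people_per_batch 720 \
    --images_per_person 5 \
    --batch_size 210 \
    --learning_rate 0.002 \
    --learning_rate_decay_factor 0.98 \
    --learning_rate_decay_epochs 10 \
    --optimizer MOM \
    --embedding_size 512
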
kifaw commented 5 years ago

@xlphs Did you train using the triplet loss script or the softmax script? And how did you come to choose those specific hyperparameters?

xlphs commented 5 years ago

@kifaw The triplet loss script of course. I went through a bunch of old issues on triplet loss and ended up with that

varun-parthasarathy commented 5 years ago

@xlphs your results seem promising! Just to clarify, what dataset did you finetune on? Also, have you tried training from scratch at any point?

xlphs commented 5 years ago

@Var-ji I merged folders from VGGFace2 and the Deepglint Asian celebrity dataset. Training from scratch didn't work, so I ended up taking the softmax model and fine-tuning it with ArcFace loss and finally triplet loss. Triplet loss is easy to overfit with, and I forgot to remove the overlap VGGFace2 has with LFW, so my 99.7% accuracy on LFW is not that reliable. Then again, at that level only about a dozen pairs are still failing, so IMHO it's not worthwhile to chase higher LFW accuracy; better to try other datasets. (Triplet loss gave roughly a 0.1% accuracy increase after ArcFace loss.)
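
For anyone who hasn't seen ArcFace before: the idea is to add an additive angular margin to the target-class logit before the softmax. A rough numpy sketch of the logit computation (illustrative only, not the code I actually trained with):

import numpy as np

def arcface_logits(embeddings, weights, labels, s=64.0, m=0.5):
    # embeddings: (batch, dim) float array, weights: (dim, n_classes),
    # labels: (batch,) integer class ids.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos_theta = np.clip(emb @ w, -1.0, 1.0)   # cosine of angle to each class centre
    theta = np.arccos(cos_theta)
    one_hot = np.zeros_like(cos_theta)
    one_hot[np.arange(len(labels)), labels] = 1.0
    # Add the angular margin m only on the target class, then rescale by s;
    # the result is fed into an ordinary softmax cross-entropy loss.
    return s * np.cos(theta + m * one_hot)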

kifaw commented 5 years ago

@xlphs Thank you for sharing what you've done, I appreciate it! So it seems you used the pretrained model provided in this repo, right? And just to clarify: if you remove the overlap a dataset has with LFW, how does validation on LFW work when none of the LFW identities are in the training set?

varun-parthasarathy commented 5 years ago

@kifaw the idea of validation is to see how well the model generalizes to data it hasn't seen before. If the overlap is still present, the results are biased because the model has, in fact, seen that data; as a result, validation accuracy will be higher than it should be.

varun-parthasarathy commented 5 years ago

I ran a learning rate range test a while back; the results are interesting (LR range test plot attached). Does this mean larger learning rates would perform well? Can someone clarify this? This was run with a batch size of 120 on VGGFace2, with people_per_batch=90, images_per_person=40 and SGDW with Nesterov momentum as the optimizer.

I also wanted to point out something I realized - the FaceNet paper only selects random semi-hard triplets for training; the default method in the code selects both semi-hard as well as hard triplets. Is it possible that this is what's leading to poor convergence?
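
To make the distinction concrete, here is a small numpy sketch of the semi-hard condition from the paper (illustrative only; this is not the repo's select_triplets implementation):

import numpy as np

def semi_hard_negative_indices(dist_ap, dist_an, alpha=0.2):
    # dist_ap: scalar squared distance anchor-positive.
    # dist_an: 1-D array of squared distances anchor-negative.
    # Semi-hard (as in the FaceNet paper): farther from the anchor than the
    # positive, but still inside the margin. Hard negatives
    # (dist_an <= dist_ap) are deliberately excluded here.
    return np.where((dist_an > dist_ap) & (dist_an < dist_ap + alpha))[0]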

kifaw commented 5 years ago

@Var-ji Thanks for clarifying. So when validating on LFW, it compares identities within the LFW dataset itself, is that what it does? As for the plot you showed above, I find it strange that performance is better at high learning rates; I thought the learning rate should be a small value! Is there any explanation for that? Thanks again for your replies!

varun-parthasarathy commented 5 years ago

@kifaw that's something I unfortunately don't understand myself. I'm running some more range tests right now using the FaceNet triplet selection method, but I find it strange that the learning rate can apparently be raised up to 2 without causing too much fluctuation in training. One possibility is that because we keep selecting new triplets as we train, the loss keeps decreasing even though it becomes noisier and noisier.

varun-parthasarathy commented 5 years ago

I guess there were some issues with the range test (I didn't run it for long enough). I ran it for about 20000 steps and got a more reasonable range of 0.075 to 0.4. I also got a chance to ask one of the authors of the paper about the hyperparameter settings, and he said that while he can't give any generic settings, the learning rate for triplet loss is generally always higher than what would be used for a softmax-based classifier.

neklom commented 5 years ago

@Var-ji what do you mean by the range test? I tried training with triplet loss three times using the train_tripletloss.py script on the CASIA-WebFace dataset for more than 300 epochs, but it doesn't give any results in validation. Have you gotten any good results lately?

varun-parthasarathy commented 5 years ago

@neklom the range test essentially involves slowly increasing the learning rate over time, while tracking loss vs. learning rate. At a certain value of the learning rate, loss falls drastically and then levels off. The range of learning rate values for which this drop is seen is the optimal range for training your network. I haven't gotten any good results lately. While I can get accuracies of 90+%, validation generally is about 20% at best.
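
If you want to run the range test yourself, the procedure is simple; here's a minimal sketch (train_one_step is a hypothetical placeholder, not a function from this repo):

def lr_range_schedule(step, total_steps, min_lr=1e-5, max_lr=2.0):
    # Exponentially ramp the learning rate from min_lr to max_lr over total_steps.
    return min_lr * (max_lr / min_lr) ** (step / float(total_steps))

# Hypothetical usage inside a training loop:
# history = []
# for step in range(total_steps):
#     lr = lr_range_schedule(step, total_steps)
#     loss = train_one_step(batch, learning_rate=lr)
#     history.append((lr, loss))
# Plot loss against lr (log x-axis); the usable range is where the loss
# drops steeply, ending just before it blows up.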

neklom commented 5 years ago

@Var-ji thanks for the explanation. I'm facing the same issue: training on CASIA-WebFace, aligned as described in the wiki, I get more than 86% accuracy on the training set and similar on LFW, but only ~11% to ~18% on the validation rate, and I don't understand what the problem is. I hope someone can help us with this!

xlphs commented 5 years ago

@Var-ji That validation rate can’t be right. I trained many times and val rate is always slightly below accuracy once it goes above 0.9, so if acc is 0.99 then val rate might be 0.98. You should examine the actual value of your embeddings, perhaps they became extremely small.
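
A quick way to check is to dump a batch of embeddings and look at their norms and pairwise distances. A small numpy sketch, assuming embeddings is an (N, 512) array extracted from the model:

import numpy as np

def embedding_stats(embeddings):
    # embeddings: (N, D) array of embeddings for one evaluation batch.
    norms = np.linalg.norm(embeddings, axis=1)
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    off_diag = dists[~np.eye(len(embeddings), dtype=bool)]
    print("mean norm %.4f | mean pairwise dist %.4f | min pairwise dist %.4f"
          % (norms.mean(), off_diag.mean(), off_diag.min()))
    # A mean pairwise distance near zero means the embeddings have collapsed
    # to (almost) a single point, which makes the reported val rate meaningless.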

varun-parthasarathy commented 5 years ago

@xlphs From my experience, training seems to become unstable once accuracy crosses 0.9 - the validation rate starts fluctuating wildly between 0.2 and 0.5. I generally stop training at this point; however, it's possible that near the optimal minimum the loss surface becomes quite bumpy. I'll try continuing training beyond this point and see what happens.

neklom commented 5 years ago

@xlphs Did you get those results with the parameters you provided above and using ArcFace loss too? And what about the alignment, how did you align the data? Could you share your final code with us, please? I've been trying to solve this problem for weeks. Thank you!

xlphs commented 5 years ago

Here are the TensorBoard LFW graphs from ArcFace loss training; I don't have the logs for the others anymore, but the triplet loss curves look similar enough. I use the training scripts from this repo; the code is good enough, especially train_tripletloss.py. Notice it doesn't use tricks like random flip or mean subtraction to boost LFW accuracy.

[Screenshot: TensorBoard LFW graphs, 2019-01-28]

neklom commented 5 years ago

Thank you @xlphs

varun-parthasarathy commented 5 years ago

Training from scratch with triplet loss gives an accuracy of about 92.5% (similar to OpenFace), while validation tends to vary between 35% and 40%, even after 800k iterations. I guess with small batch sizes, this is the maximum accuracy that can be reached.

kifaw commented 5 years ago

@Var-ji As I remember from the VGGFace2 paper, they trained their model from scratch with softmax loss first and then fine-tuned it with triplet loss. Would that be effective here? I've also read that triplet loss needs training for many epochs, more like 1000 or more; could that be why it doesn't give good accuracy on the validation set? And what do you mean by iterations here?

varun-parthasarathy commented 5 years ago

From what I've read so far, training with softmax and then fine-tuning with triplet loss can be very effective; the problem is that with a large number of classes, training with softmax becomes problematic. If you train with softmax on VGGFace2 and then fine-tune on your own dataset, it should be fine, although I haven't tested this yet. From my experiments, I think increasing the embedding size can boost triplet loss performance. While the paper showed decreasing performance with increasing embedding size, I think there's a trade-off between dataset size and embedding dimensions: when the dataset is small, it's better to capture a larger number of partially relevant features than a small number of highly relevant ones. If you want to use a small embedding size, you'd have to train for much longer, which brings in the risk of over-fitting.

kifaw commented 5 years ago

I'll try a higher embedding size, maybe it will be effective as you said. But OpenFace used a 128-D embedding vector on CASIA-WebFace and still got good accuracy; could it be a matter of the architecture used?

varun-parthasarathy commented 5 years ago

I don't really think so - I was able to replicate the OpenFace results (as I mentioned 2 days ago) using a 128D embedding; however, it did take nearly 800,000 iterations to reach that point. I'm currently training using a 512D embedding, and it reached 91% accuracy in only 60000 iterations, and can be expected to improve further from there. I would recommend using a cyclic learning rate, as it allows you to explore the loss function and find potentially better minima.

kifaw commented 5 years ago

I'll try your suggestions, thank you very much !

varun-parthasarathy commented 5 years ago

Just an update - I've managed to reach 95% accuracy and about 67% validation in 200k iterations after training from scratch using a 512D embedding. It does appear that the model may be overfitting though, so I'm increasing weight decay and testing again.

varun-parthasarathy commented 5 years ago

Seems that it wasn't over-fitting after all - the validation rate was just fluctuating a bit. If anyone is interested in trying this out, here are the parameters I used -

cycle_size: 50000
lfw_nrof_folds: 10
image_size: 224
pretrained_model: None
random_crop: True
model_def: models.inception_resnet_v1
batch_size: 120
optimizer: MOMW
weight_decay: 1e-06
max_nrof_epochs: 300
epoch_size: 3600
embedding_size: 512
max_lr: 0.4
moving_average_decay: 0.9999
people_per_batch: 360
gpu_memory_fraction: 1.0
cycle_policy: triangular2
lfw_dir: ../lfw_aligned
learning_rate: 0.0033
images_per_person: 30
alpha: 0.2
random_flip: True
lfw_pairs: data/pairs.txt
keep_probability: 0.8
seed: 666

There are some changes I made to the code for this -

The dataset used is the combined Deepglint dataset. The dataset is used as is (since it's already cleaned and aligned), although images are first resized to 246 x 246, and then a random crop is applied to get an input size of 224 x 224. Validation is done on the aligned LFW dataset without mean subtraction.
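
Note that the cycle_size, max_lr and cycle_policy options above are my own additions rather than stock flags. The schedule itself is the standard triangular2 policy from the CLR paper; here's a minimal sketch of how the learning rate is computed per step with the values listed above (an illustration, not my exact code):

import math

def triangular2_lr(step, base_lr=0.0033, max_lr=0.4, cycle_size=50000):
    # Triangular2 policy: LR goes linearly base_lr -> max_lr -> base_lr over
    # 2 * cycle_size steps, and the peak is halved after every full cycle.
    cycle = math.floor(1 + step / (2.0 * cycle_size))
    x = abs(step / float(cycle_size) - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x) / (2.0 ** (cycle - 1))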

While these parameters can probably be tuned even further, it's a pretty decent start for anyone who wants to try training with triplet loss from scratch; the results, while not state-of-the-art, are acceptable nevertheless. It'd be great if someone could corroborate them though; once that's done, I guess this issue can be closed.

kifaw commented 5 years ago

@Var-ji That's great, thank you for sharing this information. Would you share the new code with us, please? Thanks again!

syorami commented 5 years ago

I ran a learning rate range test a while back; the results are interesting (LR range test plot attached). Does this mean larger learning rates would perform well? Can someone clarify this? This was run with a batch size of 120 on VGGFace2, with people_per_batch=90, images_per_person=40 and SGDW with Nesterov momentum as the optimizer.

I also wanted to point out something I realized - the FaceNet paper only selects random semi-hard triplets for training; the default method in the code selects both semi-hard as well as hard triplets. Is it possible that this is what's leading to poor convergence?

I have also done some experiments with the learning rate range test on triplet loss and got similar figures to yours. It seems that a larger learning rate helps triplet loss converge. However, in my experiments a large learning rate caused the embedding space to collapse, with all the data points squeezed to a single point; the loss then gets stuck at the selected margin and the model fails to learn anything useful.
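
(To see why the loss parks exactly at the margin: once every embedding collapses to the same point, both distances are zero and the hinge reduces to the margin itself. A tiny sketch:)

import numpy as np

alpha = 0.2
a = p = n = np.zeros(128)                      # collapsed embeddings
pos_dist = np.sum(np.square(a - p))            # 0.0
neg_dist = np.sum(np.square(a - n))            # 0.0
loss = max(pos_dist - neg_dist + alpha, 0.0)   # == alpha, regardless of the data
print(loss)                                    # 0.2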

I had guessed that cyclical learning rates are not that helpful in metric learning, whereas with ground-truth labels the method can actually boost training performance. But your results do surprise me.

I'm wondering: did you encounter the same issue of the loss getting stuck at the margin? And did you compare the cyclical learning rate against fixed-LR training and observe that it performs better?

Hoping for your clarification.

varun-parthasarathy commented 5 years ago

@tmac1997 I managed to reach a final accuracy of about 97.1% and a validation rate of 81.2% using a cyclic learning rate. The model converged pretty well - I didn't encounter any problems with the loss being stuck at the margin; towards the end it averaged around 0.008 or so. I trained using semi-hard triplets, not hard ones (mining hard triplets as well seems to be the default in the code).

I compared CLR with the default exponentially decaying LR, and found that using CLR was definitely better, especially when it came to reaching convergence. I trained my model for 1 million steps, but the model came very close to convergence within about 700k steps itself.

Judging by the fact that you got a similar figure from the LR range test, I think the training dataset also plays a role in determining the range. When I ran the range test on the Deepglint dataset, I got a range of 0.075 to 0.4. I recommend setting the lower bound one order of magnitude below the value obtained from the LR test, so that weight updates are smaller in the later stages of training.

I also tested several network architectures, including Xception and even EfficientNets, but they all performed worse than Inception-ResNet-v1; I suspect this is mostly because the batch size has to be reduced to fit training onto a single GPU.

iheo commented 5 years ago

@Var-ji thanks for sharing your experiment! I have a couple of questions:

  1. For 1 million steps, how many epochs is that?
  2. Is your recommended learning rate range [0.0075, 0.4]?
  3. Did you train from scratch or from a pretrained model?
varun-parthasarathy commented 5 years ago

@iheo There isn't really any concept of epochs here; the number of steps is just max_nrof_epochs * epoch_size (with the settings I listed above, 300 * 3600 = 1,080,000 steps, which is where the roughly 1 million figure comes from). I used a learning rate range of [0.003, 0.4], but 0.0075 should also work fine. The model was trained from scratch.

syorami commented 5 years ago

@Var-ji Thanks for the details! After a hard time exploring, I finally found the reason: I was using a pretrained backbone, and the large LR in CLR did irreversible damage to the backbone weights, leading to model collapse. After using sliced CLR (different learning rates for different parts of the model), I successfully trained the model.

BTW, I have another question: did CLR with SGDM outperform Adam (or other optimizers) with a step-decay LR policy for triplet loss as well? In my experiments with a pretrained model, I have to say both approaches performed about equally for fine-tuning.

varun-parthasarathy commented 5 years ago

@tmac1997 I found that SGDM with decoupled weight decay worked best with both CLR and step decay LR policies, when compared to Adam, AdamW, and RMSProp, at least when training from scratch.

iheo commented 5 years ago

@Var-ji A learning rate greater than 0.1 didn't work for me. I took a pretrained model and ran triplet training with alpha=0.1. So far I've gotten 98.6% LFW accuracy and an 88.3% LFW val rate; the learning rate was 0.05 when I hit 88.3%. I would expect even better results with more training. I used the ADAM optimizer.

varun-parthasarathy commented 5 years ago

@iheo That's because you're using a pre-trained model; the learning rate has to be lower when fine-tuning a pre-trained model with triplet loss. The learning rate should be about one order of magnitude lower than the LR you'd use for training the network from scratch.

jetsmith commented 4 years ago

I used MS1M-ArcFace to train a ResNet-50 face-ID model with triplet loss, and have two questions:

  1. Should I train jointly with triplet loss and softmax cross-entropy, or with triplet loss only (no classifier layer)? Which produces better results? I found that FaceNet uses only triplet loss.

  2. While using triplet loss to fine-tune my softmax-based model, the triplet loss is very small compared with the regularization loss. Total loss = triplet loss + regularization loss; for instance:

Epoch: [0][6/600] Time 0.700 Loss 0.540 Triplet Loss 0.013484
Epoch: [0][14/600] Time 3.034 Loss 0.545 Triplet Loss 0.018261

I wonder if this will hurt my triplet loss training. Is it OK to multiply the triplet loss by a large coefficient to increase its weight, say 10 or more?
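
To make the question concrete, this is the kind of weighting I mean (a toy sketch using the numbers from the log lines above, not my actual training code):

def total_loss(triplet_loss, regularization_losses, triplet_weight=1.0):
    # triplet_weight > 1 (e.g. 10) boosts the triplet term relative to the
    # weight-decay terms; triplet_weight = 1 is the plain sum.
    return triplet_weight * triplet_loss + sum(regularization_losses)

# With total ~0.54 and triplet ~0.013, regularization contributes roughly 0.53:
print(total_loss(0.013, [0.53]))         # ~0.54, unweighted
print(total_loss(0.013, [0.53], 10.0))   # ~0.66, triplet term scaled by 10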

varun-parthasarathy commented 4 years ago

@jetsmith I'd recommend training with softmax if you're looking for better accuracy, especially if you're training from scratch. However, I personally feel that softmax doesn't work as well as triplet loss during practical use when the dataset is small (I think it just reduces to something close to a pigeonhole problem). I'm still testing this point, though.

When you fine-tune with triplet loss, the triplet loss values will be low, so I think there's no need to scale it up. I haven't tried fine-tuning myself, but @xlphs would know better about this.

jetsmith commented 4 years ago

@jetsmith I'd recommend training with softmax if you're looking for better accuracy, especially if you're training from scratch. However, I personally feel that softmax doesn't work as well as triplet loss during practical use when the dataset is small (I think it just reduces to something close to a pigeonhole problem). I'm still testing this point, though.

When you fine-tune with triplet loss, the triplet loss values will be low, so I think there's no need to scale it up. I haven't tried fine-tuning myself, but @xlphs would know better about this.

@Var-ji, got it. About question 1: did you mean we compute both losses, triplet loss and softmax cross-entropy, and both contribute to the gradients?

varun-parthasarathy commented 4 years ago

@jetsmith I was referring to using only one of the two to train. Use softmax only to train for a while, and once that's done, use triplet loss only to fine-tune performance. Just make sure that you're lowering the learning rate before you start finetuning.

If you're looking to train using only triplet loss from scratch though, I've listed some of the optimal hyperparameters earlier in this thread - with the Deepglint dataset you should be able to reach a respectable validation rate and accuracy.

jetsmith commented 4 years ago

@Var-ji got it, thanks

taureanamir commented 4 years ago

@Var-ji, I want to use a cyclic learning rate for fine-tuning, but I don't know how to start. Could you help me find a starting point?

Also, while trying to fine-tune the softmax pre-trained model with triplet loss, my loss value seems stuck at 6 even after a million iterations. The loss never drops below 6 even with the LR as high as 3. Have you faced something similar?

I used a subset of Deepglint dataset with around 4500 identities and 150k images. Following are my hyperparameter values.

--image_size 160 \
--model_def models.inception_resnet_v1 \
--optimizer ADAM \
--learning_rate -1 \
--max_nrof_epochs 1000 \
--gpu_memory_fraction 0.9 \
--epoch_size 1000 \
--batch_size 120 \
--keep_probability 0.5 \
--weight_decay 5e-4 \
--learning_rate_decay_factor 0.98 \
--learning_rate_decay_epochs 4 \
--embedding_size 512

Learning rate schedule (epoch: learning rate):
0: 0.09
50: 0.02
100: 0.007
150: 0.0004
200: 0.00004
250: 0.000004
300: 0.000002
400: 0.000001

Your help will be highly appreciated. I've been stuck trying to improve performance on my own dataset for over two months.

varun-parthasarathy commented 4 years ago

@taureanamir I haven't faced any issues with large loss values - the loss values I got were low and consistently reduced during training. As for a starting point, here are some suggestions, but they will need to be tested first -

taureanamir commented 4 years ago

@Var-ji,

thanks for such a quick reply. I'll test the config you suggested and let you know the progress. Thanks again.

jjsjunior commented 4 years ago

@Var-ji I merged folders from VGGFace2 and the Deepglint Asian celebrity dataset. Training from scratch didn't work, so I ended up taking the softmax model and fine-tuning it with ArcFace loss and finally triplet loss. Triplet loss is easy to overfit with, and I forgot to remove the overlap VGGFace2 has with LFW, so my 99.7% accuracy on LFW is not that reliable. Then again, at that level only about a dozen pairs are still failing, so IMHO it's not worthwhile to chase higher LFW accuracy; better to try other datasets. (Triplet loss gave roughly a 0.1% accuracy increase after ArcFace loss.)

I'm sorry to ask this here, but did you code ArcFace yourself or take it from some repo? I've tried some ArcFace repos, but they didn't work well.