I am not sure what you mean by the first strategy, but you can definitely try both and see which one works better. Our sampling strategy is to first train the Reranker with global hard samples, then train with partial hard samples.
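(For concreteness, a minimal sketch of what such a two-stage hard-sample schedule could look like; the function name and switch epoch below are hypothetical illustrations, not taken from the released code.)

```python
# Hypothetical two-stage schedule: early epochs mine hard samples globally
# over the whole database, later epochs mine from a partial/restricted pool.
def hard_sample_schedule(num_epochs: int, switch_epoch: int):
    for epoch in range(num_epochs):
        yield epoch, ("global" if epoch < switch_epoch else "partial")

for epoch, pool in hard_sample_schedule(num_epochs=6, switch_epoch=3):
    print(f"epoch {epoch}: mine hard samples from the {pool} pool")
```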
Okay, thank you, I will try that out. I was attempting end-to-end training. Keeping everything the same as the original code you provided here, except the image resolution (changed from 480x640 to 320x240) and the batch size (from 0.0001 to 0.00035), drastically decreases the validation recalls, in the following order:
epoch 1: rerank R@1: 55.1
epoch 2: rerank R@1: 46.1
epoch 3: rerank R@1: 21.8
epoch 4: rerank R@1: 7.6
Do you see why this could happen? Did you also observe this in your ablation? Please share your thoughts; I am curious where the bug could be. Many thanks! Waiting for your reply!
We did not observe this. I saw your resolution is 320x240; do you mean 240x320, since the original images are 480x640? We need to keep the aspect ratio unchanged. I guess you mean the learning rate when you say 0.00035; this can have a large impact on performance. The performance at epoch 1 should be 66.1 for R@1, and it keeps improving. You might need to tune the learning rate if you are using a different batch size per GPU and number of GPUs.
"I saw your resolution is 320x240, do you mean 240x320, as the original images are 480x640" : I haven't followed the aspect ratio. I will try correcting this.
" guess you mean learning rate by using 0.00035": This I followed: if new_batch_size = M x old_batch_size, then new_lr = sqrt(M) x old_lr, where M is a scalar. We don't have a deterministic relationship between lr and batch size, right? A few blogs say if bs increases, lr should be increased by following the above relation, and a few say if one increases, the other should decrease.
"The performance of epoch1 should be 66.1 for R@1 and it keeps improving.": This I may not see since the image resolution is brought down. This is my understanding. You can please correct me.
"You might need to tune the learning rate if you are using different batch size per GPU and number of GPUs.": I think, in the paper you wrote, you used 8 GPUs. So, I am using a single GPU. And the batch size you picked is 64. I used 8. Then 64/8 gpus = 8. Then the baseline lr (i.e., 0.0001) should work for a single GPU with 8 batch size, right? and the rerank_batch_size also I have reduced to 10 from 20.
Below are the results with lr = 0.0001 and batch size = 8, run on a single GPU. This is the trend:
Epoch[00]: 65.1, Epoch[01]: 63.6, Epoch[02]: 60.5, Epoch[03]: 56.9, Epoch[04]: 51.4, Epoch[05]: 52.3, Epoch[06]: 53.2, Epoch[07]: 54.7, Epoch[08]: 51.8, Epoch[09]: 53.6, Epoch[10]: 54.2, Epoch[11]: 52.7, Epoch[12]: 52.0, Epoch[13]: 54.5
So it is fluctuating. Please suggest where I am going wrong.
Did you correct the resolution issue? I suggest using a smaller lr.
Yes, I changed it to 240x320. The trend above is for lr = 0.0001 (assuming you used 8 GPUs with a total batch size of 64). At the beginning of the training code (train_reranker.py), you wrote cuda_available_device='1,2,3,0'. So how many GPUs did you use for the batch size of 64, 8 or 4?
"I suggest using a smaller lr": Trying with 0.0001/sqrt(2) = 0.0000707. I will wait for the results.
Thanks!!
The results do not seem right. I would suggest first using a much smaller lr (e.g., 0.00001) to ensure that the model is optimized properly. The "cuda_available_device='1,2,3,0'" is just a comment; we used 8 GPUs for training.
The lr scaling law is for SGD. It is still unclear which lr is best for Adam or AdamW under different settings. I would suggest tuning the lr with a x0.1 strategy.
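(The x0.1 strategy simply means trying successively smaller learning rates and keeping whichever gives the best validation R@1; a trivial illustration, not code from this repository.)

```python
base_lr = 1e-4
for k in range(4):  # candidates: 1e-4, 1e-5, 1e-6, 1e-7
    lr = base_lr * (0.1 ** k)
    print(f"train a few epochs with lr = {lr:g} and compare validation R@1")
```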
With lr = 0.00001 (i.e., 0.1 x 0.0001), below is the trend:
Epoch[00]: R@1 = 51.8, Epoch[01]: R@1 = 67.7, Epoch[02]: R@1 = 72.8, Epoch[03]: R@1 = 74.7, Epoch[04]: R@1 = 77.6, Epoch[05]: R@1 = 76.4, Epoch[06]: R@1 = 78.9, Epoch[07]: R@1 = 76.8, Epoch[08]: R@1 = 75.1, Epoch[09]: R@1 = 78.8, Epoch[10]: R@1 = 75.3
It is increasing for the first few epochs and then starts fluctuating. With lr = 0.0001 an epoch takes 1 hour, and now with lr = 0.1 x 0.0001 it takes 2.5 hours per epoch (this is on a shared resource). Why could this fluctuation happen?
Thanks!
This is common. Usually the optimization curve is not smooth. You can tune the lr and see if you get a better result. Also, 10 epochs are not enough; you might need to train for more epochs.
Yes, it is currently training at epoch 71. The best so far is R@1 = 83.8%, with lr = 0.00001 and batch size = 8. But again, the training recall curve is not smooth; it has a lot of fluctuations. If possible, could you please share the training recall curve for the original model you reported, if you saved it? It would help me understand the training better.
Many thanks!
I cannot find our original log now. I have the log for the 4-GPU run, and the results are almost the same. You can check the attached log file: info.log
Thank you so much, Szhu. This helps a lot. I hope this will be my last question for you.
In the log file above, I see training_lr = 0.0001. In the code you released, where you said you used 8 GPUs, the learning rate is also 0.0001, the same as in the 4-GPU case. I also see that the total batch size in the 4-GPU case is 32 (each GPU gets 8 samples), and the total batch size in the 8-GPU case is 64 (each GPU gets 8 samples). Is it because the batch size on each individual GPU is kept the same (8 samples) that you chose the same learning rate in both cases?
This is just an empirical setting and I kept the lr unchanged. The result is almost the same.
Okay. So,
"Is it because that the batch size on individual GPUs is maintained at the same (8 samples), and that is why you chose the same learning rate in both the cases?": This need not be the case, right? Since the parameter update happens on the gradients aggregated over the total batch, if the total batch size changes (irrespective of the chunk of the batch that each GPU gets), we would have to tune a new learning rate?
Not sure what you mean, but you might need to figure it out by yourself.
I mean, when we use multi-GPU training, we do not need to worry about how many GPUs we use (as long as the batch size is divisible by that number) when tuning the learning rate, right? That technique is for memory management. For hyperparameter tuning, the total batch size is what we should look at, not the chunks of the batch placed on individual GPUs?
The lr vs. batch size relationship is very tricky for multi-GPU training, especially with the Adam optimizer. There is no guarantee that you will get the same results if you follow any scaling law. You will need to try different values and figure it out.
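(For clarity on the point discussed above: in typical DistributedDataParallel-style training, gradients are averaged across replicas, so the quantity that interacts with the learning rate is the total batch per optimizer step. A small illustration of the arithmetic, not code from this repository.)

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum_steps: int = 1) -> int:
    """Total number of samples contributing to a single optimizer step."""
    return per_gpu_batch * num_gpus * grad_accum_steps

# Settings mentioned in this thread (8 samples per GPU in every case):
print(effective_batch_size(8, 8))  # paper setting: 64
print(effective_batch_size(8, 4))  # attached 4-GPU log: 32
print(effective_batch_size(8, 1))  # single-GPU run discussed here: 8
```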
Hi,
Good job, and thanks for releasing the code! I just have one doubt. We have two dependent ranking stages here (global retrieval and the Reranker). If we opt for online mining and want to refresh the cache every iteration, which do you suggest: 1. sample triplets (every image in a batch can be a positive, a negative, and a query) using the global features (256-dimensional embeddings), pass the same triplets to the Reranker, and compute the reranker loss; or 2. sample triplets separately for the global embeddings and the reranker local features and compute separate losses? (A toy sketch of the two options follows below.)
Many thanks!
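(To make the two options concrete, here is a toy sketch with random features and a generic batch-hard triplet miner; the names, dimensions, and losses are illustrative only and are not the mining used in this repository.)

```python
import torch
import torch.nn.functional as F

def mine_hard_triplets(features: torch.Tensor, labels: torch.Tensor):
    """Toy in-batch hard mining: for each anchor, take the farthest positive
    and the closest negative by L2 distance (generic batch-hard strategy)."""
    d = torch.cdist(features, features)
    same = labels[:, None] == labels[None, :]
    self_mask = torch.eye(len(labels), dtype=torch.bool)
    pos_d = d.masked_fill(~same | self_mask, float("-inf"))
    neg_d = d.masked_fill(same, float("inf"))
    anchors = torch.arange(len(labels))
    return anchors, pos_d.argmax(dim=1), neg_d.argmin(dim=1)

def triplet_loss(feat, a, p, n, margin=0.1):
    return F.triplet_margin_loss(feat[a], feat[p], feat[n], margin=margin)

# Toy batch: 8 images from 4 places, a 256-d global embedding and a pooled
# reranker/local feature per image (random here, just so the sketch runs).
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
global_feat = F.normalize(torch.randn(8, 256), dim=1)
rerank_feat = F.normalize(torch.randn(8, 128), dim=1)

# Option 1: mine triplets once on the global embeddings and reuse the same
# (anchor, positive, negative) indices for both losses.
a, p, n = mine_hard_triplets(global_feat, labels)
loss_option1 = triplet_loss(global_feat, a, p, n) + triplet_loss(rerank_feat, a, p, n)

# Option 2: mine triplets independently in each feature space.
a2, p2, n2 = mine_hard_triplets(rerank_feat, labels)
loss_option2 = triplet_loss(global_feat, a, p, n) + triplet_loss(rerank_feat, a2, p2, n2)

print(loss_option1.item(), loss_option2.item())
```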