bytedance / R2Former

Official repository for R2Former: Unified Retrieval and Reranking Transformer for Place Recognition
Apache License 2.0

Online triplet mining for Ranking and Re-ranking stages #13

Closed Anuradha-Uggi closed 8 months ago

Anuradha-Uggi commented 1 year ago

Hi,

Good job, and thanks for releasing the code! I just have one doubt. We have two dependent ranking stages here (global retrieval and the Reranker). If we opt for online mining and want to refresh the cache every iteration, which of the following would you suggest?

1. Sample triplets (every image in a batch can serve as a query, a positive, or a negative) using the global features (256-d embeddings), pass the same triplets to the Reranker, and compute the Reranker loss on them.
2. Sample triplets separately for the global embeddings and for the Reranker's local features, and compute the two losses separately.
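For concreteness, here is a minimal sketch of the two options on a single batch, treating the Reranker output as a flat per-image embedding for simplicity. Every name here (`mine_triplets`, `loss_shared_triplets`, etc.) is illustrative and not the repo's actual API:

```python
import torch
import torch.nn.functional as F


def mine_triplets(embeddings, labels, margin=0.1):
    """Online hard-triplet mining within a batch from pairwise distances."""
    dist = torch.cdist(embeddings, embeddings)            # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)     # same-place mask
    idx = torch.arange(len(labels), device=labels.device)
    triplets = []
    for a in range(len(labels)):
        pos = idx[same[a] & (idx != a)]
        neg = idx[~same[a]]
        if len(pos) == 0 or len(neg) == 0:
            continue
        p = pos[dist[a, pos].argmax()]                    # hardest (farthest) positive
        n = neg[dist[a, neg].argmin()]                    # hardest (closest) negative
        if dist[a, p] + margin > dist[a, n]:              # keep only margin-violating triplets
            triplets.append((a, p.item(), n.item()))
    return triplets


def batch_triplet_loss(feat, triplets, margin=0.1):
    if not triplets:
        return feat.new_zeros(())
    a, p, n = map(list, zip(*triplets))
    d_ap = torch.norm(feat[a] - feat[p], dim=1)
    d_an = torch.norm(feat[a] - feat[n], dim=1)
    return F.relu(d_ap - d_an + margin).mean()


# Strategy 1: mine once on the global features, reuse the same triplets for the Reranker loss.
def loss_shared_triplets(global_feat, rerank_feat, labels):
    triplets = mine_triplets(global_feat, labels)
    return batch_triplet_loss(global_feat, triplets) + batch_triplet_loss(rerank_feat, triplets)


# Strategy 2: mine independently in each feature space and sum the two losses.
def loss_separate_triplets(global_feat, rerank_feat, labels):
    return (batch_triplet_loss(global_feat, mine_triplets(global_feat, labels))
            + batch_triplet_loss(rerank_feat, mine_triplets(rerank_feat, labels)))
```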

Many thanks!

szhu-bytedance commented 1 year ago

I am not sure what you mean by strategy 1, but you can definitely try both and see which one works better. Our sampling strategy is to first train the Reranker with global hard samples, then train with partial hard samples.

Anuradha-Uggi commented 12 months ago

Okay, thank you, I will try that out. I was training the model end-to-end, keeping everything the same as the original code you provided here except the image resolution (changed from 480x640 to 320x240) and the batch size (from 0.0001 to 0.00035), and the validation recall drops drastically:

epoch 1: rerank R@1: 55.1; epoch 2: rerank R@1: 46.1; epoch 3: rerank R@1: 21.8; epoch 4: rerank R@1: 7.6

Do you see why this could happen? Did you also observe this in your ablations? Please share your thoughts; I am keen to find where the bug could be. Many thanks, and waiting for your reply!

szhu-bytedance commented 12 months ago

We did not observe this. I see your resolution is 320x240; do you mean 240x320, since the original images are 480x640? We need to keep the aspect ratio unchanged. I also guess you mean the learning rate when you write 0.00035; this may have a large impact on performance. The R@1 after epoch 1 should be 66.1, and it keeps improving. You might need to tune the learning rate if you are using a different batch size per GPU or a different number of GPUs.
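For reference, halving 480x640 (height x width) while keeping the aspect ratio gives 240x320. A minimal sketch assuming a torchvision preprocessing pipeline (an assumption, not necessarily what the repo uses):

```python
# Illustrative only: downscale 480x640 (H x W) images by 0.5 while keeping the 3:4 aspect ratio.
from torchvision import transforms

resize = transforms.Resize((240, 320))   # (height, width); 320x240 would flip the aspect ratio
# transform = transforms.Compose([resize, transforms.ToTensor()])
```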

Anuradha-Uggi commented 12 months ago

"I saw your resolution is 320x240, do you mean 240x320, as the original images are 480x640" : I haven't followed the aspect ratio. I will try correcting this.

" guess you mean learning rate by using 0.00035": This I followed: if new_batch_size = M x old_batch_size, then new_lr = sqrt(M) x old_lr, where M is a scalar. We don't have a deterministic relationship between lr and batch size, right? A few blogs say if bs increases, lr should be increased by following the above relation, and a few say if one increases, the other should decrease.

"The performance of epoch1 should be 66.1 for R@1 and it keeps improving.": This I may not see since the image resolution is brought down. This is my understanding. You can please correct me.

"You might need to tune the learning rate if you are using different batch size per GPU and number of GPUs.": I think, in the paper you wrote, you used 8 GPUs. So, I am using a single GPU. And the batch size you picked is 64. I used 8. Then 64/8 gpus = 8. Then the baseline lr (i.e., 0.0001) should work for a single GPU with 8 batch size, right? and the rerank_batch_size also I have reduced to 10 from 20.

Anuradha-Uggi commented 12 months ago

Below are the results with lr = 0.0001 and batch size = 8, run on a single GPU. The trend is: Epoch[00]: 65.1; Epoch[01]: 63.6; Epoch[02]: 60.5; Epoch[03]: 56.9; Epoch[04]: 51.4; Epoch[05]: 52.3; Epoch[06]: 53.2; Epoch[07]: 54.7; Epoch[08]: 51.8; Epoch[09]: 53.6; Epoch[10]: 54.2; Epoch[11]: 52.7; Epoch[12]: 52.0; Epoch[13]: 54.5

So it is fluctuating. Please suggest where I might be going wrong.

szhu-bytedance commented 12 months ago

Did you correct the resolution issue? I suggest using a smaller lr.

Anuradha-Uggi commented 12 months ago

Yes, I changed it to 240x320. The trend above is for 0.0001 (assuming you used 8 GPUs with a total batch size of 64). At the beginning of the training code (train_reranker.py), you wrote cuda_available_device='1,2,3,0'. So how many GPUs did you use for the batch size of 64, 8 or 4?

"I suggest using a smaller lr": I am trying 0.0001 / sqrt(2) = 0.0000707 and will wait for the results.

Thanks!!

szhu-bytedance commented 12 months ago

The results do not seem right. I would suggest first using a much smaller lr (e.g. 0.00001) to ensure that the model is optimized properly. The cuda_available_device='1,2,3,0' line is just a comment; we use 8 GPUs for training.

szhu-bytedance commented 12 months ago

The lr scaling law is for SGD. It is still unclear which lr is best for Adam or AdamW under different settings. I would suggest tuning the lr with a x0.1 strategy (scaling it up or down by factors of 10).
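A minimal sketch of that kind of sweep; the optimizer, candidate values, and stand-in model are illustrative assumptions, not the repo's configuration:

```python
import torch

# Hypothetical x0.1 sweep around a base learning rate: train briefly at each value
# and keep the one with the best validation recall.
base_lr = 1e-4
candidates = [base_lr * 10 ** (-k) for k in range(3)]    # 1e-4, 1e-5, 1e-6

for lr in candidates:
    model = torch.nn.Linear(256, 256)                    # stand-in for the real model
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    # ... short training run and validation R@1 measurement would go here ...
```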

Anuradha-Uggi commented 11 months ago

With lr = 0.00001 (0.1 x 0.0001), below is the trend: Epoch[00]: R@1 = 51.8; Epoch[01]: R@1 = 67.7; Epoch[02]: R@1 = 72.8; Epoch[03]: R@1 = 74.7; Epoch[04]: R@1 = 77.6; Epoch[05]: R@1 = 76.4; Epoch[06]: R@1 = 78.9; Epoch[07]: R@1 = 76.8; Epoch[08]: R@1 = 75.1; Epoch[09]: R@1 = 78.8; Epoch[10]: R@1 = 75.3

It increases for the first few epochs and then starts fluctuating. With lr = 0.0001 an epoch took about 1 hour, and now with lr = 0.1 x 0.0001 it takes about 2.5 hours per epoch (this is on a shared resource). What could cause this fluctuation?

Thanks!

szhu-bytedance commented 11 months ago

This is common; the optimization curve is usually not smooth. You can tune the lr and see if you get a better result. Also, 10 epochs are not enough; you might need to train for more epochs.

Anuradha-Uggi commented 11 months ago

Yes, it is currently at epoch 71, and the best so far is R@1 = 83.8%, with lr = 0.00001 and batch size 8. But again, the training recall curve is not smooth; it fluctuates a lot. If possible, and if you have saved it, could you please share the training recall curve for the original model you reported? It would help me understand the training better.

Many thanks!

szhu-bytedance commented 11 months ago

I cannot find our original log now, but I have a log from a 4-GPU run and the results are almost the same. You can check the attached log file: info.log

Anuradha-Uggi commented 11 months ago

Thank you so much, Szhu. This helps a lot. I hope this will be my last question.

In the log file above I see training_lr = 0.0001. In the code you released, where you said you used 8 GPUs, the learning rate is also 0.0001, the same as in the 4-GPU case. I also see that the total batch size in the 4-GPU case is 32 (each GPU gets 8 samples), while in the 8-GPU case it is 64 (each GPU still gets 8 samples). Did you choose the same learning rate in both cases because the per-GPU batch size is kept the same (8 samples)?

szhu-bytedance commented 11 months ago

This is just an empirical setting and I kept the lr unchanged. The result is almost the same.

Anuradha-Uggi commented 11 months ago

Okay. So,

"Did you choose the same learning rate in both cases because the per-GPU batch size is kept the same (8 samples)?": This need not be the case, right? Since the parameter update is applied to gradients aggregated over the total batch, if the total batch size changes (regardless of the chunk each GPU gets), we would have to tune a new learning rate?

szhu-bytedance commented 11 months ago

I am not sure what you mean, but you might need to figure this out yourself.

Anuradha-Uggi commented 11 months ago

I mean that with multi-GPU training we do not need to worry about how many GPUs we use (as long as the batch size is divisible by that number) when tuning the learning rate, right? Distributing the batch is a memory-management technique. For hyperparameter tuning, the total batch size is what we should look at, not the chunk of the batch placed on each individual GPU, correct?
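A minimal sketch of the point being made, assuming PyTorch-style data-parallel training where gradients are averaged across GPUs; all numbers mirror the batch sizes discussed above:

```python
# Effective (total) batch size under data-parallel training. In the argument above,
# this product is what matters for learning-rate tuning, not the per-GPU chunk.
per_gpu_batch = 8           # samples placed on each GPU
num_gpus_4 = 4              # 4-GPU run from the attached log
num_gpus_8 = 8              # 8-GPU run described in the paper

print(per_gpu_batch * num_gpus_4)   # 32 -> total batch in the 4-GPU case
print(per_gpu_batch * num_gpus_8)   # 64 -> total batch in the 8-GPU case

# With DistributedDataParallel, each rank computes gradients on its own chunk and the
# gradients are averaged across ranks, so one optimizer step effectively uses the
# total batch (32 or 64 here), regardless of how it is split across GPUs.
```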

Jeff-Zilence commented 8 months ago

The relationship between lr and batch size is very tricky for multi-GPU training, especially with the Adam optimizer. There is no guarantee that you will get the same results by following any scaling law; you will need to try different values and figure it out.