bytedance / R2Former

Official repository for R2Former: Unified Retrieval and Reranking Transformer for Place Recognition
Apache License 2.0

Online triplet mining for Ranking and Re-ranking stages #13

Closed Anuradha-Uggi closed 8 months ago

Anuradha-Uggi commented 1 year ago

Hi,

Good job, and thanks for releasing the code! I just have one doubt. We have two dependent ranking stages here (global retrieval and the Reranker). If we opt for online mining and want to refresh the cache every iteration, which of the following would you suggest?

1. Sample triplets (every image in a batch can serve as a query, a positive, or a negative) using the global features (256-d embeddings), pass the same triplets to the Reranker, and compute the Reranker loss on them.
2. Sample triplets separately for the global embeddings and for the Reranker's local features, and compute the two losses separately.
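For concreteness, here is a minimal sketch of the two options on a single batch, treating the Reranker output as a flat per-image embedding for simplicity. Every name here (`mine_triplets`, `loss_shared_triplets`, etc.) is illustrative and not the repo's actual API:

```python
import torch
import torch.nn.functional as F


def mine_triplets(embeddings, labels, margin=0.1):
    """Online hard-triplet mining within a batch from pairwise distances."""
    dist = torch.cdist(embeddings, embeddings)            # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)     # same-place mask
    idx = torch.arange(len(labels), device=labels.device)
    triplets = []
    for a in range(len(labels)):
        pos = idx[same[a] & (idx != a)]
        neg = idx[~same[a]]
        if len(pos) == 0 or len(neg) == 0:
            continue
        p = pos[dist[a, pos].argmax()]                    # hardest (farthest) positive
        n = neg[dist[a, neg].argmin()]                    # hardest (closest) negative
        if dist[a, p] + margin > dist[a, n]:              # keep only margin-violating triplets
            triplets.append((a, p.item(), n.item()))
    return triplets


def batch_triplet_loss(feat, triplets, margin=0.1):
    if not triplets:
        return feat.new_zeros(())
    a, p, n = map(list, zip(*triplets))
    d_ap = torch.norm(feat[a] - feat[p], dim=1)
    d_an = torch.norm(feat[a] - feat[n], dim=1)
    return F.relu(d_ap - d_an + margin).mean()


# Strategy 1: mine once on the global features, reuse the same triplets for the Reranker loss.
def loss_shared_triplets(global_feat, rerank_feat, labels):
    triplets = mine_triplets(global_feat, labels)
    return batch_triplet_loss(global_feat, triplets) + batch_triplet_loss(rerank_feat, triplets)


# Strategy 2: mine independently in each feature space and sum the two losses.
def loss_separate_triplets(global_feat, rerank_feat, labels):
    return (batch_triplet_loss(global_feat, mine_triplets(global_feat, labels))
            + batch_triplet_loss(rerank_feat, mine_triplets(rerank_feat, labels)))
```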

Many thanks!

szhu-bytedance commented 1 year ago

I am not sure what you mean by strategy 1, but you can definitely try both and see which one works better. Our sampling strategy is to first train the Reranker with global hard samples, then train with partial hard samples.

Anuradha-Uggi commented 12 months ago

Okay, thank you, I will try that out. I was training the model end-to-end, keeping everything the same as the original code you provided here except the image resolution (changed from 480x640 to 320x240) and the batch size (from 0.0001 to 0.00035), and the validation recall drops drastically:

epoch 1: rerank R@1: 55.1; epoch 2: rerank R@1: 46.1; epoch 3: rerank R@1: 21.8; epoch 4: rerank R@1: 7.6

Do you see why this could happen? Did you also observe this in your ablations? Please share your thoughts; I am keen to find where the bug could be. Many thanks, and waiting for your reply!

szhu-bytedance commented 12 months ago

We did not observe this. I see your resolution is 320x240; do you mean 240x320, since the original images are 480x640? We need to keep the aspect ratio unchanged. I also guess you mean the learning rate when you write 0.00035; this may have a large impact on performance. The R@1 after epoch 1 should be 66.1, and it keeps improving. You might need to tune the learning rate if you are using a different batch size per GPU or a different number of GPUs.
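For reference, halving 480x640 (height x width) while keeping the aspect ratio gives 240x320. A minimal sketch assuming a torchvision preprocessing pipeline (an assumption, not necessarily what the repo uses):

```python
# Illustrative only: downscale 480x640 (H x W) images by 0.5 while keeping the 3:4 aspect ratio.
from torchvision import transforms

resize = transforms.Resize((240, 320))   # (height, width); 320x240 would flip the aspect ratio
# transform = transforms.Compose([resize, transforms.ToTensor()])
```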

Anuradha-Uggi commented 12 months ago

"I saw your resolution is 320x240, do you mean 240x320, as the original images are 480x640" : I haven't followed the aspect ratio. I will try correcting this.

" guess you mean learning rate by using 0.00035": This I followed: if new_batch_size = M x old_batch_size, then new_lr = sqrt(M) x old_lr, where M is a scalar. We don't have a deterministic relationship between lr and batch size, right? A few blogs say if bs increases, lr should be increased by following the above relation, and a few say if one increases, the other should decrease.

"The performance of epoch1 should be 66.1 for R@1 and it keeps improving.": This I may not see since the image resolution is brought down. This is my understanding. You can please correct me.

"You might need to tune the learning rate if you are using different batch size per GPU and number of GPUs.": I think, in the paper you wrote, you used 8 GPUs. So, I am using a single GPU. And the batch size you picked is 64. I used 8. Then 64/8 gpus = 8. Then the baseline lr (i.e., 0.0001) should work for a single GPU with 8 batch size, right? and the rerank_batch_size also I have reduced to 10 from 20.

Anuradha-Uggi commented 12 months ago

Below are the results with lr = 0.0001 and batch size = 8, run on a single GPU. The trend is: Epoch[00]: 65.1; Epoch[01]: 63.6; Epoch[02]: 60.5; Epoch[03]: 56.9; Epoch[04]: 51.4; Epoch[05]: 52.3; Epoch[06]: 53.2; Epoch[07]: 54.7; Epoch[08]: 51.8; Epoch[09]: 53.6; Epoch[10]: 54.2; Epoch[11]: 52.7; Epoch[12]: 52.0; Epoch[13]: 54.5

So it is fluctuating. Please suggest where I might be going wrong.

szhu-bytedance commented 12 months ago

Did you correct the resolution issue? I suggest using a smaller lr.

Anuradha-Uggi commented 12 months ago

Yes, I changed it to 240x320. The trend above is for 0.0001 (assuming you used 8 GPUs with a total batch size of 64). At the beginning of the training code (train_reranker.py), you wrote cuda_available_device='1,2,3,0'. So how many GPUs did you use for the batch size of 64, 8 or 4?

"I suggest using a smaller lr": I am trying 0.0001 / sqrt(2) = 0.0000707 and will wait for the results.

Thanks!!

szhu-bytedance commented 12 months ago

The results do not seem right. I would suggest first using a much smaller lr (e.g. 0.00001) to ensure that the model is optimized properly. The cuda_available_device='1,2,3,0' line is just a comment; we use 8 GPUs for training.

szhu-bytedance commented 12 months ago

The lr scaling law is for SGD. It is still unclear which lr is best for Adam or AdamW under different settings. I would suggest tuning the lr with a x0.1 strategy (scaling it up or down by factors of 10).
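A minimal sketch of that kind of sweep; the optimizer, candidate values, and stand-in model are illustrative assumptions, not the repo's configuration:

```python
import torch

# Hypothetical x0.1 sweep around a base learning rate: train briefly at each value
# and keep the one with the best validation recall.
base_lr = 1e-4
candidates = [base_lr * 10 ** (-k) for k in range(3)]    # 1e-4, 1e-5, 1e-6

for lr in candidates:
    model = torch.nn.Linear(256, 256)                    # stand-in for the real model
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    # ... short training run and validation R@1 measurement would go here ...
```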

Anuradha-Uggi commented 11 months ago

With lr = 0.00001 (0.1 x 0.0001), below is the trend: Epoch[00]: R@1 = 51.8; Epoch[01]: R@1 = 67.7; Epoch[02]: R@1 = 72.8; Epoch[03]: R@1 = 74.7; Epoch[04]: R@1 = 77.6; Epoch[05]: R@1 = 76.4; Epoch[06]: R@1 = 78.9; Epoch[07]: R@1 = 76.8; Epoch[08]: R@1 = 75.1; Epoch[09]: R@1 = 78.8; Epoch[10]: R@1 = 75.3

It increases for the first few epochs and then starts fluctuating. With lr = 0.0001 an epoch took about 1 hour, and now with lr = 0.1 x 0.0001 it takes about 2.5 hours per epoch (this is on a shared resource). What could cause this fluctuation?

Thanks!

szhu-bytedance commented 11 months ago

This is common; the optimization curve is usually not smooth. You can tune the lr and see if you get a better result. Also, 10 epochs are not enough; you might need to train for more epochs.

Anuradha-Uggi commented 11 months ago

Yes, it is currently at epoch 71, and the best so far is R@1 = 83.8%, with lr = 0.00001 and batch size 8. But again, the training recall curve is not smooth; it fluctuates a lot. If possible, and if you have saved it, could you please share the training recall curve for the original model you reported? It would help me understand the training better.

Many thanks!

szhu-bytedance commented 11 months ago

I cannot find our original log now, but I have a log from a 4-GPU run and the results are almost the same. You can check the attached log file: info.log

Anuradha-Uggi commented 11 months ago

Thank you so much, Szhu. This helps a lot. I hope this will be my last question.

In the log file above I see training_lr = 0.0001. In the code you released, where you said you used 8 GPUs, the learning rate is also 0.0001, the same as in the 4-GPU case. I also see that the total batch size in the 4-GPU case is 32 (each GPU gets 8 samples), while in the 8-GPU case it is 64 (each GPU still gets 8 samples). Did you choose the same learning rate in both cases because the per-GPU batch size is kept the same (8 samples)?

szhu-bytedance commented 11 months ago

This is just an empirical setting and I kept the lr unchanged. The result is almost the same.

Anuradha-Uggi commented 11 months ago

Okay. So,

"Did you choose the same learning rate in both cases because the per-GPU batch size is kept the same (8 samples)?": This need not be the case, right? Since the parameter update is applied to gradients aggregated over the total batch, if the total batch size changes (regardless of the chunk each GPU gets), we would have to tune a new learning rate?

szhu-bytedance commented 11 months ago

I am not sure what you mean, but you might need to figure this out yourself.

Anuradha-Uggi commented 11 months ago

I mean that with multi-GPU training we do not need to worry about how many GPUs we use (as long as the batch size is divisible by that number) when tuning the learning rate, right? Distributing the batch is a memory-management technique. For hyperparameter tuning, the total batch size is what we should look at, not the chunk of the batch placed on each individual GPU, correct?
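A minimal sketch of the point being made, assuming PyTorch-style data-parallel training where gradients are averaged across GPUs; all numbers mirror the batch sizes discussed above:

```python
# Effective (total) batch size under data-parallel training. In the argument above,
# this product is what matters for learning-rate tuning, not the per-GPU chunk.
per_gpu_batch = 8           # samples placed on each GPU
num_gpus_4 = 4              # 4-GPU run from the attached log
num_gpus_8 = 8              # 8-GPU run described in the paper

print(per_gpu_batch * num_gpus_4)   # 32 -> total batch in the 4-GPU case
print(per_gpu_batch * num_gpus_8)   # 64 -> total batch in the 8-GPU case

# With DistributedDataParallel, each rank computes gradients on its own chunk and the
# gradients are averaged across ranks, so one optimizer step effectively uses the
# total batch (32 or 64 here), regardless of how it is split across GPUs.
```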

Jeff-Zilence commented 8 months ago

The relationship between lr and batch size is very tricky for multi-GPU training, especially with the Adam optimizer. There is no guarantee that you will get the same results by following any scaling law; you will need to try different values and figure it out.