Hi, thank you for your interest in our work. The number of iterations in your configuration is twice that of our original configuration, so I believe the solution is to either: 1) reduce the total number of epochs, as well as the number of epochs for freezing BERT, dropping the contrastive loss, etc.; or 2) accumulate gradients and update the optimizer every two steps, with a normalization term on the losses (e.g., multiply them by 1/2). Note that Charades is the smallest dataset for this task, so a small performance fluctuation is common. I believe a performance gap of less than 0.5 indicates a good reproduction. For further questions, please feel free to comment here.
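A minimal PyTorch-style sketch of the second option, assuming a generic `model`, `optimizer`, and `train_loader` (these names are placeholders, not the repo's actual training loop):

```python
# Sketch of gradient accumulation over 2 steps with loss normalization.
# `model`, `optimizer`, and `train_loader` stand in for the objects
# built by the actual training script.
accum_steps = 2

optimizer.zero_grad()
for step, batch in enumerate(train_loader):
    loss = model(batch)              # combined training loss for this mini-batch
    (loss / accum_steps).backward()  # normalize so accumulated gradients match one larger batch

    if (step + 1) % accum_steps == 0:
        optimizer.step()             # one optimizer update per two mini-batches
        optimizer.zero_grad()
```

This keeps the number of optimizer updates (and the effective batch size per update) close to the original 4-GPU schedule.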
I have adopted your second suggestion ('accumulate gradients and update the optimizer every two steps, with a normalization term on the losses, e.g., multiply by 1/2'). In addition, gradient accumulation is often accompanied by an adjusted learning rate; since the number of accumulation steps is 2, I set learning_rate = original_learning_rate * sqrt(2). With this, I get similar results. Thank you for your help; it has taught me a lot.
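For concreteness, a small snippet of how that square-root scaling might be applied (the base learning rate here is only an illustrative placeholder, not the value from the repo's config):

```python
import math

accum_steps = 2
base_lr = 1e-4                                 # placeholder; use the learning rate from the original config
scaled_lr = base_lr * math.sqrt(accum_steps)   # sqrt scaling rule described above
print(scaled_lr)                               # ~1.41e-4 for this placeholder value
```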
Thank you for proposing a very interesting work. On Charades, since the original number of GPUs is 4 and the original batch size is 48, I set the batch size to 24 on two 3090s to keep the same number of samples per GPU. Other configurations remain the same. However, the scores I get are:
This large gap confuses me. What was your training environment, and if I don't have 4 GPUs, is there a way to reach the scores reported in the paper? Looking forward to your reply.
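For reference, a rough sketch of the batch arithmetic behind this setup (the sample count below is a made-up placeholder, only to illustrate the ratio):

```python
def steps_per_epoch(num_samples, total_batch_size):
    """Optimizer updates per epoch for a given effective (total) batch size."""
    return num_samples // total_batch_size

# Per-GPU load is unchanged (48 / 4 == 24 / 2 == 12 samples per GPU),
# but the effective batch size halves, so updates per epoch roughly double.
num_samples = 12000                       # placeholder training-set size, for illustration only
print(steps_per_epoch(num_samples, 48))   # original 4-GPU config
print(steps_per_epoch(num_samples, 24))   # two-GPU config: ~2x as many updates per epoch
```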