Thanks! It's actually pretty straightforward - I used a single GPU for training, so the per-GPU batch size was equal to the global batch size.
Since the loss is the DINO loss, which is an instance discrimination task, I do think a bigger batch size offers some benefit for optimisation; however, I doubt it would truly change the prototypes that are learned. I haven't specifically looked at how sensitive the model is to batch size.
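For clarity, a minimal sketch of how the two relate under standard PyTorch DistributedDataParallel conventions (`per_gpu_batch_size` is just a placeholder name, not a variable from this repo):

```python
import torch.distributed as dist

# Per-GPU batch size is what each DataLoader receives; the effective (global)
# batch size multiplies it by the number of participating processes.
per_gpu_batch_size = 128
world_size = dist.get_world_size() if dist.is_initialized() else 1
global_batch_size = per_gpu_batch_size * world_size  # 128 when training on a single GPU
```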
@Nanne I have a few more questions:
1) You mention in the paper that you use the soft Gumbel-softmax for the first 15 epochs and then switch to the hard Gumbel-softmax. However, in this line the Gumbel-softmax for the student is set to False and hard is set to True before training starts. This results in hard attention (without Gumbel noise) for all training epochs, and the line here is effectively doing nothing, since hard is already True from the beginning of training. Is that a bug in the code, or should it be like this? (A rough sketch of the schedule I expected is below, after these questions.)
2) I assume the teacher uses the default setting of hard=False, gumbel=True throughout training?
3) Are you taking the prototypes of the teacher or the student at the end? It seems you use the student, whereas the original DINO paper reports that the teacher is better and uses the teacher.
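To make sure I understand question 1 and 2, here is a rough sketch of the schedule I would have expected from the paper, written with `torch.nn.functional.gumbel_softmax`; the function names, `assignment_logits`, and the 15-epoch threshold are placeholders, not the actual code in this repo:

```python
import torch
import torch.nn.functional as F

def student_assignment(assignment_logits: torch.Tensor, epoch: int, tau: float = 1.0):
    # Student: soft Gumbel-softmax for the first 15 epochs, then the hard
    # (straight-through) variant for the remaining epochs.
    hard = epoch >= 15
    return F.gumbel_softmax(assignment_logits, tau=tau, hard=hard, dim=-1)

def teacher_assignment(assignment_logits: torch.Tensor, tau: float = 1.0):
    # Teacher: my assumption is gumbel=True, hard=False for the whole of training.
    return F.gumbel_softmax(assignment_logits, tau=tau, hard=False, dim=-1)
```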
Thanks in advance
Thanks, good catch. I did some experimenting with different configurations and left these lines in. To reproduce the paper, lines 174 and 175 should basically be removed. Will update this!
Hi, thanks for your work. In the paper you mention that you use a batch size of 128, but here in the code it appears you are doing multi-GPU training with a batch size of 128 per GPU. What is the global batch size you used? I am also not sure whether the model is sensitive to larger batch sizes, since it is prototype-based.
Regards