Thanks! It's actually pretty straightforward - I used a single GPU for training, so the per-GPU batch size was equal to the global batch size.
Since the loss is the DINO loss, which is an instance discrimination task, I do think a bigger batch size offers some benefit for optimisation; however, I doubt it would truly change the prototypes that are learned. I haven't specifically looked at how sensitive the model is to batch size.
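For clarity, a minimal sketch of how the two relate under standard PyTorch DistributedDataParallel conventions (`per_gpu_batch_size` is just a placeholder name, not a variable from this repo):

```python
import torch.distributed as dist

# Per-GPU batch size is what each DataLoader receives; the effective (global)
# batch size multiplies it by the number of participating processes.
per_gpu_batch_size = 128
world_size = dist.get_world_size() if dist.is_initialized() else 1
global_batch_size = per_gpu_batch_size * world_size  # 128 when training on a single GPU
```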
@Nanne I have a few more questions:
1) You mention in the paper that you use the soft Gumbel-softmax for the first 15 epochs and then switch to the hard Gumbel-softmax. However, in this line the Gumbel-softmax for the student is set to False and hard is set to True before training starts. This results in hard attention (without Gumbel noise) for all training epochs, and the line here is effectively doing nothing, since hard is already True from the beginning of training. Is that a bug in the code, or should it be like this? (A rough sketch of the schedule I expected is below, after these questions.)
2) I assume the teacher uses the default setting of hard=False, gumbel=True throughout training?
3) Are you taking the prototypes of the teacher or the student at the end? It seems you use the student, whereas the original DINO paper reports that the teacher is better and uses the teacher.
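To make sure I understand question 1 and 2, here is a rough sketch of the schedule I would have expected from the paper, written with `torch.nn.functional.gumbel_softmax`; the function names, `assignment_logits`, and the 15-epoch threshold are placeholders, not the actual code in this repo:

```python
import torch
import torch.nn.functional as F

def student_assignment(assignment_logits: torch.Tensor, epoch: int, tau: float = 1.0):
    # Student: soft Gumbel-softmax for the first 15 epochs, then the hard
    # (straight-through) variant for the remaining epochs.
    hard = epoch >= 15
    return F.gumbel_softmax(assignment_logits, tau=tau, hard=hard, dim=-1)

def teacher_assignment(assignment_logits: torch.Tensor, tau: float = 1.0):
    # Teacher: my assumption is gumbel=True, hard=False for the whole of training.
    return F.gumbel_softmax(assignment_logits, tau=tau, hard=False, dim=-1)
```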
Thanks in advance
Thanks, good catch. I did some experimenting with different configurations and left these lines in. To reproduce the paper, lines 174 and 175 should basically be removed. Will update this!
Hi, thanks for your work. In the paper you mention that you use a batch size of 128, but here in the code it appears you are doing multi-GPU training with a batch size of 128 per GPU. What is the global batch size you used? I am also not sure whether the model is sensitive to larger batch sizes, since it is prototype-based.
Regards