Nice work and code! I have a small question about the teacher model in your code. I noticed that only the student model is wrapped in DDP. So if I train with 4 GPUs under DDP, there is only one teacher model in memory, but there are 4 processes (one per GPU), each applying the EMA update to the teacher. In other words, could the teacher model be updated 4 times per iteration?
I couldn't find any reduce or broadcast operation in the code.
Or maybe Detectron2 has some synchronization operation that I don't know about.
Hope to get your reply! Thanks again for your excellent work!
Yes, we only apply DDP to the student model, since the teacher is only used for inference. A model that is used purely for inference is not allowed to be wrapped in DDP, so we chose not to apply it to the teacher model.
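For context on the synchronization question: each DDP process holds its own copy of the teacher in its own memory, and because DDP keeps the student replicas identical after every optimizer step, the per-process EMA updates are identical as well. The teacher copies therefore never diverge, and no reduce or broadcast is needed. Below is a minimal sketch of how such an EMA update typically looks; this is not the repository's exact code, and `student`, `teacher`, and `ema_decay` are hypothetical names.

```python
import torch

@torch.no_grad()
def update_teacher(student, teacher, ema_decay=0.9996):
    # Hypothetical EMA update: teacher <- d * teacher + (1 - d) * student.
    # Every DDP rank runs this on its own local teacher copy; since DDP
    # keeps the student replicas identical, the teachers stay identical too.
    student_model = (
        student.module
        if isinstance(student, torch.nn.parallel.DistributedDataParallel)
        else student
    )
    student_state = student_model.state_dict()
    for name, value in teacher.state_dict().items():
        if value.dtype.is_floating_point:
            value.mul_(ema_decay).add_(student_state[name], alpha=1.0 - ema_decay)
        else:
            # Non-float buffers (e.g. BatchNorm's num_batches_tracked) are copied.
            value.copy_(student_state[name])
```

Calling this once per iteration on every rank yields the same teacher everywhere, which is why wrapping the teacher in DDP would add communication overhead without changing the result.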