facebookresearch / adaptive_teacher

This repo provides the source code for "Cross-Domain Adaptive Teacher for Object Detection".

Possible to wrap the teacher model in DistributedDataParallel? #35

Closed Weijiang-Xiong closed 1 year ago

Weijiang-Xiong commented 1 year ago

Hello, I'm trying to use your idea in my thesis work. Thanks for the great idea and code! I set `requires_grad=False` for all the parameters in the teacher model and wrapped it in DistributedDataParallel. But with my own code the training gets stuck at `loss.backward()`, even though the losses are not NaN. If I lower the batch size and run on just 1 GPU, the code works fine, but with DistributedDataParallel the training gets stuck immediately.

Do you have any idea what might cause this? Could it be that the exponential moving average somehow affects the computation graph? Thanks
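
For context, here is a minimal sketch of the setup I describe (PyTorch; `build_detector` and `local_rank` are placeholders for my own detector builder and the usual rank variable). Since the frozen teacher receives no gradients, it can stay a plain per-rank module while only the student is wrapped in DDP, and the EMA update runs under `torch.no_grad()` so it never enters the computation graph:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

student = build_detector().cuda()  # placeholder for my detector builder
teacher = build_detector().cuda()
teacher.load_state_dict(student.state_dict())

# Freeze the teacher: it is only used to generate pseudo-labels.
for p in teacher.parameters():
    p.requires_grad_(False)

# Only the student needs gradient synchronization. A fully frozen module
# gives DDP nothing to reduce (some PyTorch versions refuse to wrap one),
# so the teacher can stay an ordinary per-rank copy.
student = DDP(student, device_ids=[local_rank])

@torch.no_grad()  # the EMA update runs outside autograd entirely
def ema_update(teacher, student, keep_rate=0.9996):
    for pt, ps in zip(teacher.parameters(), student.module.parameters()):
        pt.mul_(keep_rate).add_(ps, alpha=1.0 - keep_rate)
```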

yujheli commented 1 year ago

Were you using my code, or did you write your own? I think the current version should not get stuck.

Weijiang-Xiong commented 1 year ago

> Were you using my code, or did you write your own? I think the current version should not get stuck.

Thanks for the reply! I'm using my own implementation, since I want to apply the adaptive teacher idea to single-stage detectors like RetinaNet and FCOS. I found the problem: it was caused by inconsistent gradients across GPUs, which was a bug in my own code. I filter the images by the number of instances in their pseudo-labels, and if the filtered sub-batch on some GPUs is empty while others are not, the gradients become inconsistent and backward will hang. See the sketch of the fix below.
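
For anyone who hits the same hang: the fix is to make sure every rank contributes gradients for the same set of parameters on every step. A hedged sketch of one common workaround, the zero-loss trick (`compute_unsup_loss` is a placeholder for my own loss function):

```python
import torch

def unsup_loss_or_zero(student, filtered_batch):
    """Return the unsupervised loss, or a differentiable zero that still
    touches every parameter, so DDP's gradient all-reduce sees the same
    gradient set on every rank and backward() cannot hang."""
    if len(filtered_batch) > 0:
        return compute_unsup_loss(student, filtered_batch)  # placeholder
    # Empty sub-batch on this rank after pseudo-label filtering: produce
    # a zero loss connected to all parameters so each one gets a zero grad.
    return sum(p.sum() for p in student.parameters()) * 0.0
```

An alternative is to `all_reduce` an 'is my batch empty' flag and have every rank skip the unsupervised term together; either way, the point is that all ranks must agree on which parameters receive gradients.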