WangYZ1608 / Knowledge-Distillation-via-ND

The official implementation for paper: Improving Knowledge Distillation via Regularizing Feature Norm and Direction

why teacher model and student model have different operation? #6

Closed: HanGuangXin closed this issue 1 year ago

HanGuangXin commented 1 year ago

When args.distributed is set to True:

model = torch.nn.parallel.DistributedDataParallel(model)
teacher = torch.nn.DataParallel(teacher, device_ids=[0, 1, 2, 3, 4, 5, 6, 7])
teacher.cuda()

Why does the student model use torch.nn.parallel.DistributedDataParallel() while the teacher model uses torch.nn.DataParallel?

WangYZ1608 commented 1 year ago

You only need to call teacher.cuda(args.gpu); there is no need to wrap the teacher in DP or DDP.
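
For reference, a minimal sketch of what that setup could look like in the training script. The names model, teacher, and args.gpu come from the thread; moving the student to its local GPU before wrapping it in DDP, and freezing the teacher with eval()/requires_grad_(False), are assumptions about a typical knowledge-distillation setup, not the repo's exact code.

import torch

def setup_models(model, teacher, args):
    # Student: lives on this process's local GPU and is wrapped in DDP so
    # gradients are synchronized across processes during training.
    model = model.cuda(args.gpu)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])

    # Teacher: used for inference only, so no DP/DDP wrapper is needed.
    # Each DDP process simply keeps its own frozen copy on its local GPU.
    teacher = teacher.cuda(args.gpu)
    teacher.eval()
    for p in teacher.parameters():
        p.requires_grad_(False)

    return model, teacher

Since the teacher receives no gradient updates, wrapping it in DDP would only add synchronization overhead; placing one copy per process on the local GPU is enough.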