Instructions To Reproduce the Bug:

what changes you made (git diff) or what code you wrote:
I trained a teacher model using configs/Base-bagtricks.yml instead of projects/FastDistill/configs/sbs_r101ibn.yml before distillation. Since I did not want to train the teacher model again, I ran distillation with this teacher model on 4 GPUs and hit a bug: the training images come from different CUDA devices, but the teacher model always stays on device 0.
Finally, I changed the code in ./fastreid/modeling/meta_arch/distiller.py from
```python
for model_t in self.model_ts:
    t_feat = model_t.backbone(images)
    t_output = model_t.heads(t_feat, targets)
    t_outputs.append(t_output)
```

to

```python
for model_t in self.model_ts:
    model_t.to(self.device)  # added this line
    t_feat = model_t.backbone(images)
    t_output = model_t.heads(t_feat, targets)
    t_outputs.append(t_output)
```
With this change, the bug no longer appears. I would like to know whether this is actually a bug, and whether there is a better way to fix it. Thanks!
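An alternative I am considering (just a sketch; place_teachers is an illustrative helper name, not an existing fastreid function) is to move each frozen teacher to the per-process device once, right after the teachers are built/loaded, instead of calling .to() inside every forward pass:

```python
from typing import List

import torch
from torch import nn


def place_teachers(teachers: List[nn.Module], device: torch.device) -> List[nn.Module]:
    """Sketch only: move each frozen teacher to the per-process device once,
    e.g. right after the Distiller builds/loads its teachers, instead of
    doing the move inside forward()."""
    for model_t in teachers:
        model_t.to(device)            # move the teacher weights to this process's GPU
        model_t.eval()                # keep the teacher in eval mode during distillation
        for p in model_t.parameters():
            p.requires_grad_(False)   # no gradients should flow into the teacher
    return teachers
```

Since Module.to() is essentially a no-op once the weights are already on the right device, the one-line fix in forward() should also be harmless; doing the move once just keeps forward() free of device bookkeeping.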
what exact command you run:
```
python3 projects/FastDistill/train_net.py --config-file ./projects/FastDistill/configs/no_net.yml --num-gpus 4
```
what you observed (including full logs):
```
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "./fast-reid/fastreid/engine/launch.py", line 103, in _distributed_worker
main_func(*args)
File "./fast-reid/projects/FastDistill/train_net.py", line 41, in main
return trainer.train()
File "./fast-reid/fastreid/engine/defaults.py", line 348, in train
super().train(self.start_epoch, self.max_epoch, self.iters_per_epoch)
File "./fast-reid/fastreid/engine/train_loop.py", line 145, in train
self.run_step()
File "./fast-reid/fastreid/engine/defaults.py", line 357, in run_step
self._trainer.run_step()
File "./fast-reid/fastreid/engine/train_loop.py", line 343, in run_step
loss_dict = self.model(data)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/distributed.py", line 619, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "./fast-reid/fastreid/modeling/meta_arch/distiller.py", line 104, in forward
t_feat = model_t.backbone(images)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "./fast-reid/fastreid/modeling/backbones/resnet.py", line 184, in forward
x = self.conv1(x)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 423, in forward
return self._conv_forward(input, self.weight)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 2 does not equal 0 (while checking arguments for cudnn_convolution)
```
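For reference, this is how I confirmed the mismatch (illustrative helper only, not part of fastreid): compare the device of the batch with the device of the teacher's parameters right before the teacher forward pass.

```python
import torch
from torch import nn


def check_same_device(images: torch.Tensor, model_t: nn.Module) -> None:
    # Illustrative debug check (hypothetical helper, not fastreid code):
    # verify that the input batch and the teacher's weights share a device.
    param_device = next(model_t.parameters()).device
    print(f"images on {images.device}, teacher on {param_device}")
    assert images.device == param_device, (
        f"device mismatch: inputs on {images.device}, teacher on {param_device}"
    )
```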
please simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset.
Expected behavior:
If there is no obvious error in "what you observed" provided above, please tell us the expected behavior.
Environment:
Provide your environment information using the following command:
If your issue looks like an installation issue / environment issue, please first try to solve it yourself with the instructions in