JDAI-CV / fast-reid

SOTA Re-identification Methods and Toolbox
Apache License 2.0

meet a bug when doing fastDistill #690

Closed. BaoWentz closed this issue 1 year ago.

BaoWentz commented 1 year ago

Instructions To Reproduce the πŸ› Bug:

  1. what changes you made (git diff) or what code you wrote: Before distillation, I trained a teacher model using configs/Base-bagtricks.yml instead of projects/FastDistill/configs/sbs_r101ibn.yml. Since I did not want to retrain the teacher, I tried to run distillation with this existing teacher model on 4 GPUs and hit a bug: the training images arrive on different CUDA devices, but the teacher model always stays on device 0. As a workaround, I changed the code in ./fastreid/modeling/meta_arch/distiller.py from
    for model_t in self.model_ts:
      t_feat = model_t.backbone(images)
      t_output = model_t.heads(t_feat, targets)
      t_outputs.append(t_output)

    into

    for model_t in self.model_ts:
      model_t.to(self.device)  # added this line.
      t_feat = model_t.backbone(images)
      t_output = model_t.heads(t_feat, targets)
      t_outputs.append(t_output)

    With this change, the error no longer appears. I want to know whether this is actually a bug, and whether there is a better way to fix it (a possible alternative is sketched after this list). Thanks!

  2. what exact command you run: python3 projects/FastDistill/train_net.py --config-file ./projects/FastDistill/configs/no_net.yml --num-gpus 4
  3. what you observed (including full logs):
    Process 2 terminated with the following error:
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
        fn(i, *args)
      File "./fast-reid/fastreid/engine/launch.py", line 103, in _distributed_worker
        main_func(*args)
      File "./fast-reid/projects/FastDistill/train_net.py", line 41, in main
        return trainer.train()
      File "./fast-reid/fastreid/engine/defaults.py", line 348, in train
        super().train(self.start_epoch, self.max_epoch, self.iters_per_epoch)
      File "./fast-reid/fastreid/engine/train_loop.py", line 145, in train
        self.run_step()
      File "./fast-reid/fastreid/engine/defaults.py", line 357, in run_step
        self._trainer.run_step()
      File "./fast-reid/fastreid/engine/train_loop.py", line 343, in run_step
        loss_dict = self.model(data)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/distributed.py", line 619, in forward
        output = self.module(*inputs[0], **kwargs[0])
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "./fast-reid/fastreid/modeling/meta_arch/distiller.py", line 104, in forward
        t_feat = model_t.backbone(images)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "./fast-reid/fastreid/modeling/backbones/resnet.py", line 184, in forward
        x = self.conv1(x)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 423, in forward
        return self._conv_forward(input, self.weight)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
        self.padding, self.dilation, self.groups)
    RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 2 does not equal 0 (while checking arguments for cudnn_convolution)
  4. please simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset.
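
A possibly cleaner alternative to calling .to() inside forward() on every iteration would be to move the teacher models to the local device once, right after they are built and their checkpoints are loaded. The following is only a sketch against the attribute names used in ./fastreid/modeling/meta_arch/distiller.py (self.model_ts holding the teacher models, self.device holding the current process's device); it is not an official patch.

    # One-time device placement for the teacher models (sketch, not an
    # official fix). self.model_ts and self.device follow the names used in
    # fastreid/modeling/meta_arch/distiller.py.
    for model_t in self.model_ts:
        model_t.to(self.device)  # move this teacher to the local worker's GPU
        model_t.eval()           # teachers stay in eval mode during distillation

A quick way to confirm the mismatch is to print images.device and next(model_t.parameters()).device inside forward(): on ranks other than 0, the images report the local rank's device while the teacher parameters report cuda:0, which matches the cudnn_convolution error in the traceback above.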

Expected behavior:

If there are no obvious errors in "what you observed" provided above, please tell us the expected behavior.

Environment:

Provide your environment information using the following command:

wget -nc -q https://github.com/facebookresearch/detectron2/raw/master/detectron2/utils/collect_env.py && python collect_env.py

If your issue looks like an installation issue / environment issue, please first try to solve it yourself with the instructions in

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.