JDAI-CV / fast-reid

SOTA Re-identification Methods and Toolbox
Apache License 2.0

meet a bug when doing fastDistill #690

Closed. BaoWentz closed this issue 1 year ago.

BaoWentz commented 1 year ago

Instructions To Reproduce the πŸ› Bug:

  1. what changes you made (git diff) or what code you wrote: Before distillation, I trained a teacher model using configs/Base-bagtricks.yml instead of projects/FastDistill/configs/sbs_r101ibn.yml. Since I did not want to retrain the teacher, I tried to run distillation with this existing teacher model on 4 GPUs and hit a bug: the training images arrive on different CUDA devices, but the teacher model always stays on device 0. As a workaround, I changed the code in ./fastreid/modeling/meta_arch/distiller.py from
    for model_t in self.model_ts:
      t_feat = model_t.backbone(images)
      t_output = model_t.heads(t_feat, targets)
      t_outputs.append(t_output)

    into

    for model_t in self.model_ts:
      model_t.to(self.device)  # added this line.
      t_feat = model_t.backbone(images)
      t_output = model_t.heads(t_feat, targets)
      t_outputs.append(t_output)

    With this change, the error no longer appears. I want to know whether this is actually a bug, and whether there is a better way to fix it (a possible alternative is sketched after this list). Thanks!

  2. what exact command you run: python3 projects/FastDistill/train_net.py --config-file ./projects/FastDistill/configs/no_net.yml --num-gpus 4
  3. what you observed (including full logs):
    Process 2 terminated with the following error:
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
        fn(i, *args)
      File "./fast-reid/fastreid/engine/launch.py", line 103, in _distributed_worker
        main_func(*args)
      File "./fast-reid/projects/FastDistill/train_net.py", line 41, in main
        return trainer.train()
      File "./fast-reid/fastreid/engine/defaults.py", line 348, in train
        super().train(self.start_epoch, self.max_epoch, self.iters_per_epoch)
      File "./fast-reid/fastreid/engine/train_loop.py", line 145, in train
        self.run_step()
      File "./fast-reid/fastreid/engine/defaults.py", line 357, in run_step
        self._trainer.run_step()
      File "./fast-reid/fastreid/engine/train_loop.py", line 343, in run_step
        loss_dict = self.model(data)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/distributed.py", line 619, in forward
        output = self.module(*inputs[0], **kwargs[0])
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "./fast-reid/fastreid/modeling/meta_arch/distiller.py", line 104, in forward
        t_feat = model_t.backbone(images)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "./fast-reid/fastreid/modeling/backbones/resnet.py", line 184, in forward
        x = self.conv1(x)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 423, in forward
        return self._conv_forward(input, self.weight)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
        self.padding, self.dilation, self.groups)
    RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 2 does not equal 0 (while checking arguments for cudnn_convolution)
  4. please simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset.
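
A possibly cleaner alternative to calling .to() inside forward() on every iteration would be to move the teacher models to the local device once, right after they are built and their checkpoints are loaded. The following is only a sketch against the attribute names used in ./fastreid/modeling/meta_arch/distiller.py (self.model_ts holding the teacher models, self.device holding the current process's device); it is not an official patch.

    # One-time device placement for the teacher models (sketch, not an
    # official fix). self.model_ts and self.device follow the names used in
    # fastreid/modeling/meta_arch/distiller.py.
    for model_t in self.model_ts:
        model_t.to(self.device)  # move this teacher to the local worker's GPU
        model_t.eval()           # teachers stay in eval mode during distillation

A quick way to confirm the mismatch is to print images.device and next(model_t.parameters()).device inside forward(): on ranks other than 0, the images report the local rank's device while the teacher parameters report cuda:0, which matches the cudnn_convolution error in the traceback above.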

Expected behavior:

If there are no obvious errors in "what you observed" provided above, please tell us the expected behavior.

Environment:

Provide your environment information using the following command:

wget -nc -q https://github.com/facebookresearch/detectron2/raw/master/detectron2/utils/collect_env.py && python collect_env.py

If your issue looks like an installation issue / environment issue, please first try to solve it yourself with the instructions in

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.