JDAI-CV / fast-reid

SOTA Re-identification Methods and Toolbox
Apache License 2.0

FastRetri #667

Closed rrjia closed 2 years ago

rrjia commented 2 years ago

python3 projects/FastRetri/train_net.py --config-file projects/FastRetri/configs/cub.yml --num-gpus 1

Traceback (most recent call last):
  File "./fastreid/engine/train_loop.py", line 145, in train
    self.run_step()
  File "./fastreid/engine/defaults.py", line 357, in run_step
    self._trainer.run_step()
  File "./fastreid/engine/train_loop.py", line 354, in run_step
    loss_dict = self.model(data)
  File "/ssd7/exec/jiaruoran/anaconda3/envs/jiaruoran/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./fastreid/modeling/meta_arch/baseline.py", line 112, in forward
    outputs = self.heads(features, targets)
  File "/ssd7/exec/jiaruoran/anaconda3/envs/jiaruoran/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./fastreid/modeling/heads/embedding_head.py", line 124, in forward
    neck_feat = self.bottleneck(pool_feat)
  File "/ssd7/exec/jiaruoran/anaconda3/envs/jiaruoran/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/ssd7/exec/jiaruoran/anaconda3/envs/jiaruoran/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/ssd7/exec/jiaruoran/anaconda3/envs/jiaruoran/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/ssd7/exec/jiaruoran/anaconda3/envs/jiaruoran/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 519, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/ssd7/exec/jiaruoran/anaconda3/envs/jiaruoran/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 638, in get_world_size
    return _get_group_size(group)
  File "/ssd7/exec/jiaruoran/anaconda3/envs/jiaruoran/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
    _check_default_pg()
  File "/ssd7/exec/jiaruoran/anaconda3/envs/jiaruoran/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 211, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
[06/07 18:36:18 fastreid.engine.hooks]: Total training time: 0:00:01 (0:00:00 on hooks)
Traceback (most recent call last):
  File "projects/FastRetri/train_net.py", line 69, in <module>
    args=(args,),
  File "./fastreid/engine/launch.py", line 71, in launch
    main_func(*args)
  File "projects/FastRetri/train_net.py", line 57, in main
    return trainer.train()
  File "./fastreid/engine/defaults.py", line 348, in train
    super().train(self.start_epoch, self.max_epoch, self.iters_per_epoch)
  File "./fastreid/engine/train_loop.py", line 145, in train
    self.run_step()
  File "./fastreid/engine/defaults.py", line 357, in run_step
    self._trainer.run_step()
  File "./fastreid/engine/train_loop.py", line 354, in run_step
    loss_dict = self.model(data)
  File "/ssd7/exec/jiaruoran/anaconda3/envs/jiaruoran/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./fastreid/modeling/meta_arch/baseline.py", line 112, in forward
    outputs = self.heads(features, targets)
  File "/ssd7/exec/jiaruoran/anaconda3/envs/jiaruoran/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./fastreid/modeling/heads/embedding_head.py", line 124, in forward
    neck_feat = self.bottleneck(pool_feat)
  File "/ssd7/exec/jiaruoran/anaconda3/envs/jiaruoran/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/ssd7/exec/jiaruoran/anaconda3/envs/jiaruoran/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/ssd7/exec/jiaruoran/anaconda3/envs/jiaruoran/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/ssd7/exec/jiaruoran/anaconda3/envs/jiaruoran/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 519, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/ssd7/exec/jiaruoran/anaconda3/envs/jiaruoran/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 638, in get_world_size
    return _get_group_size(group)
  File "/ssd7/exec/jiaruoran/anaconda3/envs/jiaruoran/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
    _check_default_pg()
  File "/ssd7/exec/jiaruoran/anaconda3/envs/jiaruoran/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 211, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
rrjia commented 2 years ago

Solved: after replacing the syncBN layers in the model with regular BN, training runs normally.
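For anyone hitting the same assertion: `nn.SyncBatchNorm` calls `torch.distributed.get_world_size()` during a training-mode forward pass, which fails when no process group has been initialized (as in the single-process run above). Below is a minimal sketch of the swap in plain PyTorch. The helper name `revert_sync_batchnorm` is hypothetical and is not part of fast-reid or PyTorch, and it assumes every SyncBN layer receives 4D (N, C, H, W) input so that `nn.BatchNorm2d` is a valid replacement. If the project's config exposes a norm-type setting, changing it there is probably simpler.

```python
import torch
from torch import nn


def revert_sync_batchnorm(module: nn.Module) -> nn.Module:
    """Recursively replace nn.SyncBatchNorm layers with nn.BatchNorm2d.

    Hypothetical helper. Assumes each SyncBN layer sees 4D (N, C, H, W) input.
    """
    if isinstance(module, nn.SyncBatchNorm):
        bn = nn.BatchNorm2d(
            module.num_features,
            eps=module.eps,
            momentum=module.momentum,
            affine=module.affine,
            track_running_stats=module.track_running_stats,
        )
        # Both layer types share the same state_dict keys, so this copies
        # weight, bias, and running statistics when they exist.
        bn.load_state_dict(module.state_dict())
        if module.affine:
            # Keep any frozen-parameter settings from the original layer.
            bn.weight.requires_grad = module.weight.requires_grad
            bn.bias.requires_grad = module.bias.requires_grad
        bn.train(module.training)
        return bn

    # Not a SyncBN layer: convert its children in place.
    for name, child in module.named_children():
        setattr(module, name, revert_sync_batchnorm(child))
    return module


# Toy model standing in for the real network (an assumption for illustration).
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.SyncBatchNorm(8), nn.ReLU())
model = revert_sync_batchnorm(model)

# Training-mode forward pass on a single process, no process group needed.
out = model(torch.randn(2, 3, 16, 16))
```

The conversion is applied once after the model is built and before training starts; an optimizer created earlier would still hold references to the old parameters, so build the optimizer afterwards.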

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 2 years ago

This issue was closed because it has been inactive for 14 days since being marked as stale.