facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group. #4673

Closed qomhmd closed 1 year ago

qomhmd commented 1 year ago

Of note: I've read https://github.com/facebookresearch/detectron2/issues/3972 but it didn't help much.

Instructions To Reproduce the Issue:

  1. Full runnable code or full changes you made:

I tried to train the DeepLabV3+ architecture with a customized config that uses ResNet18 (converted to .pkl from https://download.pytorch.org/models/resnet18-f37072fd.pth) as the backbone:

_BASE_: "detectron2/projects/DeepLab/configs/Cityscapes-SemanticSegmentation/deeplab_v3_plus_R_103_os16_mg124_poly_90k_bs16.yaml"
MODEL:
  WEIGHTS: "r18.pkl"
  BACKBONE:
    NAME: "build_resnet_backbone"
  RESNETS:
    DEPTH: 18
    RES2_OUT_CHANNELS: 64
    STEM_OUT_CHANNELS: 64
    RES5_DILATION: 1
    NUM_GROUPS: 1
  ROI_HEADS:
    NUM_CLASSES: 1

on the MoNuSeg 2020 dataset.

  2. What exact command you run:
    
    cfg = get_cfg()
    add_deeplab_config(cfg)
    cfg.merge_from_file('/kaggle/input/deeplab-v3-plus-models/deeplab_v3_plus_R_18_os16_mg124_poly_90k_bs16.yaml')
    cfg.DATASETS.TRAIN = ('monuseg_train',)
    cfg.DATASETS.TEST = ()
    cfg.DATALOADER.NUM_WORKERS = 2
    cfg.SOLVER.IMS_PER_BATCH = 16
    cfg.SOLVER.BASE_LR = 0.01
    cfg.SOLVER.MAX_ITER = 300
    cfg.SOLVER.LR_SCHEDULER_NAME = 'WarmupMultiStepLR'
    cfg.SOLVER.STEPS = []
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1

    os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    trainer.train()

I also tried the `Trainer` class from the DeepLab project's `train_net.py`, using the following lines:
```python
cfg.SOLVER.LR_SCHEDULER_NAME = 'WarmupPolyLR'
trainer = Trainer(cfg)
```

  3. Full logs or other relevant observations:

    SemanticSegmentor(
    (backbone): ResNet(
    (stem): BasicStem(
      (conv1): Conv2d(
        3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False
        (norm): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (res2): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(
          64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv2): Conv2d(
          64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): BasicBlock(
        (conv1): Conv2d(
          64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv2): Conv2d(
          64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (res3): Sequential(
      (0): BasicBlock(
        (shortcut): Conv2d(
          64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False
          (norm): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv1): Conv2d(
          64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv2): Conv2d(
          128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): BasicBlock(
        (conv1): Conv2d(
          128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv2): Conv2d(
          128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (res4): Sequential(
      (0): BasicBlock(
        (shortcut): Conv2d(
          128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False
          (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv1): Conv2d(
          128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv2): Conv2d(
          256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): BasicBlock(
        (conv1): Conv2d(
          256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv2): Conv2d(
          256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (res5): Sequential(
      (0): BasicBlock(
        (shortcut): Conv2d(
          256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False
          (norm): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv1): Conv2d(
          256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv2): Conv2d(
          512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): BasicBlock(
        (conv1): Conv2d(
          512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv2): Conv2d(
          512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    )
    (sem_seg_head): DeepLabV3PlusHead(
    (decoder): ModuleDict(
      (res2): ModuleDict(
        (project_conv): Conv2d(
          64, 48, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): SyncBatchNorm(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (fuse_conv): Sequential(
          (0): Conv2d(
            304, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
            (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
          (1): Conv2d(
            256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
            (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
        )
      )
      (res5): ModuleDict(
        (project_conv): ASPP(
          (convs): ModuleList(
            (0): Conv2d(
              512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
              (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            )
            (1): Conv2d(
              512, 256, kernel_size=(3, 3), stride=(1, 1), padding=(6, 6), dilation=(6, 6), bias=False
              (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            )
            (2): Conv2d(
              512, 256, kernel_size=(3, 3), stride=(1, 1), padding=(12, 12), dilation=(12, 12), bias=False
              (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            )
            (3): Conv2d(
              512, 256, kernel_size=(3, 3), stride=(1, 1), padding=(18, 18), dilation=(18, 18), bias=False
              (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            )
            (4): Sequential(
              (0): AvgPool2d(kernel_size=(16, 32), stride=1, padding=0)
              (1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
            )
          )
          (project): Conv2d(
            1280, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
            (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
        )
        (fuse_conv): None
      )
    )
    (predictor): Conv2d(256, 19, kernel_size=(1, 1), stride=(1, 1))
    (loss): DeepLabCE(
      (criterion): CrossEntropyLoss()
    )
    )
    )
    [11/20 16:16:07 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in training: [RandomCrop(crop_type='absolute', crop_size=[512, 1024]), ResizeShortestEdge(short_edge_length=(512, 768, 1024, 1280, 1536, 1792, 2048), max_size=4096, sample_style='choice'), RandomFlip()]
    [11/20 16:16:07 d2.data.build]: Using training sampler TrainingSampler
    [11/20 16:16:07 d2.data.common]: Serializing the dataset using: <class 'detectron2.data.common.NumpySerializedList'>
    [11/20 16:16:07 d2.data.common]: Serializing 37 elements to byte tensors and concatenating them all ...
    [11/20 16:16:07 d2.data.common]: Serialized dataset takes 0.01 MiB
    [11/20 16:16:12 d2.checkpoint.c2_model_loading]: Following weights matched with submodule backbone:
    | Names in Model    | Names in Checkpoint                                                               | Shapes                                    |
    |:------------------|:----------------------------------------------------------------------------------|:------------------------------------------|
    | res2.0.conv1.*    | res2.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,64,3,3)       |
    | res2.0.conv2.*    | res2.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,64,3,3)       |
    | res2.1.conv1.*    | res2.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,64,3,3)       |
    | res2.1.conv2.*    | res2.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,64,3,3)       |
    | res3.0.conv1.*    | res3.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,64,3,3)  |
    | res3.0.conv2.*    | res3.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,128,3,3) |
    | res3.0.shortcut.* | res3.0.shortcut.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight} | (128,) (128,) (128,) (128,) (128,64,1,1)  |
    | res3.1.conv1.*    | res3.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,128,3,3) |
    | res3.1.conv2.*    | res3.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,128,3,3) |
    | res4.0.conv1.*    | res4.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,128,3,3) |
    | res4.0.conv2.*    | res4.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,256,3,3) |
    | res4.0.shortcut.* | res4.0.shortcut.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight} | (256,) (256,) (256,) (256,) (256,128,1,1) |
    | res4.1.conv1.*    | res4.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,256,3,3) |
    | res4.1.conv2.*    | res4.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,256,3,3) |
    | res5.0.conv1.*    | res5.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,256,3,3) |
    | res5.0.conv2.*    | res5.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,512,3,3) |
    | res5.0.shortcut.* | res5.0.shortcut.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight} | (512,) (512,) (512,) (512,) (512,256,1,1) |
    | res5.1.conv1.*    | res5.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,512,3,3) |
    | res5.1.conv2.*    | res5.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,512,3,3) |
    | stem.conv1.*      | stem.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (64,) (64,) (64,) (64,) (64,3,7,7)        |
    [11/20 16:16:14 d2.engine.train_loop]: Starting training from iteration 0
    ERROR [11/20 16:16:22 d2.engine.train_loop]: Exception during training:
    Traceback (most recent call last):
    File "/opt/conda/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
    File "/opt/conda/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
    File "/opt/conda/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 274, in run_step
    loss_dict = self.model(data)
    File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
    File "/opt/conda/lib/python3.7/site-packages/detectron2/modeling/meta_arch/semantic_seg.py", line 108, in forward
    features = self.backbone(images.tensor)
    File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
    File "/opt/conda/lib/python3.7/site-packages/detectron2/modeling/backbone/resnet.py", line 445, in forward
    x = self.stem(x)
    File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
    File "/opt/conda/lib/python3.7/site-packages/detectron2/modeling/backbone/resnet.py", line 356, in forward
    x = self.conv1(x)
    File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
    File "/opt/conda/lib/python3.7/site-packages/detectron2/layers/wrappers.py", line 117, in forward
    x = self.norm(x)
    File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
    File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 731, in forward
    world_size = torch.distributed.get_world_size(process_group)
    File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 867, in get_world_size
    return _get_group_size(group)
    File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 325, in _get_group_size
    default_pg = _get_default_group()
    File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 430, in _get_default_group
    "Default process group has not been initialized, "
    RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
    [11/20 16:16:22 d2.engine.hooks]: Total training time: 0:00:08 (0:00:00 on hooks)
    [11/20 16:16:22 d2.utils.events]:  iter: 0    lr: N/A  max_mem: 7348M
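
The traceback above ends in `torch/nn/modules/batchnorm.py`, inside a norm layer that asks `torch.distributed` for the world size, and the printed model lists every `norm` as `SyncBatchNorm`. As a minimal sketch for checking which norm layers a config produced, the printout can be scanned with a regular expression. The helper name `count_norm_layers` and the pattern are hypothetical and not part of detectron2; the sample string is a fragment copied from the log above.

```python
import re
from collections import Counter

# Matches "(norm): SomeNormClass(" entries in a printed torch.nn.Module.
# The pattern is an assumption based on the printout format shown above.
NORM_PATTERN = re.compile(r"\(norm\):\s*(\w+)\(")


def count_norm_layers(model_repr: str) -> Counter:
    """Count normalization layer class names found in a model printout."""
    return Counter(NORM_PATTERN.findall(model_repr))


# Fragment of the model printout from the log above.
sample = """
(conv1): Conv2d(
  3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False
  (norm): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
"""

print(count_norm_layers(sample))
```

Passing `str(trainer.model)` instead of `sample` would give the counts for the full model; any `SyncBatchNorm` entries indicate layers that need an initialized process group during training.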

Environment:

----------------------  -------------------------------------------------------------------------------
sys.platform            linux
Python                  3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) [GCC 9.4.0]
numpy                   1.21.6
detectron2              0.6 @/opt/conda/lib/python3.7/site-packages/detectron2
Compiler                GCC 9.4
CUDA compiler           CUDA 11.0
detectron2 arch flags   7.5
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.11.0 @/opt/conda/lib/python3.7/site-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0,1                 Tesla T4 (arch=7.5)
Driver version          470.82.01
CUDA_HOME               /usr/local/cuda
Pillow                  9.1.1
torchvision             0.12.0 @/opt/conda/lib/python3.7/site-packages/torchvision
torchvision arch flags  3.7, 6.0, 7.0, 7.5
fvcore                  0.1.5.post20220512
iopath                  0.1.9
cv2                     4.5.4
----------------------  -------------------------------------------------------------------------------
PyTorch built with:
  - GCC 9.4
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2022.1-Product Build 20220311 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.0
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_70,code=compute_70;-gencode;arch=compute_75,code=compute_75
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.0, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

Testing NCCL connectivity ... this should not hang.
NCCL succeeded.
qomhmd commented 1 year ago

I've read https://github.com/facebookresearch/detectron2/issues/3972 but it didn't help much.

ppwwyyxx commented 1 year ago

Closing as duplicate of #3972. If the answer there is not clear to you, ask for clarification there instead.
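
For readers landing here, a minimal sketch of one possible workaround for single-process training, assuming the error comes from a `SyncBN` norm setting inherited from the DeepLab base config (the printed model above shows `SyncBatchNorm` layers in both the backbone and the head). The helper `norm_overrides` is hypothetical. `MODEL.RESNETS.NORM` and `MODEL.SEM_SEG_HEAD.NORM` are the detectron2 config keys it targets, and the returned list is meant to be passed to `cfg.merge_from_list`.

```python
from typing import List

# Config keys that select the normalization layer. Assumption: these are the
# keys set to "SyncBN" by the base config used in this issue.
NORM_KEYS = ("MODEL.RESNETS.NORM", "MODEL.SEM_SEG_HEAD.NORM")


def norm_overrides(num_processes: int) -> List[str]:
    """Return alternating key/value entries for cfg.merge_from_list.

    SyncBN needs an initialized process group, so fall back to plain "BN"
    when training runs in a single process; otherwise change nothing.
    """
    if num_processes > 1:
        return []
    overrides: List[str] = []
    for key in NORM_KEYS:
        overrides.extend([key, "BN"])
    return overrides


print(norm_overrides(1))
```

With the config from the issue, this would be applied as `cfg.merge_from_list(norm_overrides(1))` before constructing the trainer. Keeping `SyncBN` is the other option, but that requires starting training through a distributed launcher so that the process group exists.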