Of note: I've read https://github.com/facebookresearch/detectron2/issues/3972 but it didn't help much.
Instructions To Reproduce the Issue:

I tried to train the DeepLabV3+ architecture on the MoNuSeg 2020 dataset, using a customized config with ResNet18 (converted to `.pkl` from https://download.pytorch.org/models/resnet18-f37072fd.pth) as the backbone:

```yaml
_BASE_: "detectron2/projects/DeepLab/configs/Cityscapes-SemanticSegmentation/deeplab_v3_plus_R_103_os16_mg124_poly_90k_bs16.yaml"
MODEL:
  WEIGHTS: "r18.pkl"
  BACKBONE:
    NAME: "build_resnet_backbone"
  RESNETS:
    DEPTH: 18
    RES2_OUT_CHANNELS: 64
    STEM_OUT_CHANNELS: 64
    RES5_DILATION: 1
    NUM_GROUPS: 1
  ROI_HEADS:
    NUM_CLASSES: 1
```
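For reference, the `.pth` → `.pkl` conversion follows the renaming logic of detectron2's `tools/convert-torchvision-to-d2.py`; a minimal sketch (file names here are placeholders):

```python
# Sketch of detectron2's tools/convert-torchvision-to-d2.py applied to ResNet-18.
import pickle

import torch

obj = torch.load("resnet18-f37072fd.pth", map_location="cpu")

newmodel = {}
for k in list(obj.keys()):
    old_k = k
    # torchvision's "layer1..4" become detectron2's "res2..5",
    # and each "bnN" folds into the matching "convN.norm".
    if "layer" not in k:
        k = "stem." + k
    for t in [1, 2, 3, 4]:
        k = k.replace(f"layer{t}", f"res{t + 1}")
    for t in [1, 2, 3]:
        k = k.replace(f"bn{t}", f"conv{t}.norm")
    k = k.replace("downsample.0", "shortcut")
    k = k.replace("downsample.1", "shortcut.norm")
    newmodel[k] = obj.pop(old_k).detach().numpy()

res = {"model": newmodel, "__author__": "torchvision", "matching_heuristics": True}
with open("r18.pkl", "wb") as f:
    pickle.dump(res, f)
```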
Training code:

```python
import os

from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer
from detectron2.projects.deeplab import add_deeplab_config

cfg = get_cfg()
add_deeplab_config(cfg)
cfg.merge_from_file('/kaggle/input/deeplab-v3-plus-models/deeplab_v3_plus_R_18_os16_mg124_poly_90k_bs16.yaml')
cfg.DATASETS.TRAIN = ('monuseg_train',)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 2
cfg.SOLVER.IMS_PER_BATCH = 16
cfg.SOLVER.BASE_LR = 0.01
cfg.SOLVER.MAX_ITER = 300
cfg.SOLVER.LR_SCHEDULER_NAME = 'WarmupMultiStepLR'
cfg.SOLVER.STEPS = []
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```
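`monuseg_train` is registered beforehand; a simplified sketch of that registration (directory layout, class list, and loader details are illustrative assumptions, not the exact code):

```python
import glob
import os

import cv2
from detectron2.data import DatasetCatalog, MetadataCatalog

def get_monuseg_dicts(image_dir="monuseg/images", mask_dir="monuseg/masks"):
    # Each record needs "file_name", "sem_seg_file_name", "height", and
    # "width" for detectron2's semantic-segmentation DatasetMapper.
    records = []
    for image_path in sorted(glob.glob(os.path.join(image_dir, "*.png"))):
        mask_path = os.path.join(mask_dir, os.path.basename(image_path))
        h, w = cv2.imread(image_path).shape[:2]
        records.append({
            "file_name": image_path,
            "sem_seg_file_name": mask_path,
            "height": h,
            "width": w,
        })
    return records

DatasetCatalog.register("monuseg_train", get_monuseg_dicts)
MetadataCatalog.get("monuseg_train").set(
    stuff_classes=["nucleus"],
    ignore_label=255,
    evaluator_type="sem_seg",
)
```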
I also tried the `Trainer` class from the DeepLab project's `train_net.py`, using the following lines:

```python
cfg.SOLVER.LR_SCHEDULER_NAME = 'WarmupPolyLR'
trainer = Trainer(cfg)
```
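For context, that `Trainer` subclasses `DefaultTrainer` and (among other overrides) swaps in the DeepLab scheduler factory; roughly, simplified from the project's `train_net.py`:

```python
from detectron2.engine import DefaultTrainer
from detectron2.projects.deeplab import build_lr_scheduler

class Trainer(DefaultTrainer):
    @classmethod
    def build_lr_scheduler(cls, cfg, optimizer):
        # DeepLab's factory understands cfg.SOLVER.LR_SCHEDULER_NAME == "WarmupPolyLR".
        return build_lr_scheduler(cfg, optimizer)
```

The full logs of the failing run follow.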
```
SemanticSegmentor(
  (backbone): ResNet(
    (stem): BasicStem(
      (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False (norm): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
    )
    (res2): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
      )
      (1): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
      )
    )
    (res3): Sequential(
      (0): BasicBlock(
        (shortcut): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
        (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False (norm): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
        (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
      )
      (1): BasicBlock(
        (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
        (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
      )
    )
    (res4): Sequential(
      (0): BasicBlock(
        (shortcut): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
        (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
        (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
      )
      (1): BasicBlock(
        (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
        (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
      )
    )
    (res5): Sequential(
      (0): BasicBlock(
        (shortcut): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
        (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False (norm): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
      )
      (1): BasicBlock(
        (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
        (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
      )
    )
  )
  (sem_seg_head): DeepLabV3PlusHead(
    (decoder): ModuleDict(
      (res2): ModuleDict(
        (project_conv): Conv2d(64, 48, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): SyncBatchNorm(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
        (fuse_conv): Sequential(
          (0): Conv2d(304, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
          (1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
        )
      )
      (res5): ModuleDict(
        (project_conv): ASPP(
          (convs): ModuleList(
            (0): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
            (1): Conv2d(512, 256, kernel_size=(3, 3), stride=(1, 1), padding=(6, 6), dilation=(6, 6), bias=False (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
            (2): Conv2d(512, 256, kernel_size=(3, 3), stride=(1, 1), padding=(12, 12), dilation=(12, 12), bias=False (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
            (3): Conv2d(512, 256, kernel_size=(3, 3), stride=(1, 1), padding=(18, 18), dilation=(18, 18), bias=False (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
            (4): Sequential(
              (0): AvgPool2d(kernel_size=(16, 32), stride=1, padding=0)
              (1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
            )
          )
          (project): Conv2d(1280, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
        )
        (fuse_conv): None
      )
    )
    (predictor): Conv2d(256, 19, kernel_size=(1, 1), stride=(1, 1))
    (loss): DeepLabCE(
      (criterion): CrossEntropyLoss()
    )
  )
)
[11/20 16:16:07 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in training: [RandomCrop(crop_type='absolute', crop_size=[512, 1024]), ResizeShortestEdge(short_edge_length=(512, 768, 1024, 1280, 1536, 1792, 2048), max_size=4096, sample_style='choice'), RandomFlip()]
[11/20 16:16:07 d2.data.build]: Using training sampler TrainingSampler
[11/20 16:16:07 d2.data.common]: Serializing the dataset using: <class 'detectron2.data.common.NumpySerializedList'>
[11/20 16:16:07 d2.data.common]: Serializing 37 elements to byte tensors and concatenating them all ...
[11/20 16:16:07 d2.data.common]: Serialized dataset takes 0.01 MiB
[11/20 16:16:12 d2.checkpoint.c2_model_loading]: Following weights matched with submodule backbone:
| Names in Model    | Names in Checkpoint                                                                | Shapes                                     |
|:------------------|:-----------------------------------------------------------------------------------|:-------------------------------------------|
| res2.0.conv1.*    | res2.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (64,) (64,) (64,) (64,) (64,64,3,3)        |
| res2.0.conv2.*    | res2.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (64,) (64,) (64,) (64,) (64,64,3,3)        |
| res2.1.conv1.*    | res2.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (64,) (64,) (64,) (64,) (64,64,3,3)        |
| res2.1.conv2.*    | res2.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (64,) (64,) (64,) (64,) (64,64,3,3)        |
| res3.0.conv1.*    | res3.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (128,) (128,) (128,) (128,) (128,64,3,3)   |
| res3.0.conv2.*    | res3.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (128,) (128,) (128,) (128,) (128,128,3,3)  |
| res3.0.shortcut.* | res3.0.shortcut.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}   | (128,) (128,) (128,) (128,) (128,64,1,1)   |
| res3.1.conv1.*    | res3.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (128,) (128,) (128,) (128,) (128,128,3,3)  |
| res3.1.conv2.*    | res3.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (128,) (128,) (128,) (128,) (128,128,3,3)  |
| res4.0.conv1.*    | res4.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (256,) (256,) (256,) (256,) (256,128,3,3)  |
| res4.0.conv2.*    | res4.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (256,) (256,) (256,) (256,) (256,256,3,3)  |
| res4.0.shortcut.* | res4.0.shortcut.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}   | (256,) (256,) (256,) (256,) (256,128,1,1)  |
| res4.1.conv1.*    | res4.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (256,) (256,) (256,) (256,) (256,256,3,3)  |
| res4.1.conv2.*    | res4.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (256,) (256,) (256,) (256,) (256,256,3,3)  |
| res5.0.conv1.*    | res5.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (512,) (512,) (512,) (512,) (512,256,3,3)  |
| res5.0.conv2.*    | res5.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (512,) (512,) (512,) (512,) (512,512,3,3)  |
| res5.0.shortcut.* | res5.0.shortcut.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}   | (512,) (512,) (512,) (512,) (512,256,1,1)  |
| res5.1.conv1.*    | res5.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (512,) (512,) (512,) (512,) (512,512,3,3)  |
| res5.1.conv2.*    | res5.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (512,) (512,) (512,) (512,) (512,512,3,3)  |
| stem.conv1.*      | stem.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}        | (64,) (64,) (64,) (64,) (64,3,7,7)         |
[11/20 16:16:14 d2.engine.train_loop]: Starting training from iteration 0
ERROR [11/20 16:16:22 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/opt/conda/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/opt/conda/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 274, in run_step
    loss_dict = self.model(data)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/detectron2/modeling/meta_arch/semantic_seg.py", line 108, in forward
    features = self.backbone(images.tensor)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/detectron2/modeling/backbone/resnet.py", line 445, in forward
    x = self.stem(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/detectron2/modeling/backbone/resnet.py", line 356, in forward
    x = self.conv1(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/detectron2/layers/wrappers.py", line 117, in forward
    x = self.norm(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 731, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 867, in get_world_size
    return _get_group_size(group)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 325, in _get_group_size
    default_pg = _get_default_group()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 430, in _get_default_group
    "Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
[11/20 16:16:22 d2.engine.hooks]: Total training time: 0:00:08 (0:00:00 on hooks)
[11/20 16:16:22 d2.utils.events]: iter: 0 lr: N/A max_mem: 7348M
```
Environment:

GPU T4 x2:
```
----------------------  -------------------------------------------------------------------------------
sys.platform            linux
Python                  3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) [GCC 9.4.0]
numpy                   1.21.6
detectron2              0.6 @/opt/conda/lib/python3.7/site-packages/detectron2
Compiler                GCC 9.4
CUDA compiler           CUDA 11.0
detectron2 arch flags   7.5
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.11.0 @/opt/conda/lib/python3.7/site-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0,1                 Tesla T4 (arch=7.5)
Driver version          470.82.01
CUDA_HOME               /usr/local/cuda
Pillow                  9.1.1
torchvision             0.12.0 @/opt/conda/lib/python3.7/site-packages/torchvision
torchvision arch flags  3.7, 6.0, 7.0, 7.5
fvcore                  0.1.5.post20220512
iopath                  0.1.9
cv2                     4.5.4
----------------------  -------------------------------------------------------------------------------
PyTorch built with:
  - GCC 9.4
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2022.1-Product Build 20220311 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.0
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_70,code=compute_70;-gencode;arch=compute_75,code=compute_75
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.0, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
Testing NCCL connectivity ... this should not hang.
NCCL succeeded.
```

GPU P100:
```
----------------------  ----------------------------------------------------------------
sys.platform            linux
Python                  3.7.15 (default, Oct 12 2022, 19:14:55) [GCC 7.5.0]
numpy                   1.21.6
detectron2              0.6 @/usr/local/lib/python3.7/dist-packages/detectron2
Compiler                GCC 7.5
CUDA compiler           CUDA 11.2
detectron2 arch flags   7.5
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.12.1+cu113 @/usr/local/lib/python3.7/dist-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0                   Tesla T4 (arch=7.5)
Driver version          460.32.03
CUDA_HOME               /usr/local/cuda
Pillow                  7.1.2
torchvision             0.13.1+cu113 @/usr/local/lib/python3.7/dist-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                  0.1.5.post20220512
iopath                  0.1.9
cv2                     4.6.0
----------------------  ----------------------------------------------------------------
PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.3.2 (built against CUDA 11.5)
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
```
Closing as duplicate of #3972. If the answer there is not clear to you, ask for clarification there instead.