NVIDIA / semantic-segmentation

Nvidia Semantic Segmentation monorepo
BSD 3-Clause "New" or "Revised" License

ZeroDivisionError: float division by zero, while training on custom data #111

Closed SupriyaB1 closed 3 years ago

SupriyaB1 commented 3 years ago

Hi, I am training the HRNet model from scratch on a custom dataset, which I have loaded in Cityscapes format. I am training on Ubuntu 18.04 (NC6).

Command : python3 -m torch.distributed.launch --nproc_per_node=1 /home/HRNet/semantic-segmentation/train.py --dataset cityscapes --cv 0 --bs_trn 1 --poly_exp 2 --lr 1e-2 --max_epoch 175 --max_cu_epoch 150 --n_scales "0.5,1.0,2.0" --supervised_mscale_loss_wt 0.05 --arch ocrnet.HRNet_Mscale --result_dir ./Save --custom_coarse_prob 0.5

None
Global Rank: 0 Local Rank: 0
Torch version: 1.7, 1.7.0+cu101
n scales [0.5, 1.0, 2.0]
dataset = cityscapes
ignore_label = 255
num_classes = 1
cv split val 0 ['val/frankfurt']
mode val found 1 images
cn num_classes 1
cv split train 0 ['train/aachen']
mode train found 4 images
cn num_classes 1
Loading centroid file /home//HRNet/semantic-segmentation/large_data/uniform_centroids/cityscapes_cv0_tile1024.json
Found 1 centroids
Class Uniform Percentage: 0.5
Class Uniform items per Epoch: 4
cls 0 len 4
Using Cross Entropy Loss
Warning: using Python fallback for SyncBatchNorm, possibly because apex was installed without --cuda_ext. The exception raised when attempting to import the cuda backend was: No module named 'syncbn'
=> init weights from normal distribution
=> loading pretrained model /home/HRNet/semantic-segmentation/large_data/seg_weights/hrnetv2_w48_imagenet_pretrained.pth
Trunk: hrnetv2
Model params = 72.1M
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
Class Uniform Percentage: 0.5
Class Uniform items per Epoch: 4
cls 0 len 4
Traceback (most recent call last):
  File "/home/HRNet/semantic-segmentation/train.py", line 601, in <module>
    main()
  File "/home/HRNet/semantic-segmentation/train.py", line 451, in main
    train(train_loader, net, optim, epoch)
  File "/home/HRNet/semantic-segmentation/train.py", line 491, in train
    main_loss = net(inputs)
  File "/home/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/.local/lib/python3.6/site-packages/apex/parallel/distributed.py", line 560, in forward
    result = self.module(*inputs, **kwargs)
  File "/home/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/HRNet/semantic-segmentation/network/ocrnet.py", line 334, in forward
    return self.two_scale_forward(inputs)
  File "/home/HRNet/semantic-segmentation/network/ocrnet.py", line 277, in two_scale_forward
    lo_outs = self._fwd(x_lo)
  File "/home/HRNet/semantic-segmentation/network/ocrnet.py", line 174, in _fwd
    cls_out, aux_out, ocr_mid_feats = self.ocr(high_level_features)
  File "/home/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/HRNet/semantic-segmentation/network/ocrnet.py", line 89, in forward
    ocr_feats = self.ocr_distri_head(feats, context)
  File "/home/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/HRNet/semantic-segmentation/network/ocr_utils.py", line 150, in forward
    context = self.object_context_block(feats, proxy_feats)
  File "/home/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/HRNet/semantic-segmentation/network/ocr_utils.py", line 102, in forward
    key = self.f_object(proxy).view(batch_size, self.key_channels, -1)
  File "/home/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/.local/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/.local/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/.local/lib/python3.6/site-packages/apex/parallel/sync_batchnorm.py", line 130, in forward
    (m-1) * self.momentum * var + \
ZeroDivisionError: float division by zero
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', '/home/HRNet/semantic-segmentation/train.py', '--local_rank=0', '--dataset', 'cityscapes', '--cv', '0', '--bs_trn', '1', '--poly_exp', '2', '--lr', '1e-2', '--max_epoch', '175', '--max_cu_epoch', '150', '--n_scales', '0.5,1.0,2.0', '--supervised_mscale_loss_wt', '0.05', '--arch', 'ocrnet.HRNet_Mscale', '--result_dir', './TrainT', '--custom_coarse_prob', '0.5']' returned non-zero exit status 1.

Can you please help me solve this error? Thank you.

karansapra commented 3 years ago

Are you running with batch size 1 on 1 GPU?

SupriyaB1 commented 3 years ago

Yes @karansapra

karansapra commented 3 years ago

You can't use batch norm (BN) with batch size = 1.

karansapra commented 3 years ago

Please increase the batch size to 2 (--bs_trn 2) or use multiple GPUs. Hope that helps!
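
For readers hitting the same error, a minimal pure-Python sketch of why a single value per channel breaks the running-variance update may help. It assumes the statement cut off in the last traceback frame (apex sync_batchnorm.py, line 130) applies an unbiased correction of m / (m - 1), where m is the number of values per channel seen by the batch norm layer. The function name `update_running_var` and its default arguments are hypothetical, chosen for illustration; this is not apex's API.

```python
def update_running_var(values, running_var=1.0, momentum=0.1):
    """Hypothetical stand-in for a batch norm running-variance update.

    `values` holds every activation of ONE channel, gathered over the
    batch and spatial dimensions (and over all GPUs for synced BN).
    """
    m = len(values)  # values per channel; 1 if batch=1 and the map is 1x1
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / m  # biased variance

    # Assumed unbiased correction m / (m - 1): the divisor is 0 when m == 1.
    return float(m) / (m - 1) * momentum * var + (1 - momentum) * running_var


if __name__ == "__main__":
    # Two values per channel: the update is well defined.
    print(update_running_var([1.0, 3.0]))

    # One value per channel: reproduces the error class from the issue.
    try:
        update_running_var([0.3])
    except ZeroDivisionError as exc:
        print("ZeroDivisionError:", exc)
```

With two or more values per channel the correction is finite, which is consistent with the advice above to raise --bs_trn to 2 or to spread the batch over several GPUs.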