NVIDIA / semantic-segmentation

Nvidia Semantic Segmentation monorepo
BSD 3-Clause "New" or "Revised" License

IndexError: tuple index out of range while running scripts/train_cityscapes.yml (cached_x.grad_fn.next_functions[1][0].variable) #169

Open doulemint opened 2 years ago

doulemint commented 2 years ago

None None
Global Rank: 0 Local Rank: 0
Global Rank: 1 Local Rank: 1
Torch version: 1.1, 1.10.0+cu102
n scales [0.5, 1.0, 2.0]
dataset = cityscapes
ignore_label = 255
num_classes = 19
cv split val 0 ['val/lindau', 'val/frankfurt', 'val/munster']
mode val found 500 images
cn num_classes 19
cv split train 0 ['train/aachen', 'train/bochum', 'train/bremen', 'train/cologne', 'train/darmstadt', 'train/dusseldorf', 'train/erfurt', 'train/hamburg', 'train/hanover', 'train/jena', 'train/krefeld', 'train/monchengladbach', 'train/strasbourg', 'train/stuttgart', 'train/tubingen', 'train/ulm', 'train/weimar', 'train/zurich']
mode train found 2975 images
cn num_classes 19
Loading centroid file /home/Xiya/semantic-segmentation/assets/uniform_centroids/cityscapes_cv0_tile1024.json
Found 19 centroids
Class Uniform Percentage: 0.5
Class Uniform items per Epoch: 2975
cls 0 len 5866 cls 1 len 5184 cls 2 len 5678 cls 3 len 1312 cls 4 len 1723 cls 5 len 5656 cls 6 len 2769 cls 7 len 4860 cls 8 len 5388 cls 9 len 2440 cls 10 len 4722 cls 11 len 3719 cls 12 len 1239 cls 13 len 5075 cls 14 len 444 cls 15 len 348 cls 16 len 188 cls 17 len 575 cls 18 len 2238
Using Cross Entropy Loss
Loading weights from: checkpoint=/home/Xiya/semantic-segmentation/assets//seg_weights/ocrnet.HRNet_industrious-chicken.pth
=> init weights from normal distribution
=> loading pretrained model /home/Xiya/semantic-segmentation/assets/seg_weights/hrnetv2_w48_imagenet_pretrained.pth
Trunk: hrnetv2
Model params = 72.1M
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Skipped loading parameter module.ocr.cls_head.weight
Skipped loading parameter module.ocr.cls_head.bias
Skipped loading parameter module.ocr.aux_head.2.weight
Skipped loading parameter module.ocr.aux_head.2.bias
Skipped loading parameter module.scale_attn.conv0.weight
Skipped loading parameter module.scale_attn.bn0.weight
Skipped loading parameter module.scale_attn.bn0.bias
Skipped loading parameter module.scale_attn.bn0.running_mean
Skipped loading parameter module.scale_attn.bn0.running_var
Skipped loading parameter module.scale_attn.bn0.num_batches_tracked
Skipped loading parameter module.scale_attn.conv1.weight
Skipped loading parameter module.scale_attn.bn1.weight
Skipped loading parameter module.scale_attn.bn1.bias
Skipped loading parameter module.scale_attn.bn1.running_mean
Skipped loading parameter module.scale_attn.bn1.running_var
Skipped loading parameter module.scale_attn.bn1.num_batches_tracked
Skipped loading parameter module.scale_attn.conv2.weight
Class Uniform Percentage: 0.5
Class Uniform items per Epoch: 2975
cls 0 len 5866 cls 1 len 5184 cls 2 len 5678 cls 3 len 1312 cls 4 len 1723 cls 5 len 5656 cls 6 len 2769 cls 7 len 4860 cls 8 len 5388 cls 9 len 2440 cls 10 len 4722 cls 11 len 3719 cls 12 len 1239 cls 13 len 5075 cls 14 len 444 cls 15 len 348 cls 16 len 188 cls 17 len 575 cls 18 len 2238
/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

  warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
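(Side note on the two launcher warnings above: torchrun sets the rank in environment variables instead of passing --local_rank, and OMP_NUM_THREADS can be pinned explicitly. A minimal sketch of how a training script could read these; this is not code from this repo:)

```python
import argparse
import os

# torchrun (and torch.distributed.launch with --use_env) exports LOCAL_RANK,
# RANK and WORLD_SIZE, so the script can read the rank from the environment
# instead of requiring a --local_rank argument.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int,
                    default=int(os.environ.get("LOCAL_RANK", 0)))
args = parser.parse_args()

# Pin the OpenMP thread count explicitly to silence the launcher warning.
os.environ.setdefault("OMP_NUM_THREADS", "1")
```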


/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/nn/functional.py:3679: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
  warnings.warn(

Both ranks then print the same traceback:

Traceback (most recent call last):
  File "train.py", line 601, in <module>
    main()
  File "train.py", line 451, in main
    train(train_loader, net, optim, epoch)
  File "train.py", line 491, in train
    main_loss = net(inputs)
  File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/apex/parallel/distributed.py", line 560, in forward
    result = self.module(*inputs, **kwargs)
  File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/Xiya/semantic-segmentation/network/ocrnet.py", line 334, in forward
    return self.two_scale_forward(inputs)
  File "/home/Xiya/semantic-segmentation/network/ocrnet.py", line 284, in two_scale_forward
    hi_outs = self._fwd(x_1x)
  File "/home/Xiya/semantic-segmentation/network/ocrnet.py", line 173, in _fwd
    _, _, high_level_features = self.backbone(x)
  File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/Xiya/semantic-segmentation/network/hrnetv2.py", line 400, in forward
    x = self.conv1(x_in)
  File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 446, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 442, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
  File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/apex/amp/wrap.py", line 21, in wrapper
    args[i] = utils.cached_cast(cast_fn, args[i], handle.cache)
  File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/apex/amp/utils.py", line 97, in cached_cast
    if cached_x.grad_fn.next_functions[1][0].variable is not x:
IndexError: tuple index out of range

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 7285) of binary: /home/Xiya/anaconda/envs/py_seg/bin/python
Traceback (most recent call last):
  File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/Xiya/anaconda/envs/py_seg/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
  time       : 2021-10-23_07:58:29
  host       : ivslab2
  rank       : 1 (local_rank: 1)
  exitcode   : -11 (pid: 7286)
  error_file : <N/A>
  traceback  : Signal 11 (SIGSEGV) received by PID 7286

Root Cause (first observed failure):
[0]:
  time       : 2021-10-23_07:58:29
  host       : ivslab2
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 7285)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
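(For what it's worth, the "To enable traceback" link above refers to torch.distributed.elastic error propagation: decorating the entry point with record makes the failing worker's traceback appear in that summary instead of <N/A>. A minimal sketch, not the repo's actual train.py:)

```python
from torch.distributed.elastic.multiprocessing.errors import record


@record      # records and re-raises the worker's exception so the elastic
def main():  # launcher's failure summary shows a real traceback
    ...      # existing training entry point would go here


if __name__ == "__main__":
    main()
```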

CUDA version: 10.2. apex: installed with cuda_ext enabled; the dataset loads fine. How can I fix this? Has anyone else met this problem?
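A note for others who hit the same IndexError: the failing check is in apex/amp/utils.py's cached_cast, visible at the bottom of the traceback above. A workaround that has been reported for newer PyTorch builds is to guard that lookup, either by editing utils.py so the next_functions index is only read when it exists, or by wrapping the call from your own code. The sketch below shows the second option; it is untested and not an official apex or repo fix.

```python
# Untested sketch: wrap apex's cached_cast so an IndexError from the
# grad_fn.next_functions parent check falls back to an uncached cast.
import apex.amp.utils as amp_utils

_orig_cached_cast = amp_utils.cached_cast


def _patched_cached_cast(cast_fn, x, cache):
    try:
        return _orig_cached_cast(cast_fn, x, cache)
    except IndexError:
        # next_functions had fewer entries than apex expects; skip the cache.
        return cast_fn(x)


amp_utils.cached_cast = _patched_cached_cast  # apply before training starts
```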

AxMM commented 2 years ago

Hello, I have never met this problem, but it looks like you are using Python 3.8. You should try with pytorch=1.3.0 and python=3.6.
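For what it's worth, a quick way to confirm which interpreter and PyTorch build the launcher actually picks up (a generic sanity-check snippet, not from this repo):

```python
import sys

import torch

# Print the versions the active environment really provides before retrying
# with the older stack (python=3.6 / pytorch=1.3.0) suggested above.
print("python :", sys.version.split()[0])
print("torch  :", torch.__version__)
print("cuda   :", torch.version.cuda)
print("cudnn  :", torch.backends.cudnn.version())
```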

doulemint commented 2 years ago

@AxMM Big thanks for replying to my question!! I switched to Python 3.6 but it still doesn't work. Instead, I turned off apex to see where the real problem is, and it turned out to be a tensor-type inconsistency error. Have you ever met this problem? If I don't use fp16 (setting its option to false), my code runs but keeps returning a negative loss. If I use fp16, it keeps reporting that the inputs are half tensors while the net's weights are float tensors.

I tried changing the input tensor type to float, but then the code seems to expect a half tensor somewhere else. Thank you in advance for extending a hand.
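In case it helps anyone reading along: since apex.amp is the part that breaks here, one alternative I have been considering is PyTorch's native torch.cuda.amp, which casts activations inside autocast so the half-versus-float mismatch does not come up. This is only a rough sketch of a generic training loop (net, optim, criterion and train_loader stand in for whatever the repo's train.py builds), not this repo's actual code:

```python
import torch


def train_one_epoch(net, optim, criterion, train_loader, device="cuda"):
    """Minimal native-AMP loop; the arguments are placeholders, not the repo's objects."""
    scaler = torch.cuda.amp.GradScaler()
    net.train()
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optim.zero_grad()
        # Inside autocast each op runs in fp16 or fp32 as appropriate, so no
        # manual .half()/.float() casting of inputs or weights is needed.
        with torch.cuda.amp.autocast():
            outputs = net(inputs)
            loss = criterion(outputs, targets)
        scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
        scaler.step(optim)
        scaler.update()
```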

linzhiqiu commented 2 years ago

> @AxMM Big thanks for replying to my question!! I switched to Python 3.6 but it still doesn't work. [...] If I use fp16, it keeps reporting that the inputs are half tensors while the net's weights are float tensors.

Were you able to solve this issue? I am encountering the same problem...

hamzagorgulu commented 10 months ago

Having the same issue. Has anyone been able to solve it?