NVIDIA / semantic-segmentation

Nvidia Semantic Segmentation monorepo
BSD 3-Clause "New" or "Revised" License

RuntimeError: Given groups=1, weight of size [128, 256, 1, 1], expected input[2, 576, 128, 256] to have 256 channels, but got 576 channels instead #151

Open

liuqinglong110 commented 2 years ago

I used `train_mobilev3small.yml` for training, but I keep getting the error below.

The `train_mobilev3small.yml` is from https://github.com/ekzhang/fastseg:

```shell
CUDA_VISIBLE_DEVICES=2,3 python3 -m runx.runx scripts/train_mobilev3small.yml -i
```


```
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

None None
Global Rank: 0 Local Rank: 0
Global Rank: 1 Local Rank: 1
Torch version: 1.6, 1.6.0+cu101
dataset = cityscapes
ignore_label = 255
num_classes = 19
cv split val 2 ['train/monchengladbach', 'train/strasbourg', 'train/stuttgart']
mode val found 655 images
cn num_classes 19
cv split train 2 ['val/lindau', 'val/munster', 'val/frankfurt', 'train/aachen', 'train/bochum', 'train/bremen', 'train/cologne', 'train/darmstadt', 'train/dusseldorf', 'train/erfurt', 'train/hamburg', 'train/hanover', 'train/jena', 'train/krefeld', 'train/tubingen', 'train/ulm', 'train/weimar', 'train/zurich']
mode train found 2820 images
cn num_classes 19
Loading centroid file /app/uniform_centroids/cityscapes_cv2_tile1024.json
Found 19 centroids
Class Uniform Percentage: 0.5
Class Uniform items per Epoch: 2820
cls 0 len 5541   cls 1 len 4897   cls 2 len 5357   cls 3 len 1268   cls 4 len 1537
cls 5 len 5398   cls 6 len 2703   cls 7 len 4610   cls 8 len 5185   cls 9 len 2407
cls 10 len 4436  cls 11 len 3530  cls 12 len 1329  cls 13 len 4864  cls 14 len 415
cls 15 len 398   cls 16 len 183   cls 17 len 551   cls 18 len 2272
Using Cross Entropy Loss
Trunk: mobilenetv3_small
Model params = 1.1M
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.
```

```
Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Class Uniform Percentage: 0.5
Class Uniform items per Epoch: 2820
cls 0 len 5541   cls 1 len 4897   cls 2 len 5357   cls 3 len 1268   cls 4 len 1537
cls 5 len 5398   cls 6 len 2703   cls 7 len 4610   cls 8 len 5185   cls 9 len 2407
cls 10 len 4436  cls 11 len 3530  cls 12 len 1329  cls 13 len 4864  cls 14 len 415
cls 15 len 398   cls 16 len 183   cls 17 len 551   cls 18 len 2272
Traceback (most recent call last):
  File "train.py", line 601, in <module>
    main()
  File "train.py", line 451, in main
    train(train_loader, net, optim, epoch)
  File "train.py", line 491, in train
    main_loss = net(inputs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/parallel/distributed.py", line 560, in forward
    result = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/app/code/semantic-segmentation/network/lraspp.py", line 93, in forward
    aspp = self.aspp_conv1(final) * F.interpolate(
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 419, in forward
    return self._conv_forward(input, self.weight)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 416, in _conv_forward
    self.padding, self.dilation, self.groups)
  File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
RuntimeError: Given groups=1, weight of size [128, 256, 1, 1], expected input[2, 576, 128, 256] to have 256 channels, but got 576 channels instead
```

(the second worker process prints an identical traceback, then the launcher fails:)

```
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train.py', '--local_rank=1', '--dataset', 'cityscapes', '--cv', '2', '--arch', 'lraspp.MobileV3Small', '--max_cu_epoch', '150', '--max_epoch', '175', '--lr', '1e-2', '--lr_schedule', 'poly', '--poly_exp', '1.0', '--syncbn', '--optimizer', 'sgd', '--full_crop_training', '--apex', '--fp16', '--rmi_loss', '--result_dir', 'logs/train_mobilev3small/fastseg-cv2-lraspp.MobileV3Small_cocky-jaguar_2021.07.12_20.20']' returned non-zero exit status 1.
```
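For reference, the mismatch the error reports follows directly from `Conv2d`'s shape rule: the input's channel count must equal `weight.shape[1] * groups`. A minimal pure-Python sketch of that check (the function name and messages are illustrative, not PyTorch internals):

```python
def conv2d_channel_check(weight_shape, input_shape, groups=1):
    """Mimic the channel validation that produces this RuntimeError.

    weight_shape: (out_channels, in_channels // groups, kH, kW)
    input_shape:  (batch, channels, H, W)
    """
    expected = weight_shape[1] * groups
    got = input_shape[1]
    if got != expected:
        raise ValueError(
            f"Given groups={groups}, weight of size {list(weight_shape)}, "
            f"expected input{list(input_shape)} to have {expected} channels, "
            f"but got {got} channels instead"
        )
    return True

# The failing case from the traceback: a 576-channel feature map fed into a
# 1x1 conv whose weight [128, 256, 1, 1] expects 256 input channels.
try:
    conv2d_channel_check((128, 256, 1, 1), (2, 576, 128, 256))
except ValueError as e:
    print(e)
```

So the conv at `lraspp.py` line 93 is constructed for 256 input channels, while the trunk actually hands it 576.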

karansapra commented 2 years ago

I haven't tried to use ekzhang's codebase; I know he split from ours a while back. He might be able to help here. @ekzhang

ekzhang commented 2 years ago

Hi @liuqinglong110, the repository has a couple of small bugs because I ported the code over after leaving Nvidia and did not have the hardware to run it. It's meant more as a guide, so I don't know whether you will be able to run it directly.

I believe another person was able to run the code, though; here is their branch: https://github.com/yamengxi/semantic-segmentation/commit/338258100a1c38b36d82b3d53bb19ac078165dcf. In addition to their own configuration changes, they corrected two typos:

[screenshot: first corrected typo]

[screenshot: second corrected typo]
If you could try that, maybe it would fix the shape mismatch issue you're seeing. I'll push this diff to the repository as well.
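To illustrate the numbers in the error: the failing `aspp_conv1` weight of size `[128, 256, 1, 1]` expects 256 input channels, while the trunk feature map it receives has 576 channels, so changing the conv's `in_channels` to match the trunk resolves the mismatch. A hedged sketch (illustrative only, not the repository's actual `lraspp.py`):

```python
import torch
import torch.nn as nn

TRUNK_CHANNELS = 576  # channel count of the input tensor in the error message

# Mismatched in_channels: calling this on a 576-channel input raises the
# RuntimeError from the traceback ("expected input ... to have 256 channels").
broken = nn.Conv2d(256, 128, kernel_size=1)

# in_channels matched to what the trunk actually produces.
fixed = nn.Conv2d(TRUNK_CHANNELS, 128, kernel_size=1)

x = torch.randn(2, TRUNK_CHANNELS, 8, 8)  # small spatial size for illustration
out = fixed(x)
print(out.shape)  # torch.Size([2, 128, 8, 8])
```

Whether the actual typo is in the conv's `in_channels` or in which trunk stage is fed to it, the fix in the linked branch should amount to making those two channel counts agree.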