liuqinglong110 opened 2 years ago
I haven't tried to use ekzhang's codebase; I know he forked from ours a while back. He might be able to help here. @ekzhang.
Hi @liuqinglong110, the repository has a couple of small bugs because I ported the code over after leaving Nvidia and did not have the hardware to run it. It is meant more as a guide, so I don't know whether you will be able to run it directly.
I believe another person was able to run the code, though; here is their branch: https://github.com/yamengxi/semantic-segmentation/commit/338258100a1c38b36d82b3d53bb19ac078165dcf. They corrected two typos in addition to making their own configuration changes:
If you could try that, it might fix the shape mismatch you're seeing. I'll push this diff to the repository as well.
I used train_mobilev3small.yml for training, but I keep getting errors.

The train_mobilev3small.yml is from https://github.com/ekzhang/fastseg. I launched training with:

```
CUDA_VISIBLE_DEVICES=2,3 python3 -m runx.runx scripts/train_mobilev3small.yml -i
```
```
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
None None
Global Rank: 0 Local Rank: 0
Global Rank: 1 Local Rank: 1
Torch version: 1.6, 1.6.0+cu101
dataset = cityscapes
ignore_label = 255
num_classes = 19
cv split val 2 ['train/monchengladbach', 'train/strasbourg', 'train/stuttgart']
mode val found 655 images
cn num_classes 19
cv split train 2 ['val/lindau', 'val/munster', 'val/frankfurt', 'train/aachen', 'train/bochum', 'train/bremen', 'train/cologne', 'train/darmstadt', 'train/dusseldorf', 'train/erfurt', 'train/hamburg', 'train/hanover', 'train/jena', 'train/krefeld', 'train/tubingen', 'train/ulm', 'train/weimar', 'train/zurich']
mode train found 2820 images
cn num_classes 19
Loading centroid file /app/uniform_centroids/cityscapes_cv2_tile1024.json
Found 19 centroids
Class Uniform Percentage: 0.5
Class Uniform items per Epoch: 2820
cls 0 len 5541 cls 1 len 4897 cls 2 len 5357 cls 3 len 1268 cls 4 len 1537 cls 5 len 5398 cls 6 len 2703 cls 7 len 4610 cls 8 len 5185 cls 9 len 2407 cls 10 len 4436 cls 11 len 3530 cls 12 len 1329 cls 13 len 4864 cls 14 len 415 cls 15 len 398 cls 16 len 183 cls 17 len 551 cls 18 len 2272
Using Cross Entropy Loss
Trunk: mobilenetv3_small
Model params = 1.1M
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Class Uniform Percentage: 0.5
Class Uniform items per Epoch: 2820
cls 0 len 5541 cls 1 len 4897 cls 2 len 5357 cls 3 len 1268 cls 4 len 1537 cls 5 len 5398 cls 6 len 2703 cls 7 len 4610 cls 8 len 5185 cls 9 len 2407 cls 10 len 4436 cls 11 len 3530 cls 12 len 1329 cls 13 len 4864 cls 14 len 415 cls 15 len 398 cls 16 len 183 cls 17 len 551 cls 18 len 2272
Traceback (most recent call last):
  File "train.py", line 601, in <module>
    main()
  File "train.py", line 451, in main
    train(train_loader, net, optim, epoch)
  File "train.py", line 491, in train
    main_loss = net(inputs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/parallel/distributed.py", line 560, in forward
    result = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/app/code/semantic-segmentation/network/lraspp.py", line 93, in forward
    aspp = self.aspp_conv1(final) * F.interpolate(
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 419, in forward
    return self._conv_forward(input, self.weight)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 416, in _conv_forward
    self.padding, self.dilation, self.groups)
  File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/wrap.py", line 28, in wrapper
    return orig_fn(*new_args, **kwargs)
RuntimeError: Given groups=1, weight of size [128, 256, 1, 1], expected input[2, 576, 128, 256] to have 256 channels, but got 576 channels instead
```
(The second worker process prints an identical traceback.)
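The RuntimeError at the bottom is PyTorch's convolution channel check: `aspp_conv1`'s weight was built as `[128, 256, 1, 1]` (256 input channels), while the trunk feature `final` has 576 channels, the final-stage width of MobileNetV3-Small. A minimal, torch-free sketch of the check that fails (the function name is hypothetical, written only to illustrate the rule):

```python
def check_conv_input(weight_shape, input_shape, groups=1):
    """Mimic the channel validation a 2-D convolution performs.

    weight_shape: (out_channels, in_channels // groups, kH, kW)
    input_shape:  (N, C, H, W)
    Returns the output shape for a 1x1 convolution with stride 1, or
    raises ValueError with a message shaped like PyTorch's RuntimeError.
    """
    out_channels, in_per_group, kh, kw = weight_shape
    n, c, h, w = input_shape
    expected = in_per_group * groups
    if c != expected:
        raise ValueError(
            f"Given groups={groups}, weight of size {list(weight_shape)}, "
            f"expected input{list(input_shape)} to have {expected} channels, "
            f"but got {c} channels instead"
        )
    return (n, out_channels, h, w)

# The failing call from the traceback: 256-channel weight, 576-channel input.
# check_conv_input((128, 256, 1, 1), (2, 576, 128, 256))  # raises ValueError
# With the weight built for the trunk's actual width, the shapes line up:
print(check_conv_input((128, 576, 1, 1), (2, 576, 128, 256)))  # (2, 128, 128, 256)
```

So the fix is not in the data or the launcher: the head's first 1×1 convolution simply has to be constructed with `in_channels` matching what the trunk emits.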
```
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'train.py', '--local_rank=1', '--dataset', 'cityscapes', '--cv', '2', '--arch', 'lraspp.MobileV3Small', '--max_cu_epoch', '150', '--max_epoch', '175', '--lr', '1e-2', '--lr_schedule', 'poly', '--poly_exp', '1.0', '--syncbn', '--optimizer', 'sgd', '--full_crop_training', '--apex', '--fp16', '--rmi_loss', '--result_dir', 'logs/train_mobilev3small/fastseg-cv2-lraspp.MobileV3Small_cocky-jaguar_2021.07.12_20.20']' returned non-zero exit status 1.
```
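In other words, the head was constructed for a different trunk width than MobileNetV3-Small actually produces (576 channels at the final stage, per the error message). The widths below come from the traceback; the classes are hypothetical stand-ins sketching one defensive pattern, in which the head takes its input width from the trunk object instead of hard-coding it:

```python
class Trunk:
    """Stand-in for a MobileNetV3-Small feature extractor: records the
    channel width of the feature map it emits (576 per the traceback)."""
    out_channels = 576

    def forward(self, shape):
        # Shapes only; a real trunk would also reduce the spatial dims.
        n, _, h, w = shape
        return (n, self.out_channels, h, w)


class Head:
    """Stand-in for an LR-ASPP-style head: derives its 1x1-conv input
    width from the trunk rather than assuming a fixed 256."""
    def __init__(self, trunk, mid_channels=128):
        self.in_channels = trunk.out_channels   # 576, not a hard-coded 256
        self.mid_channels = mid_channels

    def forward(self, feat_shape):
        n, c, h, w = feat_shape
        assert c == self.in_channels, "trunk/head channel mismatch"
        return (n, self.mid_channels, h, w)


trunk = Trunk()
head = Head(trunk)                       # head width follows the trunk
feat = trunk.forward((2, 3, 1024, 2048))
out = head.forward(feat)                 # (2, 128, 1024, 2048)
```

Wiring the head's `in_channels` to the trunk this way makes a trunk swap (e.g. Large vs. Small) fail loudly at construction time rather than mid-training inside an AMP-wrapped forward pass.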