Error when training the model - Githubissues

guo-pu commented 1 year ago

Hello, I am training on my own dataset. The command I run is:

    torchrun --nproc_per_node=1 tools/train_amp.py --config configs/bisenetv2_psv.py

It fails with the following error:

    RuntimeError: The size of tensor a (75) must match the size of tensor b (76) at non-singleton dimension 3

The full error output is as follows:

    Traceback (most recent call last):
      File "tools/train_amp.py", line 210, in <module>
        main()
      File "tools/train_amp.py", line 206, in main
        train()
      File "tools/train_amp.py", line 159, in train
        logits, logits_aux = net(im)
      File "/opt/conda/envs/park-net/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/envs/park-net/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
        output = self._run_ddp_forward(*inputs, **kwargs)
      File "/opt/conda/envs/park-net/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
        return module_to_run(*inputs[0], **kwargs[0])
      File "/opt/conda/envs/park-net/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
        return forward_call(*input, **kwargs)
      File "/guopu/BiSeNet-master/./lib/models/bisenetv2.py", line 335, in forward
        feat_head = self.bga(feat_d, feat_s)
      File "/opt/conda/envs/park-net/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
        return forward_call(*input, **kwargs)
      File "/guopu/BiSeNet-master/./lib/models/bisenetv2.py", line 277, in forward
        left = left1 * torch.sigmoid(right1)
    RuntimeError: The size of tensor a (75) must match the size of tensor b (76) at non-singleton dimension 3
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6516) of binary: /opt/conda/envs/park-net/bin/python
    Traceback (most recent call last):
      File "/opt/conda/envs/park-net/bin/torchrun", line 8, in <module>
        sys.exit(main())
      File "/opt/conda/envs/park-net/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
        return f(*args, **kwargs)
      File "/opt/conda/envs/park-net/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
        run(args)
      File "/opt/conda/envs/park-net/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
        elastic_launch(
      File "/opt/conda/envs/park-net/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/opt/conda/envs/park-net/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

The input images are 600x600.

What could be causing this, and how can I fix it?

guo-pu commented 1 year ago

Contents of the bisenetv2_psv.py config file:

    ## bisenetv2
    cfg = dict(
        model_type='bisenetv2',
        n_cats=6,  # n_classes
        num_aux_heads=4,
        lr_start=5e-3,
        weight_decay=5e-4,
        warmup_iters=1000,
        max_iter=150000,
        dataset='psv',
        im_root='./datasets/psv',
        train_im_anns='./datasets/psv/train.txt',
        val_im_anns='./datasets/psv/val.txt',
        scales=[0.25, 2.],
        cropsize=[600, 600],
        eval_crop=[600, 600],
        eval_scales=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
        ims_per_gpu=8,  # batchsize
        eval_ims_per_gpu=8,  # batchsize
        use_fp16=True,
        use_sync_bn=True,
        respth='./res',
    )

Output of check_dataset_info.py on the dataset:

    100%|██████████| 2550/2550 [00:30<00:00, 84.90it/s]
    100%|██████████| 2550/2550 [00:16<00:00, 155.70it/s]
    100%|██████████| 2550/2550 [00:51<00:00, 49.92it/s]
    100%|██████████| 2550/2550 [02:14<00:00, 19.00it/s]

there are 2550 lines in ./datasets/psv/train.txt, which means 2550 image/label image pairs

max and min image shapes by area are: (600, 600), (600, 600)
max and min image shapes by height are: (600, 600), (600, 600)
max and min image shapes by width are: (600, 600), (600, 600)

we ignore label value of 255 in label images
label values are within range of [0, 5]
label values that are missing: []
ratios of each label value(from small to big, without ignored): [0.9748857037037038, 0.01653214705882353, 0.005505824618736384, 0.00045158823529411763, 0.0009825294117647059, 0.0016422069716775598]

pixel mean rgb: [0.4598615424836601, 0.45117859694989104, 0.4178958779956427]
pixel std rgb: [0.22807834146953662, 0.2240104261501351, 0.21754879251373244]

What could be causing the error above, and how can I fix it?

CoinCheung commented 1 year ago

Please make sure cropsize is divisible by 32. 600 is not a good choice; maybe you can use 608.
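The arithmetic behind the 75 vs 76 mismatch can be sketched in plain Python. This is only an illustration under stated assumptions, not the repository's code: it assumes the detail branch runs at 1/8 resolution (three stride-2 stages), the semantic branch runs at 1/32 (five stride-2 stages) and is upsampled 4x before the element-wise product in the traceback, and every stride-2 stage rounds the size up. The function names `stride2_out` and `bga_input_widths` are hypothetical.

```python
import math
from typing import Tuple


def stride2_out(size: int, stages: int) -> int:
    """Size after `stages` stride-2 layers.

    Assumption: each layer uses kernel 3, padding 1, stride 2,
    which yields ceil(size / 2).
    """
    for _ in range(stages):
        size = math.ceil(size / 2)
    return size


def bga_input_widths(size: int) -> Tuple[int, int]:
    """Return (detail_width, upsampled_semantic_width) for a given crop width.

    Assumption: detail branch at 1/8 scale, semantic branch at 1/32 scale
    followed by 4x upsampling.
    """
    detail = stride2_out(size, 3)
    semantic = stride2_out(size, 5) * 4
    return detail, semantic


for size in (600, 608):
    print(size, bga_input_widths(size))
# 600 (75, 76)
# 608 (76, 76)
```

With 600 the two widths differ by one, which matches the sizes in the RuntimeError; with 608 both are 76, so the multiplication is well defined.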

guo-pu commented 1 year ago

I'll give it a try, thanks

guo-pu commented 1 year ago

Thanks, it works after changing the image resolution.
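For anyone hitting the same error, a small pre-flight check on the config can catch this before training starts. This is a hedged sketch: `round_up` and `check_crop_sizes` are hypothetical helpers, not part of the repository, and checking `eval_crop` in addition to `cropsize` is an assumption, since only `cropsize` was mentioned above.

```python
def round_up(size: int, multiple: int = 32) -> int:
    """Smallest multiple of `multiple` that is >= size."""
    return ((size + multiple - 1) // multiple) * multiple


def check_crop_sizes(cfg: dict, keys=("cropsize", "eval_crop"), multiple: int = 32) -> dict:
    """Return {key: suggested_sizes} for each key whose sizes are not
    all divisible by `multiple`. An empty dict means nothing to change."""
    suggestions = {}
    for key in keys:
        sizes = cfg.get(key)
        if sizes is None:
            continue
        if any(s % multiple for s in sizes):
            suggestions[key] = [round_up(s, multiple) for s in sizes]
    return suggestions


# Only the two size fields from the config in this thread are reproduced here.
cfg = dict(cropsize=[600, 600], eval_crop=[600, 600])
print(check_crop_sizes(cfg))
# {'cropsize': [608, 608], 'eval_crop': [608, 608]}
```

Rounding up gives 608, the value suggested above; rounding down to 576 would also satisfy the divisibility requirement.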

CoinCheung commented 1 year ago

Good to know that your problem is solved. I am closing this.

CoinCheung / BiSeNet

Error when training the model #291
