The problem with the training process.

rockywind commented 3 years ago

Thank you in advance for your help! When I run the script

python train.py --cfg_file /newnfs/zzwu/08_3d_code/CaDDN-master/tools/cfgs/kitti_models/CaDDN.yaml

I get the error below.

2021-06-03 11:28:38,987 INFO **********************Start training newnfs/zzwu/08_3d_code/CaDDN-master/tools/cfgs/kitti_models/CaDDN(default)**********************
epochs: 0%| | 0/80 [00:01<?, ?it/s]
| 0/3712 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 201, in <module>
    main()
  File "train.py", line 173, in main
    merge_all_iters_to_one_epoch=args.merge_all_iters_to_one_epoch
  File "/newnfs/zzwu/08_3d_code/CaDDN-master/tools/train_utils/train_utils.py", line 93, in train_model
    dataloader_iter=dataloader_iter
  File "/newnfs/zzwu/08_3d_code/CaDDN-master/tools/train_utils/train_utils.py", line 38, in train_one_epoch
    loss, tb_dict, disp_dict = model_func(model, batch)
  File "/newnfs/zzwu/08_3d_code/CaDDN/pcdet/models/__init__.py", line 39, in model_func
    ret_dict, tb_dict, disp_dict = model(batch_dict)
  File "/home/CN/zizhang.wu/anaconda3/envs/CaDDN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/newnfs/zzwu/08_3d_code/CaDDN/pcdet/models/detectors/caddn.py", line 11, in forward
    batch_dict = cur_module(batch_dict)
  File "/home/CN/zizhang.wu/anaconda3/envs/CaDDN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/newnfs/zzwu/08_3d_code/CaDDN/pcdet/models/backbones_3d/ffe/depth_ffe.py", line 51, in forward
    ddn_result = self.ddn(images)
  File "/home/CN/zizhang.wu/anaconda3/envs/CaDDN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/newnfs/zzwu/08_3d_code/CaDDN/pcdet/models/backbones_3d/ffe/ddn/ddn_template.py", line 114, in forward
    x = self.model.classifier(x)
  File "/home/CN/zizhang.wu/anaconda3/envs/CaDDN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/CaDDN/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/CN/zizhang.wu/anaconda3/envs/CaDDN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/CaDDN/lib/python3.7/site-packages/torchvision/models/segmentation/deeplabv3.py", line 92, in forward
    res.append(conv(x))
  File "/home/CN/zizhang.wu/anaconda3/envs/CaDDN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/CaDDN/lib/python3.7/site-packages/torchvision/models/segmentation/deeplabv3.py", line 61, in forward
    x = mod(x)
  File "/home/CN/zizhang.wu/anaconda3/envs/CaDDN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/CN/zizhang.wu/anaconda3/envs/CaDDN/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 107, in forward
    exponential_average_factor, self.eps)
  File "/home/CN/zizhang.wu/anaconda3/envs/CaDDN/lib/python3.7/site-packages/torch/nn/functional.py", line 1666, in batch_norm
    raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 256, 1, 1])

rockywind commented 3 years ago

My environment is: GPU: RTX 3090, CUDA: 11.1, PyTorch: 1.7.1

codyreading commented 3 years ago

Yup, this is an expected error. Unfortunately, the DeepLabV3 backbone can't be trained with a batch size of 1, because its batch norm layers need more than one value per channel in training mode. Please change the batch size to 2 or higher and this should work.

python train.py --cfg_file /newnfs/zzwu/08_3d_code/CaDDN-master/tools/cfgs/kitti_models/CaDDN.yaml --batch_size 2
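
For anyone wondering why a batch size of 1 trips this, here is a minimal sketch in plain Python of the condition named in the error message. It is not PyTorch's actual code: `values_per_channel` and `check_batchnorm_input` are hypothetical helpers written for illustration. It also assumes the failing layer sees a 1x1 feature map, as the `torch.Size([1, 256, 1, 1])` in the traceback suggests, which is likely the output of DeepLabV3's global pooling branch.

```python
from math import prod


def values_per_channel(shape):
    """Count the values batch norm averages over for each channel.

    `shape` is (N, C, *spatial); statistics are taken over N and the
    spatial dimensions, separately for every channel.
    """
    n, _channels, *spatial = shape
    return n * prod(spatial)


def check_batchnorm_input(shape, training=True):
    """Hypothetical helper mirroring the condition in the error message."""
    if training and values_per_channel(shape) <= 1:
        raise ValueError(
            "Expected more than 1 value per channel when training, "
            f"got input size {tuple(shape)}"
        )


# Shape from the traceback: batch size 1, 256 channels, 1x1 spatial size.
failing_shape = (1, 256, 1, 1)
# The same feature map when training with --batch_size 2.
working_shape = (2, 256, 1, 1)

for shape in (failing_shape, working_shape):
    try:
        check_batchnorm_input(shape)
        print(shape, "ok")
    except ValueError as err:
        print(shape, "fails:", err)
```

With a 1x1 feature map the batch dimension is the only source of more than one value per channel, so the batch statistics are only meaningful once the batch size is at least 2.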

rockywind commented 3 years ago

Thanks for your help. It works.

TRAILab / CaDDN

The problem with the training process. #33