amazon-science / siam-mot

SiamMOT: Siamese Multi-Object Tracking

Error when training the model #11

Closed xsu-yy closed 3 years ago

xsu-yy commented 3 years ago

Thanks for the great work. When I train the model on the MOT17 dataset with the following command:

python3 -m torch.distributed.launch --nproc_per_node=2 tools/train_net.py --config-file configs/dla/DLA_34_FPN_EMM_MOT17.yaml --train-dir my_train_results/MOT17_TEST/ --model-suffix pth

I get the following error:

Traceback (most recent call last):
  File "tools/train_net.py", line 132, in <module>
    main()
  File "tools/train_net.py", line 128, in main
    train(cfg, train_dir, args.local_rank, args.distributed, logger)
  File "tools/train_net.py", line 80, in train
    logger, tensorboard_writer
  File "./siammot/engine/trainer.py", line 51, in do_train
    result, loss_dict = model(images, targets)
  File "/home/sx/Documents/anaconda/anaconda3/envs/pt170/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sx/Documents/anaconda/anaconda3/envs/pt170/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/sx/Documents/anaconda/anaconda3/envs/pt170/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sx/Documents/anaconda/anaconda3/envs/pt170/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_initialize.py", line 197, in new_fwd
    **applier(kwargs, input_caster))
  File "./siammot/modelling/rcnn.py", line 47, in forward
    features = self.backbone(images.tensors)
  File "/home/sx/Documents/anaconda/anaconda3/envs/pt170/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sx/Documents/anaconda/anaconda3/envs/pt170/lib/python3.6/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/sx/Documents/anaconda/anaconda3/envs/pt170/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./siammot/modelling/backbone/dla.py", line 297, in forward
    x5 = self.level5(x4)
  File "/home/sx/Documents/anaconda/anaconda3/envs/pt170/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./siammot/modelling/backbone/dla.py", line 231, in forward
    x1 = self.tree1(x, residual)
  File "/home/sx/Documents/anaconda/anaconda3/envs/pt170/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./siammot/modelling/backbone/dla.py", line 54, in forward
    out += residual
RuntimeError: The size of tensor a (47) must match the size of tensor b (46) at non-singleton dimension 3

Can anybody help me? Thank you!
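For anyone reading along: the failure happens at a residual addition where two branches downsample an odd-sized feature map with different rounding, so their spatial sizes end up off by one. The following is a minimal, self-contained sketch (my own illustration, not the actual SiamMOT/DLA code) that reproduces the same kind of mismatch:

import torch
import torch.nn as nn

# Sketch of a residual block whose main path uses a stride-2 3x3 conv and whose
# skip path uses a stride-2 max-pool. For an odd input size the two paths round
# differently, which triggers the same error as `out += residual` in dla.py.
conv = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)  # odd H -> (H + 1) // 2
pool = nn.MaxPool2d(kernel_size=2, stride=2)                  # odd H -> H // 2

x = torch.randn(1, 16, 93, 93)  # 93 is not divisible by 32
out = conv(x)                   # shape: 1 x 16 x 47 x 47
residual = pool(x)              # shape: 1 x 16 x 46 x 46
out += residual                 # RuntimeError: size 47 vs 46 at dimension 3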

fengchaoqun9527 commented 3 years ago

I got the same error. Does anybody know how to solve it?

fengchaoqun9527 commented 3 years ago

This error seems to be related to the batch size.

bingshuai2019 commented 3 years ago

Can you double-check your configuration file and make sure that DATALOADER.SIZE_DIVISIBILITY is set to 32, as in https://github.com/amazon-research/siam-mot/blob/main/configs/dla/DLA_34_FPN_EMM.yaml#L38?

This error is caused by a feature-size mismatch between different levels of the DLA backbone when the input image size is not divisible by 32.
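To make the divisibility point concrete, here is a small illustrative helper (my own sketch, not code from this repository) showing the kind of bottom/right padding that a setting like DATALOADER.SIZE_DIVISIBILITY = 32 is meant to guarantee before images enter the backbone:

import math
import torch
import torch.nn.functional as F

def pad_to_divisible(image: torch.Tensor, size_divisibility: int = 32) -> torch.Tensor:
    """Pad a C x H x W image on the bottom/right so H and W become multiples
    of `size_divisibility`, keeping feature sizes consistent across DLA levels."""
    _, h, w = image.shape
    new_h = int(math.ceil(h / size_divisibility) * size_divisibility)
    new_w = int(math.ceil(w / size_divisibility) * size_divisibility)
    # F.pad pads the last two dims in the order (left, right, top, bottom).
    return F.pad(image, (0, new_w - w, 0, new_h - h), value=0)

padded = pad_to_divisible(torch.randn(3, 749, 1333))
print(padded.shape)  # torch.Size([3, 768, 1344])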

xsu-yy commented 3 years ago

> Can you double-check your configuration file and make sure that DATALOADER.SIZE_DIVISIBILITY is set to 32, as in https://github.com/amazon-research/siam-mot/blob/main/configs/dla/DLA_34_FPN_EMM.yaml#L38?
>
> This error is caused by a feature-size mismatch between different levels of the DLA backbone when the input image size is not divisible by 32.

Thank you. I have checked this parameter; it is 32 and I have not modified it. However, I solved the problem by changing the batch size, and I don't know why that works.

milktean commented 2 years ago

Hello, I want to ask you for help! Where can I change the batch size? I do not see it in https://github.com/amazon-research/siam-mot/blob/main/configs/dla/DLA_34_FPN_EMM_MOT17.yaml