Closed neverstoplearn closed 6 months ago
When I train, I get this error. How can I fix it? Thanks.
@neverstoplearn
During training, Batch Normalization (BN) needs more than one sample per batch to compute batch statistics. Your log shows 1081 training images; with a batch size of 4, the last batch contains only a single image, and BN fails on it with exactly this error. The simplest workaround is to remove one image from the training set so the last batch is never a single sample. It's not elegant, but it gets the job done.
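A cleaner alternative to deleting an image is to let the `DataLoader` discard the incomplete final batch via `drop_last=True`. A minimal sketch, where the `TensorDataset` is just a stand-in for the actual crowd-counting dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset with 1081 samples, mirroring the failing run:
# 1081 % 4 == 1, so without drop_last the final batch has a single sample.
dataset = TensorDataset(torch.randn(1081, 3, 32, 32))
loader = DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True)

sizes = [batch[0].size(0) for batch in loader]
print(len(sizes), set(sizes))  # 270 full batches, every one of size 4
```

With `drop_last=True` the dataset itself stays untouched, and shuffling means a different sample is dropped each epoch.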
@neverstoplearn
2024-05-04 15:59:02,926 - INFO - data_dir : /datasets/QNRF-Train-Val-Test
2024-05-04 15:59:02,926 - INFO - dataset : qnrf
2024-05-04 15:59:02,926 - INFO - arch : FFNet
2024-05-04 15:59:02,926 - INFO - lr : 1e-05
2024-05-04 15:59:02,926 - INFO - eta_min : 1e-05
2024-05-04 15:59:02,926 - INFO - weight_decay : 0
2024-05-04 15:59:02,926 - INFO - resume :
2024-05-04 15:59:02,926 - INFO - max_epoch : 2000
2024-05-04 15:59:02,926 - INFO - val_epoch : 1
2024-05-04 15:59:02,926 - INFO - val_start : 500
2024-05-04 15:59:02,926 - INFO - batch_size : 4
2024-05-04 15:59:02,926 - INFO - device : 0
2024-05-04 15:59:02,926 - INFO - num_workers : 16
2024-05-04 15:59:02,926 - INFO - crop_size : 512
2024-05-04 15:59:02,926 - INFO - wot : 0.1
2024-05-04 15:59:02,926 - INFO - wtv : 0.01
2024-05-04 15:59:02,926 - INFO - reg : 10.0
2024-05-04 15:59:02,926 - INFO - num_of_iter_in_ot: 100
2024-05-04 15:59:02,926 - INFO - norm_cood : 0
2024-05-04 15:59:02,926 - INFO - run_name : FFNet-16-1e-5_1e-5-4_1-21
2024-05-04 15:59:02,926 - INFO - wandb : 0
2024-05-04 15:59:02,926 - INFO - seed : 21
2024-05-04 15:59:02,976 - INFO - using 1 gpus
number of img: 1080
number of img: 120
2024-05-04 15:59:04,014 - INFO - random initialization
2024-05-04 15:59:04,014 - INFO - -----Epoch 0/2000-----
/home/deeplearn/JupyterlabRoot/erdongsanshi/FFNet/losses/bregmanpytorch.py:173: UserWarning: An output with one or more elements was resized since it had shape [4096], which does not match the required output shape [1, 4096]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize(0). (Triggered internally at ../aten/src/ATen/native/Resize.cpp:17.)
torch.matmul(u, K, out=KTu)
2024-05-04 15:59:46,367 - INFO - Epoch 0 Train, Loss: 142.83, OT Loss: -1.48e-07, Wass Distance: 234.76, OT obj value: 67.98, Count Loss: 140.64, TV Loss: 2.19, MSE: 239.89 MAE: 140.64, Cost 42.4 sec
@neverstoplearn I analyzed my model's structure: there is a dynamic convolution in the neck of the model that also requires every batch to contain more than one sample, which is related to this error. But the root cause is that the last batch contains only one image. You can solve the problem as described above.
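Another way to guard against this without touching the dataset is to skip any batch that reaches the network with fewer than two samples. A hedged sketch, with a standalone `BatchNorm2d` standing in for the BN inside the dynamic convolution's attention block (which sees pooled `(N, C, 1, 1)` tensors and therefore fails whenever `N == 1`):

```python
import torch
from torch import nn

# Stand-in for the BN layer inside the dynamic convolution's attention branch.
bn = nn.BatchNorm2d(16)
bn.train()

def safe_forward(x):
    # BatchNorm needs more than one value per channel to estimate batch
    # statistics, so skip degenerate single-sample batches instead of crashing.
    if x.size(0) < 2:
        return None
    return bn(x)

print(safe_forward(torch.randn(1, 16, 1, 1)))        # None: batch skipped
print(safe_forward(torch.randn(4, 16, 1, 1)).shape)  # torch.Size([4, 16, 1, 1])
```

Skipping wastes one sample per epoch, which is usually negligible; `drop_last=True` on the `DataLoader` achieves the same effect at the data-loading level.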
Thanks, I solved it.
2024-04-29 00:26:44,217 - INFO - data_dir : ./QNRF
2024-04-29 00:26:44,218 - INFO - dataset : qnrf
2024-04-29 00:26:44,218 - INFO - arch : FFNet
2024-04-29 00:26:44,218 - INFO - lr : 1e-05
2024-04-29 00:26:44,218 - INFO - eta_min : 1e-05
2024-04-29 00:26:44,218 - INFO - weight_decay : 0
2024-04-29 00:26:44,218 - INFO - resume :
trainer.train()
File "/home/user/zx/FFNet/train_helper_FFNet.py", line 181, in train
self.train_epoch()
File "/home/user/zx/FFNet/train_helper_FFNet.py", line 205, in train_epoch
outputs, outputs_normed = self.model(inputs)
File "/home/user/anaconda3/envs/intern/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/anaconda3/envs/intern/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/zx/FFNet/Networks/FFNet.py", line 159, in forward
pool1 = self.ccsm1(pool1)
File "/home/user/anaconda3/envs/intern/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/anaconda3/envs/intern/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/zx/FFNet/Networks/FFNet.py", line 115, in forward
x = self.conv1(x)
File "/home/user/anaconda3/envs/internLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/anaconda3/envs/internLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/anaconda3/envs/internLM/lib/python3.10/site-packages/torch/nn/modules/container.py", line 215, in forward
input = module(input)
File "/home/user/anaconda3/envs/internLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/anaconda3/envs/intern/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/zx/FFNet/Networks/ODConv2d.py", line 141, in forward
return self._forward_impl(x)
File "/home/user/zx/FFNet/Networks/ODConv2d.py", line 119, in _forward_impl_common
channel_attention, filter_attention, spatial_attention, kernel_attention = self.attention(x)
File "/home/user/anaconda3/envs/intern/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/anaconda3/envs/intern/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/zx/FFNet/Networks/ODConv2d.py", line 81, in forward
x = self.bn(x)
File "/home/user/anaconda3/envs/intern/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/anaconda3/envs/intern/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/anaconda3/envs/intern/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py", line 171, in forward
return F.batch_norm(
File "/home/user/anaconda3/envs/intern/lib/python3.10/site-packages/torch/nn/functional.py", line 2476, in batch_norm
_verify_batch_size(input.size())
File "/home/user/anaconda3/envs/intern/lib/python3.10/site-packages/torch/nn/functional.py", line 2444, in _verify_batch_size
raise ValueError(f"Expected more than 1 value per channel when training, got input size {size}")
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 16, 1, 1])
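For reference, the error at the bottom of this traceback can be reproduced in isolation: a `BatchNorm2d` in training mode over a `[1, 16, 1, 1]` input sees exactly one value per channel, so it cannot compute batch statistics:

```python
import torch
from torch import nn

bn = nn.BatchNorm2d(16)
bn.train()  # training mode forces per-batch statistics

try:
    bn(torch.randn(1, 16, 1, 1))  # one value per channel -> no variance
except ValueError as err:
    message = str(err)

print(message)
```

In `eval()` mode the same call succeeds, since BN then uses its running statistics instead of batch statistics; that is why this crash only shows up during training.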
2024-04-29 00:26:44,218 - INFO - max_epoch : 2000
2024-04-29 00:26:44,218 - INFO - val_epoch : 1
2024-04-29 00:26:44,218 - INFO - val_start : 500
2024-04-29 00:26:44,218 - INFO - batch_size : 4
2024-04-29 00:26:44,218 - INFO - device : 0
2024-04-29 00:26:44,218 - INFO - num_workers : 16
2024-04-29 00:26:44,218 - INFO - crop_size : 512
2024-04-29 00:26:44,218 - INFO - wot : 0.1
2024-04-29 00:26:44,218 - INFO - wtv : 0.01
2024-04-29 00:26:44,218 - INFO - reg : 10.0
2024-04-29 00:26:44,218 - INFO - num_of_iter_in_ot: 100
2024-04-29 00:26:44,218 - INFO - norm_cood : 0
2024-04-29 00:26:44,218 - INFO - run_name : FFNet-16-1e-5_1e-5-4_1-21
2024-04-29 00:26:44,218 - INFO - wandb : 0
2024-04-29 00:26:44,218 - INFO - seed : 21
2024-04-29 00:26:45,299 - INFO - using 1 gpus
number of img: 1081
number of img: 120
2024-04-29 00:26:48,228 - INFO - random initialization
2024-04-29 00:26:48,229 - INFO - -----Epoch 0/2000-----
/home/user/zx/FFNet/losses/bregmanpytorch.py:173: UserWarning: An output with one or more elements was resized since it had shape [4096], which does not match the required output shape [1, 4096]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize(0). (Triggered internally at ../aten/src/ATen/native/Resize.cpp:28.)
torch.matmul(u, K, out=KTu)
Traceback (most recent call last):
File "/home/user/zx/FFNet/train.py", line 93, in