Beckschen / TransUNet

This repository includes the official project of TransUNet, presented in our paper: TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation.

train error: after some time, NaN or Inf found in input tensor. #32

Open andife opened 3 years ago

andife commented 3 years ago

Hello, when executing `python train.py --dataset owndataset --vit_name R50-ViT-B_16 --batch_size 12 --max_iterations 1000 --max_epochs 350`, training runs fine for a while and then the loss becomes NaN (see the log below).

Any idea what could be causing this?

iteration 755 : loss : 0.232612, loss_ce: 0.031451
iteration 756 : loss : 0.235377, loss_ce: 0.039011
iteration 757 : loss : 0.235090, loss_ce: 0.031103
iteration 758 : loss : 0.234754, loss_ce: 0.035912
iteration 759 : loss : 0.242864, loss_ce: 0.030254
iteration 760 : loss : 0.243939, loss_ce: 0.029230
11%|███▍ | 38/350 [04:56<41:17, 7.94s/it]
iteration 761 : loss : 0.232332, loss_ce: 0.029950
iteration 762 : loss : 0.236290, loss_ce: 0.032639
iteration 763 : loss : 0.239043, loss_ce: 0.029412
iteration 764 : loss : 0.223232, loss_ce: 0.036379
iteration 765 : loss : 0.227415, loss_ce: 0.031555
iteration 766 : loss : 0.228688, loss_ce: 0.030908
iteration 767 : loss : 0.246761, loss_ce: 0.032261
iteration 768 : loss : 0.230575, loss_ce: 0.029101
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
iteration 769 : loss : nan, loss_ce: nan
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
iteration 770 : loss : nan, loss_ce: nan
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
iteration 771 : loss : nan, loss_ce: nan
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
iteration 772 : loss : nan, loss_ce: nan
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
iteration 773 : loss : nan, loss_ce: nan
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
iteration 774 : loss : nan, loss_ce: nan
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
iteration 775 : loss : nan, loss_ce: nan
NaN or Inf found in input tensor.
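(Editor's note: not part of the repository, but a generic way to make this kind of failure less destructive is to guard each optimization step against a non-finite loss and clip gradients. A minimal PyTorch sketch, where `model`, `optimizer`, `loss_fn`, and the batch tensors are placeholder names, not names from the TransUNet code:)

```python
import math
import torch

def train_step(model, optimizer, loss_fn, image_batch, label_batch, max_grad_norm=1.0):
    optimizer.zero_grad()
    outputs = model(image_batch)
    loss = loss_fn(outputs, label_batch)

    # Skip the update if the loss is already NaN/Inf, so one bad batch
    # does not poison the weights for every later iteration.
    if not math.isfinite(loss.item()):
        print("non-finite loss detected, skipping this batch")
        return None

    loss.backward()
    # Gradient clipping often prevents the blow-up in the first place.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```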

Beckschen commented 3 years ago

Hello, thanks for your question. May I kindly ask what learning rate you used with batch_size=12?

andife commented 3 years ago

The following is the startup output:

(base) user@pc1:~/project_TransUNet/TransUNet$ python train.py --dataset owndataset --vit_name R50-ViT-B_16 --batch_size 12 --max_iterations 1000 --max_epochs 350
Namespace(base_lr=0.005, batch_size=12, dataset='Owndataset', deterministic=1, exp='TU_Owndataset224', img_size=224, is_pretrain=True, list_dir='./lists/lists_Owndataset', max_epochs=350, max_iterations=1000, n_gpu=1, n_skip=3, num_classes=2, root_path='../data/Owndataset/train_npz', seed=1234, vit_name='R50-ViT-B_16', vit_patches_size=16)
The length of train set is: 234
20 iterations per epoch. 7000 max iterations
0%| | 0/350 [00:00<?, ?it/s]
iteration 1 : loss : 0.541960, loss_ce: 0.527893

I realize that I had changed the train.py file in order to match test.py.

I added the following lines:

    if args.batch_size != 24 and args.batch_size % 6 == 0:
        args.base_lr *= args.batch_size / 24
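(Editor's note: with the arguments shown above, this rule would scale the learning rate from 0.005 down to 0.0025. A quick check of the arithmetic, assuming the snippet is applied exactly as written:)

```python
batch_size = 12
base_lr = 0.005

# Linear LR scaling with batch_size=24 as the reference point,
# mirroring the snippet above.
if batch_size != 24 and batch_size % 6 == 0:
    base_lr *= batch_size / 24

print(base_lr)  # 0.0025
```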

=> Maybe it would be better if the test did not require specifying all the individual training parameters again, but only the checkpoint directory, from which the model settings could be read? My starting point was that, because of this different learning-rate treatment, the same command-line arguments did not let me run a training and then a test.
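(Editor's note: one way to realize that suggestion, not part of the repository and using hypothetical helper names, is to dump the parsed arguments next to the checkpoint during training and reload them at test time, so only the snapshot directory has to be passed:)

```python
import argparse
import json
import os

def save_args(args: argparse.Namespace, snapshot_path: str) -> None:
    # Write the full training configuration next to the model checkpoint.
    with open(os.path.join(snapshot_path, "train_args.json"), "w") as f:
        json.dump(vars(args), f, indent=2)

def load_args(snapshot_path: str) -> argparse.Namespace:
    # Restore the exact configuration used for training, including the
    # (possibly rescaled) base_lr, so test.py only needs the directory.
    with open(os.path.join(snapshot_path, "train_args.json")) as f:
        return argparse.Namespace(**json.load(f))
```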

congcongwy51 commented 3 weeks ago

May I ask, have you solved this? (translated from Chinese)