dvlab-research / SphereFormer

The official implementation for "Spherical Transformer for LiDAR-based 3D Recognition" (CVPR 2023).
Apache License 2.0

NaN or Inf found in input tensor during training, but training continues #56

Open Jayku88 opened 1 year ago

Jayku88 commented 1 year ago

[09/14 17:15:17 main-logger]: Epoch: [1/2][1310/19130] Data 0.001 (0.002) Batch 1.075 (1.127) Remain 11:33:44 Loss 1.0926 Lr: [0.00581479, 0.00058148] Accuracy 0.6878.
NaN or Inf found in input tensor.
[09/14 17:15:27 main-logger]: Epoch: [1/2][1320/19130] Data 0.001 (0.002) Batch 1.000 (1.126) Remain 11:33:00 Loss nan Lr: [0.00581338, 0.00058134] Accuracy 0.0689.
NaN or Inf found in input tensor.
[09/14 17:15:37 main-logger]: Epoch: [1/2][1330/19130] Data 0.001 (0.002) Batch 0.932 (1.125) Remain 11:32:20 Loss 0.6963 Lr: [0.00581196, 0.0005812] Accuracy 0.7945.
NaN or Inf found in input tensor.
[09/14 17:15:47 main-logger]: Epoch: [1/2][1340/19130] Data 0.001 (0.002) Batch 0.963 (1.124) Remain 11:31:31 Loss 1.0028 Lr: [0.00581054, 0.00058105] Accuracy 0.6936.
NaN or Inf found in input tensor.
[09/14 17:15:57 main-logger]: Epoch: [1/2][1350/19130] Data 0.001 (0.002) Batch 0.900 (1.123) Remain 11:30:41 Loss 1.0634 Lr: [0.00580913, 0.00058091] Accuracy 0.6222.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
[09/14 17:16:08 main-logger]: Epoch: [1/2][1360/19130] Data 0.001 (0.002) Batch 1.074 (1.123) Remain 11:30:32 Loss 0.8774 Lr: [0.00580771, 0.00058077] Accuracy 0.7543.
NaN or Inf found in input tensor.
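For reference, the repeated "NaN or Inf found in input tensor." line appears to come from the TensorBoard scalar writer when a NaN value is logged, and the Loss nan entry suggests the model weights may already be corrupted by that point. A minimal, hypothetical guard that skips non-finite losses could look like the sketch below; it is not the project's actual training loop, and `train_step` and the batch keys are purely illustrative.

```python
import torch

def train_step(model, batch, criterion, optimizer, writer, step):
    """One illustrative training step with a non-finite-loss guard."""
    optimizer.zero_grad()
    output = model(batch["input"])
    loss = criterion(output, batch["target"])

    # If the loss is NaN/Inf, skip this update instead of backpropagating it.
    if not torch.isfinite(loss):
        print(f"Non-finite loss at step {step}; skipping update.")
        return

    loss.backward()
    optimizer.step()
    # Only finite values reach the TensorBoard writer, so the warning disappears.
    writer.add_scalar("train/loss", loss.item(), step)
```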

X-Lai commented 1 year ago

May I know if you modified the code or the config file?

Jayku88 commented 1 year ago

Yes, the following modifications were made to the config file config/semantic_kitti/semantic_kitti_unet32_spherical_transformer.yaml:

  1. Line 3: changed `data_root` to the dataset path on my system
  2. Line 50: `train_gpu: [0]` instead of `train_gpu: [0, 1, 2, 3]`
  3. Line 52: `batch_size: 1` (any larger batch size gives a CUDA out-of-memory error; my GPU is an NVIDIA A4000 with 16 GB). A gradient-accumulation sketch for this situation is shown below.
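Since larger batch sizes run out of memory on a single 16 GB GPU, one possible workaround is gradient accumulation to raise the effective batch size. The sketch below is not part of the SphereFormer code; `train_epoch` and `ACCUM_STEPS` are illustrative.

```python
import torch

ACCUM_STEPS = 4  # effective batch size = batch_size * ACCUM_STEPS

def train_epoch(model, loader, criterion, optimizer):
    """Accumulate gradients over several small batches before each update."""
    optimizer.zero_grad()
    for i, (points, target) in enumerate(loader):
        # Divide by ACCUM_STEPS so the accumulated gradient matches a larger batch.
        loss = criterion(model(points), target) / ACCUM_STEPS
        loss.backward()
        if (i + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```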
X-Lai commented 1 year ago

I suspect the NaN happens because the batch_size is too small, since only a single GPU is used. Can you try using more GPUs for training?
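If more GPUs are not available, another commonly suggested mitigation is to scale the learning rate down with the total batch size and to clip gradients before each optimizer step. The sketch below is not a fix prescribed by the authors; the base learning rate, the original total batch size, the optimizer choice, and the placeholder model are all assumptions for illustration.

```python
import torch

# Placeholder model; the real SphereFormer model would be built from the config.
model = torch.nn.Linear(16, 19)

base_lr = 0.006          # assumed base_lr tuned for the original multi-GPU setup
orig_total_batch = 8     # assumed: 4 GPUs with a per-GPU batch size of 2
new_total_batch = 1      # single GPU, batch_size 1
scaled_lr = base_lr * new_total_batch / orig_total_batch  # linear scaling rule

# AdamW is used here only for illustration.
optimizer = torch.optim.AdamW(model.parameters(), lr=scaled_lr)

# Inside the training loop, clip gradients right before optimizer.step()
# to reduce the chance of the loss blowing up to NaN.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
```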