fredzzhang / pvic

[ICCV'23] Official PyTorch implementation for paper "Exploring Predicate Visual Context in Detecting Human-Object Interactions"
BSD 3-Clause "New" or "Revised" License

The loss is NaN during training #29

Closed hutuo1213 closed 1 year ago

hutuo1213 commented 1 year ago

Hi, thanks for your great work. I am training with the DETR model (from UPT) on a single GPU with a batch size of 16. The training loss becomes NaN, and the problem persists even after trying different random seeds. Can you give me any advice?

Namespace(alpha=0.5, aux_loss=True, backbone='resnet50', batch_size=16,  detector='base', epochs=30, world_size=1)
Rank 0: Load weights for the object detector from checkpoints/detr-r50-hicodet.pth
=> Rank 0: PViC randomly initialised.

Epoch 0 =>  mAP: 0.1359, rare: 0.0912, none-rare: 0.1493.

Epoch [1/30], Iter. [0100/2353], Loss: 3.9044, Time[Data/Iter.]: [10.76s/153.78s]
...
Epoch [1/30], Iter. [2300/2353], Loss: 1.5333, Time[Data/Iter.]: [0.63s/147.35s]
...
100%|██████████| 597/597 [19:03<00:00,  1.92s/it]
Epoch 1 =>  mAP: 0.0127, rare: 0.0110, none-rare: 0.0132.

Traceback (most recent call last):
  File "main.py", line 195, in <module>
    mp.spawn(main, nprocs=args.world_size, args=(args,))
  File "/usr/local/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/pvic/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/quan/pvic/main.py", line 130, in main
    engine(args.epochs)
  File "/home/quan/pvic/pocket/core/distributed.py", line 139, in __call__
    self._on_each_iteration()
  File "/home/quan/pvic/utils.py", line 195, in _on_each_iteration
    raise ValueError(f"The HOI loss is NaN for rank {self._rank}")
ValueError: The HOI loss is NaN for rank 0
hutuo1213 commented 1 year ago

Hi, I found that it works with two GPUs. The batch size is shown as 16, so is the actual running batch size 16 or 32? What is the problem when running on one GPU?

Namespace(alpha=0.5, aux_loss=True, backbone='resnet50', batch_size=16,  detector='base', epochs=30, seed=42, world_size=2)
Rank 1: Load weights for the object detector from checkpoints/detr-r50-hicodet.pth
=> Rank 1: PViC randomly initialised.
Rank 0: Load weights for the object detector from checkpoints/detr-r50-hicodet.pth
=> Rank 0: PViC randomly initialised.
Epoch 0 =>  mAP: 0.1369, rare: 0.0926, none-rare: 0.1501.
Epoch [1/30], Iter. [0100/2352], Loss: 4.0167, Time[Data/Iter.]: [11.92s/113.07s]
Epoch [1/30], Iter. [2300/2352], Loss: 1.5389, Time[Data/Iter.]: [0.25s/69.78s]
Epoch 1 =>  mAP: 0.2485, rare: 0.1888, none-rare: 0.2663.
Epoch [2/30], Iter. [0048/2352], Loss: 1.5435, Time[Data/Iter.]: [9.55s/90.46s]
Epoch [2/30], Iter. [2348/2352], Loss: 1.4581, Time[Data/Iter.]: [0.31s/72.10s]
Epoch 2 =>  mAP: 0.2700, rare: 0.2171, none-rare: 0.2858.
fredzzhang commented 1 year ago

Hi @yaoyaosanqi,

The batch size has been changed to refer to the total batch size instead of the per-rank batch size. The images will be evenly divided across all GPUs, rounded down if necessary.
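
For clarity, here is a minimal sketch of that split (not the actual PViC code; the helper name is made up):

```python
def per_rank_batch_size(total_batch_size: int, world_size: int) -> int:
    """Images each GPU receives when batch_size refers to the total across all ranks."""
    return total_batch_size // world_size  # rounded down if it does not divide evenly

# batch_size=16 on one GPU keeps all 16 images on that GPU,
# while world_size=2 gives 8 images per GPU (still 16 in total).
print(per_rank_batch_size(16, 1))  # 16
print(per_rank_batch_size(16, 2))  # 8
```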

The NaN loss has been an issue related to the spatial features (https://github.com/fredzzhang/upt/issues/71, https://github.com/fredzzhang/upt/issues/35). But since the batch size is already large enough, it may instead be an issue with a bad initialisation. Either way, if using multiple GPUs solves the problem, you should probably stick with that.
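
If you want to pinpoint where the NaN first appears on a single GPU, here is a minimal sketch of the kind of finite check you can add to the training step (the model and optimizer calls are placeholders, not the actual code in `utils.py`):

```python
import torch

def assert_finite(name: str, tensor: torch.Tensor) -> None:
    # Raise early with a descriptive message instead of letting NaN/Inf
    # propagate until the generic "HOI loss is NaN" error above.
    if not torch.isfinite(tensor).all():
        raise ValueError(f"{name} contains NaN/Inf values")

def training_step(model, batch, optimizer):
    loss = model(*batch)             # placeholder: assume the model returns the HOI loss
    assert_finite("HOI loss", loss)  # same spirit as the check in _on_each_iteration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The same `assert_finite` check can be applied to the spatial features or any other intermediate tensor to narrow down the source.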

Fred.

hutuo1213 commented 1 year ago

Thanks, the above problem is solved. I noticed that you optimised the evaluation process for V-COCO. What does the output mAP refer to: Scenario 1, Scenario 2, or something else?

Rank 1: Load weights for the object detector from checkpoints/detr-r50-vcoco.pth
Rank 0: Load weights for the object detector from checkpoints/detr-r50-vcoco.pth
=> Rank 1: PViC randomly initialised.
=> Rank 0: PViC randomly initialised.
Epoch 0 =>  mAP: 0.3640.
Epoch [1/30], Iter. [100/311], Loss: 1.7373, Time[Data/Iter.]: [6.72s/72.60s]
Epoch [1/30], Iter. [200/311], Loss: 1.3745, Time[Data/Iter.]: [0.20s/64.13s]
Epoch [1/30], Iter. [300/311], Loss: 1.2762, Time[Data/Iter.]: [0.23s/65.20s]
Epoch 1 =>  mAP: 0.6238.
...
Epoch [29/30], Iter. [092/311], Loss: 0.7427, Time[Data/Iter.]: [6.89s/83.38s]
Epoch [29/30], Iter. [192/311], Loss: 0.7575, Time[Data/Iter.]: [0.34s/79.13s]
Epoch [29/30], Iter. [292/311], Loss: 0.7834, Time[Data/Iter.]: [0.28s/82.08s]
Epoch 29 => mAP: 0.7173.
Epoch [30/30], Iter. [081/311], Loss: 0.7520, Time[Data/Iter.]: [8.45s/92.81s]
Epoch [30/30], Iter. [181/311], Loss: 0.7621, Time[Data/Iter.]: [0.26s/79.81s]
Epoch [30/30], Iter. [281/311], Loss: 0.7563, Time[Data/Iter.]: [0.29s/80.52s]
Epoch 30 => mAP: 0.7177.
fredzzhang commented 1 year ago

It is only a diagnostic tool, which may be removed later. There are still many issues with the V-COCO code that I need to fix.

hutuo1213 commented 1 year ago

Thanks!