Hi, I found out that it works with two GPUs. The batch size is shown as 16, so is the actual running batch size 16 or 32? And what could be the problem when running on a single GPU?
Namespace(alpha=0.5, aux_loss=True, backbone='resnet50', batch_size=16, detector='base', epochs=30, seed=42, world_size=2)
Rank 1: Load weights for the object detector from checkpoints/detr-r50-hicodet.pth
=> Rank 1: PViC randomly initialised.
Rank 0: Load weights for the object detector from checkpoints/detr-r50-hicodet.pth
=> Rank 0: PViC randomly initialised.
Epoch 0 => mAP: 0.1369, rare: 0.0926, none-rare: 0.1501.
Epoch [1/30], Iter. [0100/2352], Loss: 4.0167, Time[Data/Iter.]: [11.92s/113.07s]
Epoch [1/30], Iter. [2300/2352], Loss: 1.5389, Time[Data/Iter.]: [0.25s/69.78s]
Epoch 1 => mAP: 0.2485, rare: 0.1888, none-rare: 0.2663.
Epoch [2/30], Iter. [0048/2352], Loss: 1.5435, Time[Data/Iter.]: [9.55s/90.46s]
Epoch [2/30], Iter. [2348/2352], Loss: 1.4581, Time[Data/Iter.]: [0.31s/72.10s]
Epoch 2 => mAP: 0.2700, rare: 0.2171, none-rare: 0.2858.
Hi @yaoyaosanqi,
The batch size has been changed to refer to the total batch size instead of the per-rank batch size. The images will be evenly divided across all GPUs, rounded down if necessary.
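To make the arithmetic concrete, here is a minimal sketch (not the PViC source; the function name is illustrative) of how a total batch size of 16 maps onto two ranks in a distributed setup:

```python
# Illustrative sketch of total-batch-size semantics, assuming an even
# split across ranks with the remainder rounded down.
def per_rank_batch_size(total_batch_size: int, world_size: int) -> int:
    # Each GPU receives total_batch_size // world_size images.
    return total_batch_size // world_size

# With batch_size=16 and world_size=2, each GPU gets 8 images per
# iteration, so the effective batch size stays 16, not 32.
print(per_rank_batch_size(16, 2))  # -> 8
```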
The NaN loss has been an issue related to the spatial features (https://github.com/fredzzhang/upt/issues/71, https://github.com/fredzzhang/upt/issues/35). But since the batch size is already large enough, there may also be an issue with bad initialisation. Either way, if using multiple GPUs solves the problem, you should probably stick with that.
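If the NaN reappears on a single GPU, a simple guard like the sketch below (illustrative code, not from the repository; `guarded_step` is a hypothetical helper) can catch a non-finite loss and skip the update so the offending batch can be inspected:

```python
import torch

def guarded_step(loss: torch.Tensor, optimizer: torch.optim.Optimizer) -> bool:
    """Back-propagate and step only if the loss is finite."""
    if not torch.isfinite(loss):
        # Skip this batch and clear any stale gradients; log or dump the
        # batch here to diagnose the spatial features that produced NaN.
        optimizer.zero_grad(set_to_none=True)
        return False
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return True
```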
Fred.
Thanks, the problem above is solved. I noticed that you optimized the evaluation process for V-COCO. What does the reported mAP mean? Scenario 1, Scenario 2, or something else?
Rank 1: Load weights for the object detector from checkpoints/detr-r50-vcoco.pth
Rank 0: Load weights for the object detector from checkpoints/detr-r50-vcoco.pth
=> Rank 1: PViC randomly initialised.
=> Rank 0: PViC randomly initialised.
Epoch 0 => mAP: 0.3640.
Epoch [1/30], Iter. [100/311], Loss: 1.7373, Time[Data/Iter.]: [6.72s/72.60s]
Epoch [1/30], Iter. [200/311], Loss: 1.3745, Time[Data/Iter.]: [0.20s/64.13s]
Epoch [1/30], Iter. [300/311], Loss: 1.2762, Time[Data/Iter.]: [0.23s/65.20s]
Epoch 1 => mAP: 0.6238.
...
Epoch [29/30], Iter. [092/311], Loss: 0.7427, Time[Data/Iter.]: [6.89s/83.38s]
Epoch [29/30], Iter. [192/311], Loss: 0.7575, Time[Data/Iter.]: [0.34s/79.13s]
Epoch [29/30], Iter. [292/311], Loss: 0.7834, Time[Data/Iter.]: [0.28s/82.08s]
Epoch 29 => mAP: 0.7173.
Epoch [30/30], Iter. [081/311], Loss: 0.7520, Time[Data/Iter.]: [8.45s/92.81s]
Epoch [30/30], Iter. [181/311], Loss: 0.7621, Time[Data/Iter.]: [0.26s/79.81s]
Epoch [30/30], Iter. [281/311], Loss: 0.7563, Time[Data/Iter.]: [0.29s/80.52s]
Epoch 30 => mAP: 0.7177.
It is only a diagnostic tool and may be removed later. There are still many issues with the v-coco code that I need to fix.
Thanks!
Hi, thanks for your great work. I am using the DETR model (from UPT) to train on a single GPU with a batch size of 16. The training loss becomes NaN, and the problem persists even after trying different random seeds. Can you give me any advice?