Epiphqny / VisTR

[CVPR2021 Oral] End-to-End Video Instance Segmentation with Transformers
https://arxiv.org/abs/2011.14503
Apache License 2.0

Multi-GPU training #72

Open Auroralyxa opened 2 years ago

Auroralyxa commented 2 years ago

Hi, I am using 8 P100 GPUs (16 GB each) to train the model, and I get the error below (partial log):

number of params: 75833633
loading annotations into memory...
Done (t=20.13s)
creating index...
index created!
Start training
Epoch: [0]  [   0/7731]  eta: 15 days, 22:29:09  lr: 0.000100  class_error: 88.02  loss: 79.9960 (79.9960)  loss_bbox: 7.1450 (7.1450)  loss_bbox_0: 6.8492 (6.8492)  loss_bbox_1: 6.9822 (6.9822)  loss_bbox_2: 7.0482 (7.0482)  loss_bbox_3: 7.0655 (7.0655)  loss_bbox_4: 7.0992 (7.0992)  loss_ce: 3.8196 (3.8196)  loss_ce_0: 3.8680 (3.8680)  loss_ce_1: 3.7678 (3.7678)  loss_ce_2: 3.8110 (3.8110)  loss_ce_3: 3.8432 (3.8432)  loss_ce_4: 3.8701 (3.8701)  loss_dice: 0.7227 (0.7227)  loss_giou: 2.3688 (2.3688)  loss_giou_0: 2.3062 (2.3062)  loss_giou_1: 2.3316 (2.3316)  loss_giou_2: 2.3469 (2.3469)  loss_giou_3: 2.3410 (2.3410)  loss_giou_4: 2.3517 (2.3517)  loss_mask: 0.0581 (0.0581)  cardinality_error_unscaled: 306.0000 (306.0000)  cardinality_error_0_unscaled: 306.0000 (306.0000)  cardinality_error_1_unscaled: 306.0000 (306.0000)  cardinality_error_2_unscaled: 306.0000 (306.0000)  cardinality_error_3_unscaled: 306.0000 (306.0000)  cardinality_error_4_unscaled: 306.0000 (306.0000)  class_error_unscaled: 88.0208 (88.0208)  loss_bbox_unscaled: 1.4290 (1.4290)  loss_bbox_0_unscaled: 1.3698 (1.3698)  loss_bbox_1_unscaled: 1.3964 (1.3964)  loss_bbox_2_unscaled: 1.4096 (1.4096)  loss_bbox_3_unscaled: 1.4131 (1.4131)  loss_bbox_4_unscaled: 1.4198 (1.4198)  loss_ce_unscaled: 3.8196 (3.8196)  loss_ce_0_unscaled: 3.8680 (3.8680)  loss_ce_1_unscaled: 3.7678 (3.7678)  loss_ce_2_unscaled: 3.8110 (3.8110)  loss_ce_3_unscaled: 3.8432 (3.8432)  loss_ce_4_unscaled: 3.8701 (3.8701)  loss_dice_unscaled: 0.7227 (0.7227)  loss_giou_unscaled: 1.1844 (1.1844)  loss_giou_0_unscaled: 1.1531 (1.1531)  loss_giou_1_unscaled: 1.1658 (1.1658)  loss_giou_2_unscaled: 1.1734 (1.1734)  loss_giou_3_unscaled: 1.1705 (1.1705)  loss_giou_4_unscaled: 1.1759 (1.1759)  loss_mask_unscaled: 0.0581 (0.0581)  time: 178.1075  data: 138.0209  max mem: 3236
Traceback (most recent call last):
  File "main.py", line 212, in <module>
    main(args)
  File "main.py", line 185, in main
    args.clip_max_norm)
  File "E:\lyx\VisTR-master\engine.py", line 49, in train_one_epoch
    losses.backward()
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\autograd\__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: [..\third_party\gloo\gloo\transport\uv\unbound_buffer.cc:67] Timed out waiting 1800000ms for recv operation to complete
Traceback (most recent call last):
  File "main.py", line 212, in <module>
    main(args)
  File "main.py", line 185, in main
    args.clip_max_norm)
  File "E:\lyx\VisTR-master\engine.py", line 49, in train_one_epoch
    losses.backward()
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\autograd\__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: [..\third_party\gloo\gloo\transport\uv\unbound_buffer.cc:103] Timed out waiting 1800000ms for send operation to complete
Traceback (most recent call last):
  File "main.py", line 212, in <module>
    main(args)
  File "main.py", line 185, in main
    args.clip_max_norm)
  File "E:\lyx\VisTR-master\engine.py", line 49, in train_one_epoch
    losses.backward()
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\autograd\__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: [..\third_party\gloo\gloo\transport\uv\unbound_buffer.cc:103] Timed out waiting 1800000ms for send operation to complete
Traceback (most recent call last):
  File "main.py", line 212, in <module>
    main(args)
  File "main.py", line 185, in main
    args.clip_max_norm)
  File "E:\lyx\VisTR-master\engine.py", line 49, in train_one_epoch
    losses.backward()
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\autograd\__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: [..\third_party\gloo\gloo\transport\uv\unbound_buffer.cc:67] Timed out waiting 1800000ms for recv operation to complete
Traceback (most recent call last):
  File "main.py", line 212, in <module>
    main(args)
  File "main.py", line 185, in main
    args.clip_max_norm)
  File "E:\lyx\VisTR-master\engine.py", line 49, in train_one_epoch
    losses.backward()
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\autograd\__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: [..\third_party\gloo\gloo\transport\uv\unbound_buffer.cc:67] Timed out waiting 1800000ms for recv operation to complete

I am using PyTorch 1.8.0 and CUDA 10.2. How can I solve this error? Do I need to adjust any parameters? Thanks in advance.
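
For reading the log: the `1800000ms` in each gloo error is 30 minutes, which appears to match the default `timeout` of `torch.distributed.init_process_group` in PyTorch 1.8. The sketch below only does that conversion with the standard library. The helper name `parse_gloo_timeout` is hypothetical and not part of VisTR or PyTorch, and whether raising the timeout would actually resolve the hang is an assumption, not something the log confirms.

```python
import re
from datetime import timedelta

# One of the error lines copied verbatim from the log above.
ERROR_LINE = (
    r"RuntimeError: [..\third_party\gloo\gloo\transport\uv\unbound_buffer.cc:67] "
    "Timed out waiting 1800000ms for recv operation to complete"
)


def parse_gloo_timeout(line: str) -> timedelta:
    """Hypothetical helper: extract the timeout reported in a gloo error line."""
    match = re.search(r"Timed out waiting (\d+)ms", line)
    if match is None:
        raise ValueError("no gloo timeout found in line")
    return timedelta(milliseconds=int(match.group(1)))


timeout = parse_gloo_timeout(ERROR_LINE)
print(timeout)  # 0:30:00

# Assumption: a longer timedelta could be passed as the `timeout` argument of
# torch.distributed.init_process_group when debugging a slow first iteration.
longer_timeout = timeout * 2
print(longer_timeout)  # 1:00:00
```

A longer timeout would only hide a slow step, so it is probably worth checking first whether all eight processes reach `losses.backward()` at all.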