Open Auroralyxa opened 2 years ago
Hi, I am training the model on 8 P100 GPUs (16 GB each) and I get the error below (partial log):
number of params: 75833633
loading annotations into memory...
Done (t=20.13s)
creating index...
index created!
Start training
Epoch: [0] [ 0/7731] eta: 15 days, 22:29:09 lr: 0.000100 class_error: 88.02 loss: 79.9960 (79.9960) loss_bbox: 7.1450 (7.1450) loss_bbox_0: 6.8492 (6.8492) loss_bbox_1: 6.9822 (6.9822) loss_bbox_2: 7.0482 (7.0482) loss_bbox_3: 7.0655 (7.0655) loss_bbox_4: 7.0992 (7.0992) loss_ce: 3.8196 (3.8196) loss_ce_0: 3.8680 (3.8680) loss_ce_1: 3.7678 (3.7678) loss_ce_2: 3.8110 (3.8110) loss_ce_3: 3.8432 (3.8432) loss_ce_4: 3.8701 (3.8701) loss_dice: 0.7227 (0.7227) loss_giou: 2.3688 (2.3688) loss_giou_0: 2.3062 (2.3062) loss_giou_1: 2.3316 (2.3316) loss_giou_2: 2.3469 (2.3469) loss_giou_3: 2.3410 (2.3410) loss_giou_4: 2.3517 (2.3517) loss_mask: 0.0581 (0.0581) cardinality_error_unscaled: 306.0000 (306.0000) cardinality_error_0_unscaled: 306.0000 (306.0000) cardinality_error_1_unscaled: 306.0000 (306.0000) cardinality_error_2_unscaled: 306.0000 (306.0000) cardinality_error_3_unscaled: 306.0000 (306.0000) cardinality_error_4_unscaled: 306.0000 (306.0000) class_error_unscaled: 88.0208 (88.0208) loss_bbox_unscaled: 1.4290 (1.4290) loss_bbox_0_unscaled: 1.3698 (1.3698) loss_bbox_1_unscaled: 1.3964 (1.3964) loss_bbox_2_unscaled: 1.4096 (1.4096) loss_bbox_3_unscaled: 1.4131 (1.4131) loss_bbox_4_unscaled: 1.4198 (1.4198) loss_ce_unscaled: 3.8196 (3.8196) loss_ce_0_unscaled: 3.8680 (3.8680) loss_ce_1_unscaled: 3.7678 (3.7678) loss_ce_2_unscaled: 3.8110 (3.8110) loss_ce_3_unscaled: 3.8432 (3.8432) loss_ce_4_unscaled: 3.8701 (3.8701) loss_dice_unscaled: 0.7227 (0.7227) loss_giou_unscaled: 1.1844 (1.1844) loss_giou_0_unscaled: 1.1531 (1.1531) loss_giou_1_unscaled: 1.1658 (1.1658) loss_giou_2_unscaled: 1.1734 (1.1734) loss_giou_3_unscaled: 1.1705 (1.1705) loss_giou_4_unscaled: 1.1759 (1.1759) loss_mask_unscaled: 0.0581 (0.0581) time: 178.1075 data: 138.0209 max mem: 3236

Traceback (most recent call last):
  File "main.py", line 212, in <module>
    main(args)
  File "main.py", line 185, in main
    args.clip_max_norm)
  File "E:\lyx\VisTR-master\engine.py", line 49, in train_one_epoch
    losses.backward()
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\autograd\__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: [..\third_party\gloo\gloo\transport\uv\unbound_buffer.cc:67] Timed out waiting 1800000ms for recv operation to complete

Traceback (most recent call last):
  File "main.py", line 212, in <module>
    main(args)
  File "main.py", line 185, in main
    args.clip_max_norm)
  File "E:\lyx\VisTR-master\engine.py", line 49, in train_one_epoch
    losses.backward()
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\autograd\__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: [..\third_party\gloo\gloo\transport\uv\unbound_buffer.cc:103] Timed out waiting 1800000ms for send operation to complete

Traceback (most recent call last):
  File "main.py", line 212, in <module>
    main(args)
  File "main.py", line 185, in main
    args.clip_max_norm)
  File "E:\lyx\VisTR-master\engine.py", line 49, in train_one_epoch
    losses.backward()
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\autograd\__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: [..\third_party\gloo\gloo\transport\uv\unbound_buffer.cc:103] Timed out waiting 1800000ms for send operation to complete

Traceback (most recent call last):
  File "main.py", line 212, in <module>
    main(args)
  File "main.py", line 185, in main
    args.clip_max_norm)
  File "E:\lyx\VisTR-master\engine.py", line 49, in train_one_epoch
    losses.backward()
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\autograd\__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: [..\third_party\gloo\gloo\transport\uv\unbound_buffer.cc:67] Timed out waiting 1800000ms for recv operation to complete

Traceback (most recent call last):
  File "main.py", line 212, in <module>
    main(args)
  File "main.py", line 185, in main
    args.clip_max_norm)
  File "E:\lyx\VisTR-master\engine.py", line 49, in train_one_epoch
    losses.backward()
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "E:\lyx\anaconda\envs\vistr\lib\site-packages\torch\autograd\__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: [..\third_party\gloo\gloo\transport\uv\unbound_buffer.cc:67] Timed out waiting 1800000ms for recv operation to complete
Environment: PyTorch 1.8.0, CUDA 10.2. How can I solve this error? Do I need to adjust any parameters? Thanks in advance.