V2AI / Det3D

World's first general-purpose 3D object detection codebase.
https://arxiv.org/abs/1908.09492
Apache License 2.0

Unexpected endless waiting #106

Closed Chi-Zaozao closed 4 years ago

Chi-Zaozao commented 4 years ago

When I use 4 GPUs to train a model, the training process suddenly stalls waiting for something. It does nothing but keeps occupying the GPU resources. When I forcibly stop the process, it shows:

File "/opt/conda/lib/python3.6/runpy.py" , line 193, in _run_module_as_main
  "__main__", mod_spec)
File "/opt/conda/lib/python3.6/runpy.py" , line 85, in _run_code
  exec(code, run_globals)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py" , line 235, in <module>
  main()
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py" , line 228, in main
  process.wait()
File "/opt/conda/lib/python3.6/subprocess.py" , line 1477, in wait
  (pid, sts) = self._try_wait(0)
File "/opt/conda/lib/python3.6/subprocess.py", line 1424, in _try_wait
  (pid, sts) = os.waitpid(self.pid, wait_flags)

Even after half a day, the process does nothing but wait. Can you explain this? How can I continue the training process or avoid the endless waiting?
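
Not a fix in itself, but one way to make this kind of stall easier to diagnose is to ask NCCL to log its progress and give the process group a finite timeout, so a worker stuck in a collective fails loudly instead of waiting forever. A minimal sketch of the idea, assuming the training entry point can be edited; the environment variables and the timeout argument are standard PyTorch/NCCL options, not anything specific to Det3D:

import argparse
import datetime
import os

import torch
import torch.distributed as dist

# Ask NCCL to log its progress; a stalled collective then shows up in the worker output.
os.environ.setdefault("NCCL_DEBUG", "INFO")
# Make NCCL respect the timeout passed to init_process_group below.
os.environ.setdefault("NCCL_BLOCKING_WAIT", "1")

def init_distributed():
    # torch.distributed.launch (PyTorch 1.4) passes --local_rank to every worker
    # and exports MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE in its environment.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args, _ = parser.parse_known_args()

    torch.cuda.set_device(args.local_rank)
    # With a finite timeout a stuck worker raises an error instead of hanging,
    # which at least shows where the training stalls.
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        timeout=datetime.timedelta(minutes=30),
    )
    return args.local_rank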

poodarchu commented 4 years ago

can you provide more logs?

Chi-Zaozao commented 4 years ago

The logs seem fine, but my AP is always zero on my own dataset. Can you give me some tips? (A quick sanity check is sketched after the log below.)

2020-06-01 06:54:04,112 - INFO - Start running, host: root@d06f7d837d96, work_dir: /workspace/det3d_requirement/Det3D/data/Outputs/Det3D_Outputs/SECOND_lucky_3d_medium_20200601-065347
2020-06-01 06:54:04,112 - INFO - workflow: [('train', 4), ('val', 1)], max: 20000 epochs
2020-06-01 06:54:56,282 - INFO - Epoch [1/20000][5/9]   lr: 0.00042, eta: 21 days, 17:00:39, time: 10.421, data_time: 7.979, transfer_time: 0.036, forward_time: 2.248, loss_parse_time: 0.000 memory: 2604, 
2020-06-01 06:54:56,282 - INFO - task : ['concealed'], loss: 4037.2042, cls_pos_loss: 0.0285, cls_neg_loss: 4034.4103, dir_loss_reduced: 0.3619, cls_loss_reduced: 4034.4386, loc_loss_reduced: 2.6931, loc_loss_elem: ['0.1202', '0.1082', '0.2176', '0.3811', '0.4084', '0.0097', '0.1014'], num_pos: 106.4000, num_neg: 70293.6000

2020-06-01 06:56:13,459 - INFO - Epoch [2/20000][5/9]   lr: 0.00042, eta: 15 days, 12:36:19, time: 10.447, data_time: 8.305, transfer_time: 0.039, forward_time: 1.941, loss_parse_time: 0.000 memory: 2684, 
2020-06-01 06:56:13,460 - INFO - task : ['concealed'], loss: 4726.4017, cls_pos_loss: 0.0230, cls_neg_loss: 4724.0129, dir_loss_reduced: 0.2736, cls_loss_reduced: 4724.0360, loc_loss_reduced: 2.3110, loc_loss_elem: ['0.1047', '0.1210', '0.1604', '0.3117', '0.3872', '0.0043', '0.0662'], num_pos: 84.4000, num_neg: 70315.6000

2020-06-01 06:57:31,074 - INFO - Epoch [3/20000][5/9]   lr: 0.00042, eta: 14 days, 6:50:41, time: 10.678, data_time: 9.912, transfer_time: 0.042, forward_time: 0.564, loss_parse_time: 0.000 memory: 2776, 
2020-06-01 06:57:31,075 - INFO - task : ['concealed'], loss: 3978.9130, cls_pos_loss: 0.0284, cls_neg_loss: 3976.3259, dir_loss_reduced: 0.3242, cls_loss_reduced: 3976.3543, loc_loss_reduced: 2.4938, loc_loss_elem: ['0.0826', '0.1415', '0.1897', '0.3005', '0.4440', '0.0050', '0.0836'], num_pos: 16.4000, num_neg: 70381.2000

2020-06-01 06:58:49,584 - INFO - Epoch [4/20000][5/9]   lr: 0.00042, eta: 13 days, 19:46:11, time: 10.928, data_time: 9.460, transfer_time: 0.033, forward_time: 1.275, loss_parse_time: 0.000 memory: 2776, 
2020-06-01 06:58:49,584 - INFO - task : ['concealed'], loss: 3282.1803, cls_pos_loss: 0.0338, cls_neg_loss: 3278.9272, dir_loss_reduced: 0.3941, cls_loss_reduced: 3278.9610, loc_loss_reduced: 3.1404, loc_loss_elem: ['0.1287', '0.1504', '0.2143', '0.4332', '0.5010', '0.0041', '0.1386'], num_pos: 132.4000, num_neg: 70266.4000

2020-06-01 06:59:13,164 - INFO - work dir: /workspace/det3d_requirement/Det3D/data/Outputs/Det3D_Outputs/SECOND_lucky_3d_medium_20200601-065347
2020-06-01 06:59:29,393 - INFO - 

2020-06-01 06:59:29,394 - INFO - Evaluation official: concealed AP(Average Precision)@0.01, 0.02, 0.03:
bbox AP:0.00, 0.00, 0.00
bev  AP:0.00, 0.00, 0.00
3d   AP:0.00, 0.00, 0.00
concealed AP(Average Precision)@0.01, 0.01, 0.01:
bbox AP:0.00, 0.00, 0.00
bev  AP:0.00, 0.00, 0.00
3d   AP:0.00, 0.00, 0.00

2020-06-01 06:59:29,394 - INFO - Evaluation coco: concealed coco AP@0.05:0.10:0.95:
bbox AP:0.00, 0.00, 0.00
bev  AP:0.00, 0.00, 0.00
3d   AP:0.00, 0.00, 0.00

2020-06-01 06:59:29,394 - INFO - Epoch(val) [4][2]  
2020-06-01 06:59:29,394 - INFO - task : ['concealed']

2020-06-01 07:00:21,357 - INFO - Epoch [5/20000][5/9]   lr: 0.00042, eta: 13 days, 10:15:00, time: 10.387, data_time: 10.068, transfer_time: 0.044, forward_time: 0.099, loss_parse_time: 0.000 memory: 2858, 
2020-06-01 07:00:21,358 - INFO - task : ['concealed'], loss: 1665.7547, cls_pos_loss: 0.0535, cls_neg_loss: 1661.6248, dir_loss_reduced: 0.5251, cls_loss_reduced: 1661.6783, loc_loss_reduced: 3.9714, loc_loss_elem: ['0.1673', '0.2135', '0.2520', '0.5354', '0.6647', '0.0105', '0.1423'], num_pos: 453.6000, num_neg: 69946.4000
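
A quick way to narrow a zero-AP result down is to check whether the detector emits any boxes at all before evaluation: no boxes points at the anchors/targets or the training itself, while plenty of boxes with zero AP points at the evaluation thresholds or label names. A hypothetical sketch of such a check; the "scores" key is an assumption about the prediction dicts, not a confirmed Det3D interface:

# Hypothetical sanity check: `predictions` is assumed to be a list of per-sample
# dicts produced by the detector on the validation split.
def summarize_predictions(predictions, score_threshold=0.1):
    total = 0
    kept = 0
    for det in predictions:
        scores = det["scores"]  # assumed key name
        total += len(scores)
        kept += int((scores > score_threshold).sum())
    print(f"{total} boxes predicted, {kept} above score {score_threshold:.2f}")
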
poodarchu commented 4 years ago

Can this bug be consistently reproduced?

idontlikelongname commented 4 years ago

Same problem; it happened at the beginning of training.

idontlikelongname commented 4 years ago

> When I use 4 GPUs to train a model, the training process suddenly stalls waiting for something. It does nothing but keeps occupying the GPU resources. Even after half a day, the process does nothing but wait. How can I continue the training process or avoid the endless waiting?

did you solve this problem?

Chi-Zaozao commented 4 years ago

I didn't. Maybe it's the fault of the Jupyter notebook I was using to launch training.

idontlikelongname commented 4 years ago

> I didn't. Maybe it's the fault of the Jupyter notebook I was using to launch training.

I fixed it. It may be a bug involving PyTorch 1.4 and NCCL; adding the parameters "--nnodes=1 --node_rank=0" to train.sh solved it (a quick check is sketched after the command):

python -m torch.distributed.launch \
    --nproc_per_node=10 \
    --nnodes=1 \
    --node_rank=0 \
    ./tools/train.py \
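
It is not obvious why spelling out the launcher's defaults helps, so if the hang ever reappears, a quick check is to print the rendezvous environment at the top of ./tools/train.py. torch.distributed.launch exports these variables to every worker, and if any of them disagree across processes, init_process_group blocks indefinitely. A minimal, hypothetical debugging snippet (not part of the repo):

import os

# Rendezvous settings torch.distributed.launch exports to every worker process.
for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE"):
    print(f"[pid {os.getpid()}] {key} = {os.environ.get(key)}")
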
Chi-Zaozao commented 4 years ago

It works! Thank you very much!