Closed. Chi-Zaozao closed this issue 4 years ago.
Can you provide more logs?
The logs seem fine, but the AP is always zero on my own dataset. Can you give me some tips?
2020-06-01 06:54:04,112 - INFO - Start running, host: root@d06f7d837d96, work_dir: /workspace/det3d_requirement/Det3D/data/Outputs/Det3D_Outputs/SECOND_lucky_3d_medium_20200601-065347
2020-06-01 06:54:04,112 - INFO - workflow: [('train', 4), ('val', 1)], max: 20000 epochs
2020-06-01 06:54:56,282 - INFO - Epoch [1/20000][5/9] lr: 0.00042, eta: 21 days, 17:00:39, time: 10.421, data_time: 7.979, transfer_time: 0.036, forward_time: 2.248, loss_parse_time: 0.000 memory: 2604,
2020-06-01 06:54:56,282 - INFO - task : ['concealed'], loss: 4037.2042, cls_pos_loss: 0.0285, cls_neg_loss: 4034.4103, dir_loss_reduced: 0.3619, cls_loss_reduced: 4034.4386, loc_loss_reduced: 2.6931, loc_loss_elem: ['0.1202', '0.1082', '0.2176', '0.3811', '0.4084', '0.0097', '0.1014'], num_pos: 106.4000, num_neg: 70293.6000
2020-06-01 06:56:13,459 - INFO - Epoch [2/20000][5/9] lr: 0.00042, eta: 15 days, 12:36:19, time: 10.447, data_time: 8.305, transfer_time: 0.039, forward_time: 1.941, loss_parse_time: 0.000 memory: 2684,
2020-06-01 06:56:13,460 - INFO - task : ['concealed'], loss: 4726.4017, cls_pos_loss: 0.0230, cls_neg_loss: 4724.0129, dir_loss_reduced: 0.2736, cls_loss_reduced: 4724.0360, loc_loss_reduced: 2.3110, loc_loss_elem: ['0.1047', '0.1210', '0.1604', '0.3117', '0.3872', '0.0043', '0.0662'], num_pos: 84.4000, num_neg: 70315.6000
2020-06-01 06:57:31,074 - INFO - Epoch [3/20000][5/9] lr: 0.00042, eta: 14 days, 6:50:41, time: 10.678, data_time: 9.912, transfer_time: 0.042, forward_time: 0.564, loss_parse_time: 0.000 memory: 2776,
2020-06-01 06:57:31,075 - INFO - task : ['concealed'], loss: 3978.9130, cls_pos_loss: 0.0284, cls_neg_loss: 3976.3259, dir_loss_reduced: 0.3242, cls_loss_reduced: 3976.3543, loc_loss_reduced: 2.4938, loc_loss_elem: ['0.0826', '0.1415', '0.1897', '0.3005', '0.4440', '0.0050', '0.0836'], num_pos: 16.4000, num_neg: 70381.2000
2020-06-01 06:58:49,584 - INFO - Epoch [4/20000][5/9] lr: 0.00042, eta: 13 days, 19:46:11, time: 10.928, data_time: 9.460, transfer_time: 0.033, forward_time: 1.275, loss_parse_time: 0.000 memory: 2776,
2020-06-01 06:58:49,584 - INFO - task : ['concealed'], loss: 3282.1803, cls_pos_loss: 0.0338, cls_neg_loss: 3278.9272, dir_loss_reduced: 0.3941, cls_loss_reduced: 3278.9610, loc_loss_reduced: 3.1404, loc_loss_elem: ['0.1287', '0.1504', '0.2143', '0.4332', '0.5010', '0.0041', '0.1386'], num_pos: 132.4000, num_neg: 70266.4000
2020-06-01 06:59:13,164 - INFO - work dir: /workspace/det3d_requirement/Det3D/data/Outputs/Det3D_Outputs/SECOND_lucky_3d_medium_20200601-065347
2020-06-01 06:59:29,393 - INFO -
2020-06-01 06:59:29,394 - INFO - Evaluation official: concealed AP(Average Precision)@0.01, 0.02, 0.03:
bbox AP:0.00, 0.00, 0.00
bev AP:0.00, 0.00, 0.00
3d AP:0.00, 0.00, 0.00
concealed AP(Average Precision)@0.01, 0.01, 0.01:
bbox AP:0.00, 0.00, 0.00
bev AP:0.00, 0.00, 0.00
3d AP:0.00, 0.00, 0.00
2020-06-01 06:59:29,394 - INFO - Evaluation coco: concealed coco AP@0.05:0.10:0.95:
bbox AP:0.00, 0.00, 0.00
bev AP:0.00, 0.00, 0.00
3d AP:0.00, 0.00, 0.00
2020-06-01 06:59:29,394 - INFO - Epoch(val) [4][2]
2020-06-01 06:59:29,394 - INFO - task : ['concealed']
2020-06-01 07:00:21,357 - INFO - Epoch [5/20000][5/9] lr: 0.00042, eta: 13 days, 10:15:00, time: 10.387, data_time: 10.068, transfer_time: 0.044, forward_time: 0.099, loss_parse_time: 0.000 memory: 2858,
2020-06-01 07:00:21,358 - INFO - task : ['concealed'], loss: 1665.7547, cls_pos_loss: 0.0535, cls_neg_loss: 1661.6248, dir_loss_reduced: 0.5251, cls_loss_reduced: 1661.6783, loc_loss_reduced: 3.9714, loc_loss_elem: ['0.1673', '0.2135', '0.2520', '0.5354', '0.6647', '0.0105', '0.1423'], num_pos: 453.6000, num_neg: 69946.4000
Can you reproduce this bug consistently?
Same problem; it happened at the beginning of training.
When I use 4 GPUs to train a model, the training process suddenly stalls waiting for something. It does nothing but still occupies the GPU resources. When I forcibly stop the process, it shows:
File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 228, in main
    process.wait()
File "/opt/conda/lib/python3.6/subprocess.py", line 1477, in wait
    (pid, sts) = self._try_wait(0)
File "/opt/conda/lib/python3.6/subprocess.py", line 1424, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
Even after half a day, the process does nothing but wait. Can you explain this? How can I continue the training process or avoid the endless waiting?
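Not part of the original report, just a hedged diagnostic sketch: when a multi-GPU run stalls inside an NCCL collective like this, enabling NCCL's logging before launch usually reveals which rank is stuck and why. The environment variables below are standard NCCL knobs; the launch line mirrors a typical single-node Det3D invocation and is an assumption, not the reporter's exact command.

```shell
# Diagnostic sketch (assumption: single-node, 4-GPU Det3D training run).
# NCCL_DEBUG=INFO makes each rank print its NCCL setup and communication
# activity to stderr, which usually pinpoints a hung collective.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL   # verbose: log all NCCL subsystems

python -m torch.distributed.launch --nproc_per_node=4 ./tools/train.py
```

The extra output goes to each worker's stderr, so redirect or capture it per rank if you launch from a script.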
Did you solve this problem?
I didn't. Maybe it's the fault of Jupyter Notebook, which I was using to train the model.
Fixed it. It may be a bug between PyTorch 1.4 and NCCL; I fixed it by adding the parameters "--nnodes=1 --node_rank=0" to train.sh:
python -m torch.distributed.launch \
--nproc_per_node=10 \
--nnodes=1 \
--node_rank=0 \
./tools/train.py \
It works! Thank you very much!