cfzd / Ultra-Fast-Lane-Detection-v2

Ultra Fast Deep Lane Detection With Hybrid Anchor Driven Ordinal Classification (TPAMI 2022)
MIT License

DDP training hangs at "Evaluating the results...", then fails with "Watchdog caught collective operation timeout" #119

Open licc0431 opened 1 year ago

licc0431 commented 1 year ago

[Environment] CUDA 11.7, torch 1.13.1, T4 GPUs, Ubuntu 16.04, nvidia-dali-cuda110 1.25.0, NCCL 2.14.3, GCC 7.5

[Command] python -m torch.distributed.launch --nproc_per_node=4 train.py configs/culane_res18.py

[Terminal output]
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023/04/27 15:28:31] start training...
Config (path: configs/culane_res18.py): {'dataset': 'CULane', 'data_root': '/home/lichengchao/dataset/lane-detection/CULane', 'epoch': 50, 'batch_size': 32, 'optimizer': 'SGD', 'learning_rate': 0.05, 'weight_decay': 0.0001, 'momentum': 0.9, 'scheduler': 'multi', 'steps': [25, 38], 'gamma': 0.1, 'warmup': 'linear', 'warmup_iters': 695, 'use_aux': False, 'griding_num': 200, 'backbone': '18', 'sim_loss_w': 0.0, 'shp_loss_w': 0.0, 'note': '', 'log_path': './runs', 'finetune': None, 'resume': None, 'test_model': '', 'test_work_dir': './runs/20230427_152819_lr_5e-02_b_32', 'tta': True, 'num_lanes': 4, 'var_loss_power': 2.0, 'auto_backup': True, 'num_row': 18, 'num_col': 41, 'train_width': 1600, 'train_height': 320, 'num_cell_row': 200, 'num_cell_col': 100, 'mean_loss_w': 0.05, 'fc_norm': True, 'crop_ratio': 0.6, 'row_anchor': array([0.42 , 0.45411765, 0.48823529, 0.52235294, 0.55647059, 0.59058824, 0.62470588, 0.65882353, 0.69294118, 0.72705882, 0.76117647, 0.79529412, 0.82941176, 0.86352941, 0.89764706, 0.93176471, 0.96588235, 1. ]), 'col_anchor': array([0. , 0.025, 0.05 , 0.075, 0.1 , 0.125, 0.15 , 0.175, 0.2 , 0.225, 0.25 , 0.275, 0.3 , 0.325, 0.35 , 0.375, 0.4 , 0.425, 0.45 , 0.475, 0.5 , 0.525, 0.55 , 0.575, 0.6 , 0.625, 0.65 , 0.675, 0.7 , 0.725, 0.75 , 0.775, 0.8 , 0.825, 0.85 , 0.875, 0.9 , 0.925, 0.95 , 0.975, 1. ]), 'distributed': True}
loading cached data
cached data loaded
/home/lichengchao/anaconda3/envs/lane-det/lib/python3.7/site-packages/torchvision/models/_utils.py:209: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  f"The parameter '{pretrained_param}' is deprecated since 0.13 and may be removed in the future, "
/home/lichengchao/anaconda3/envs/lane-det/lib/python3.7/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=ResNet18_Weights.IMAGENET1K_V1. You can also use weights=ResNet18_Weights.DEFAULT to get the most up-to-date weights.
  warnings.warn(msg)
/home/lichengchao/anaconda3/envs/lane-det/lib/python3.7/site-packages/torchvision/models/_utils.py:209: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  f"The parameter '{pretrained_param}' is deprecated since 0.13 and may be removed in the future, "
/home/lichengchao/anaconda3/envs/lane-det/lib/python3.7/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=ResNet18_Weights.IMAGENET1K_V1. You can also use weights=ResNet18_Weights.DEFAULT to get the most up-to-date weights.
  warnings.warn(msg)
/home/lichengchao/anaconda3/envs/lane-det/lib/python3.7/site-packages/torchvision/models/_utils.py:209: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  f"The parameter '{pretrained_param}' is deprecated since 0.13 and may be removed in the future, "
/home/lichengchao/anaconda3/envs/lane-det/lib/python3.7/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=ResNet18_Weights.IMAGENET1K_V1. You can also use weights=ResNet18_Weights.DEFAULT to get the most up-to-date weights.
  warnings.warn(msg)

.....

100%|██████████| 695/695 [10:46<00:00, 1.07it/s, col_top1=0.027, col_top2=0.073, col_top3=0.115, ext_col=0.910, ext_row=0.885, loss=6.641, top1=0.043, top2=0.135, top3=0.206]

......

100%|██████████| 1084/1084 [03:55<00:00, 4.59it/s]

------------Configuration---------
anno_dir: /home/lichengchao/dataset/lane-detection/CULane/
detect_dir: ./runs/20230427_152819_lr_5e-02_b_32/culane_eval_tmp/
im_dir: /home/lichengchao/dataset/lane-detection/CULane/
list_im_file: /home/lichengchao/dataset/lane-detection/CULane/list/test_split/test0_normal.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590

Evaluating the results...
tp: 18690 fp: 14589 fn: 14087
finished process file
precision: 0.561615
recall: 0.570217
Fmeasure: 0.565883
------------Configuration---------
anno_dir: /home/lichengchao/dataset/lane-detection/CULane/
detect_dir: ./runs/20230427_152819_lr_5e-02_b_32/culane_eval_tmp/
im_dir: /home/lichengchao/dataset/lane-detection/CULane/
list_im_file: /home/lichengchao/dataset/lane-detection/CULane/list/test_split/test1_crowd.txt
width_lane: 30
iou_threshold: 0.5
im_width: 1640
im_height: 590
x_factor: 1
y_factor: 1

Evaluating the results...
[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4178, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803391 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4178, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803400 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4178, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803405 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 36491 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 36492) of binary: /home/lichengchao/anaconda3/envs/lane-det/bin/python
Traceback (most recent call last):
  File "/home/lichengchao/anaconda3/envs/lane-det/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/lichengchao/anaconda3/envs/lane-det/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/lichengchao/anaconda3/envs/lane-det/lib/python3.7/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/lichengchao/anaconda3/envs/lane-det/lib/python3.7/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/lichengchao/anaconda3/envs/lane-det/lib/python3.7/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/lichengchao/anaconda3/envs/lane-det/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/home/lichengchao/anaconda3/envs/lane-det/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lichengchao/anaconda3/envs/lane-det/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: train.py FAILED

Failures:
[1]:
  time : 2023-04-27_16:14:15
  host : ubuntu-2288H-V5
  rank : 2 (local_rank: 2)
  exitcode : -6 (pid: 36493)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 36493
[2]:
  time : 2023-04-27_16:14:15
  host : ubuntu-2288H-V5
  rank : 3 (local_rank: 3)
  exitcode : -6 (pid: 36494)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 36494
Root Cause (first observed failure):
[0]:
  time : 2023-04-27_16:14:15
  host : ubuntu-2288H-V5
  rank : 1 (local_rank: 1)
  exitcode : -6 (pid: 36492)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 36492

tp: 11597 fp: 15956 fn: 16406
finished process file
precision: 0.420898
recall: 0.414134
Fmeasure: 0.417489

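[What I tried] I went through similar error reports on Stack Overflow and GitHub (launch/run issues, NCCL issues, DDP issues, torch version issues, and so on), but none of the suggested fixes worked. I spent a whole day on this without resolving it.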

Has anyone run into the same problem? Any guidance would be appreciated. I'm online and waiting. Thanks in advance.
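For reference, the `Timeout(ms)=1800000` in the log is 30 minutes, which matches PyTorch's default process-group timeout. Only ranks 1 to 3 report the timeout, and metrics for the second split still appear after the crash, so one plausible reading (an inference from the log, not something confirmed in this thread) is that rank 0 was still evaluating while the other ranks waited in an all-reduce. Below is a minimal Python sketch of lengthening that window through the `timeout` argument of `torch.distributed.init_process_group`. The helper names `collective_timeout` and `init_distributed` are hypothetical, the safety factor is arbitrary, and where the repository initializes its process group is an assumption.

```python
import datetime

# Value reported by the NCCL watchdog in the log (30 minutes).
DEFAULT_TIMEOUT_MS = 1_800_000


def ms_to_minutes(ms):
    """Convert milliseconds to minutes."""
    return ms / 60_000


def collective_timeout(expected_eval_minutes, safety_factor=2.0):
    """Hypothetical helper: choose a timeout longer than the slowest
    phase during which some ranks sit idle in a collective."""
    minutes = max(ms_to_minutes(DEFAULT_TIMEOUT_MS),
                  expected_eval_minutes * safety_factor)
    return datetime.timedelta(minutes=minutes)


def init_distributed(expected_eval_minutes):
    """Sketch only: assumes the script is started by a torch.distributed
    launcher, so rank and world size come from environment variables."""
    import torch.distributed as dist  # imported lazily; needs PyTorch

    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        timeout=collective_timeout(expected_eval_minutes),
    )
```

A longer timeout would only give a slow evaluation more time to finish. It would not help if a rank is genuinely deadlocked.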

licc0431 commented 1 year ago

I also tested torch 1.8.2+cu102 and hit the same problem.

cfzd commented 1 year ago

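@licc0431 This looks very strange, because the evaluation of the first scenario (normal) had already finished. Could you run it on a single GPU and see whether it still fails?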

licc0431 commented 1 year ago

It works on a single GPU.

cfzd commented 1 year ago

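@licc0431 If nothing else works, just turn off evaluation: run only the training code and evaluate after training has finished. You may need to tweak the model-saving logic slightly so that a checkpoint is saved every epoch.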
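The suggestion above (train without evaluation and keep a checkpoint from every epoch) could look roughly like the Python sketch below. It is not the repository's actual saving code: `save_every_epoch`, `checkpoint_path`, and the `ep007.pth` naming scheme are hypothetical, and reading the rank from the `RANK` environment variable assumes the job is started with a torch.distributed launcher that sets it.

```python
import os


def is_main_process(rank=None):
    """True for rank 0. If rank is None, read the RANK environment
    variable (assumed to be set by the launcher); default to 0."""
    if rank is None:
        rank = int(os.environ.get("RANK", "0"))
    return rank == 0


def checkpoint_path(work_dir, epoch):
    """Hypothetical naming scheme: one file per epoch."""
    return os.path.join(work_dir, f"ep{epoch:03d}.pth")


def save_every_epoch(model, optimizer, epoch, work_dir, rank=None):
    """Save a checkpoint from rank 0 only; other ranks return None."""
    if not is_main_process(rank):
        return None
    import torch  # imported lazily; needs PyTorch

    os.makedirs(work_dir, exist_ok=True)
    # Unwrap DistributedDataParallel so keys carry no "module." prefix.
    to_save = model.module if hasattr(model, "module") else model
    path = checkpoint_path(work_dir, epoch)
    torch.save(
        {
            "model": to_save.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": epoch,
        },
        path,
    )
    return path
```

Calling `save_every_epoch(net, optimizer, epoch, work_dir)` at the end of each epoch in place of the evaluation step would leave one file per epoch, which can then be evaluated one at a time on a single GPU, where the evaluation is reported to work.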

licc0431 commented 1 year ago

Got it, thanks.