Closed surfii3z closed 3 years ago
Hi all,
Has anyone experienced this Horovod deadlock problem?
I trained PackNet on the KITTI dataset. It worked up until epoch 43 and then hit this problem.
########################################################################################################################
### Config: configs.default_config -> configs.train_kitti_velsup.yaml
### Name: hearty-shadow-5 -> https://app.wandb.ai/surfii3z/packnet_sfm_kitti_thesis/runs/3ldjj669
########################################################################################################################
Epoch 43 | Avg.Loss 0.0716: 100%|████████████████████████████████████████████| 79648/79648 [57:57<00:00, 22.91 images/s]
KITTI_raw-eigen_val_files-velodyne: 100%|████████████████████████████████████████| 888/888 [00:25<00:00, 35.25 images/s]
KITTI_raw-eigen_test_files-velodyne: 100%|███████████████████████████████████████| 700/700 [00:19<00:00, 35.07 images/s]
[2021-05-07 17:08:23. 18965: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
|*********************************************************************************************|
| E: 44 BS: 8 - VelSupModel LR (Adam): Depth 1.00e-04 Pose 1.00e-04 |
|*********************************************************************************************|
| METRIC      | abs_rel | sqr_rel | rmse  | rmse_log | a1    | a2    | a3    |
|*********************************************************************************************|
| *** /data/datasets/KITTI_raw/data_splits/eigen_val_files.txt |
|*********************************************************************************************|
| DEPTH       | 0.087   | 0.886   | 4.198 | 0.171    | 0.918 | 0.965 | 0.981 |
| DEPTH_PP    | 0.085   | 0.866   | 4.131 | 0.169    | 0.920 | 0.966 | 0.982 |
| DEPTH_GT    | 0.091   | 0.879   | 4.195 | 0.166    | 0.921 | 0.967 | 0.983 |
| DEPTH_PP_GT | 0.089   | 0.855   | 4.120 | 0.164    | 0.923 | 0.968 | 0.983 |
|*********************************************************************************************|
| *** /data/datasets/KITTI_raw/data_splits/eigen_test_files.txt |
|*********************************************************************************************|
| DEPTH       | 0.125   | 0.971   | 5.124 | 0.214    | 0.839 | 0.945 | 0.976 |
| DEPTH_PP    | 0.123   | 0.942   | 5.051 | 0.212    | 0.841 | 0.946 | 0.977 |
| DEPTH_GT    | 0.124   | 0.931   | 4.946 | 0.201    | 0.859 | 0.953 | 0.980 |
| DEPTH_PP_GT | 0.123   | 0.899   | 4.866 | 0.199    | 0.861 | 0.954 | 0.980 |
|*********************************************************************************************|
| https://app.wandb.ai/surfii3z/packnet_sfm_kitti_thesis/runs/3ldjj669 hearty-shadow-5 |
|*********************************************************************************************|
Stalled ranks:
My system is a DGX Station with 4x Tesla V100 32 GB GPUs, running Ubuntu 18.04.
Any help would be appreciated.
Strangely, it gets stuck even when resuming from the checkpoint. I have to load the checkpoint into a new config to continue the training.
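For anyone reading the Horovod warning in the log: it describes a collective operation that only a subset of ranks joins, so the ranks that did join wait forever. Below is a minimal sketch of that failure mode using plain Python threads, with `threading.Barrier` standing in for a collective such as allreduce. This is not Horovod code, and `run_ranks`, `participating`, and `timeout` are hypothetical names chosen for illustration only.

```python
import threading


def run_ranks(num_ranks, participating, timeout=0.5):
    """Simulate one collective call across `num_ranks` workers.

    Only ranks listed in `participating` reach the collective.
    Returns the sorted list of ranks that waited and timed out.
    """
    # The barrier completes only if all `num_ranks` workers arrive.
    barrier = threading.Barrier(num_ranks)
    blocked = []
    lock = threading.Lock()

    def worker(rank):
        if rank not in participating:
            # This rank took a different code path and never submits.
            return
        try:
            barrier.wait(timeout=timeout)
        except threading.BrokenBarrierError:
            # The collective never completed for this rank.
            with lock:
                blocked.append(rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(blocked)


if __name__ == "__main__":
    print(run_ranks(4, {0, 1, 2, 3}))  # every rank joins, nobody is blocked
    print(run_ranks(4, {0, 1, 2}))     # rank 3 never joins, the others time out
```

In the sketch the waiting ranks give up after a timeout. In a real job they would keep waiting, which would look like the hang described here. Whether a rank-dependent code path is actually what happens in this run is only a guess, not a diagnosis.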