TRI-ML / packnet-sfm

TRI-ML Monocular Depth Estimation Repository
https://tri-ml.github.io/packnet-sfm/
MIT License

Horovod deadlock issue: training stuck at epoch 44 #137

Closed surfii3z closed 3 years ago

surfii3z commented 3 years ago

Hi all,

Has anyone experienced this Horovod deadlock problem?

I have been training PackNet on the KITTI dataset. It worked up through epoch 43 and then ran into this problem.

########################################################################################################################
### Config: configs.default_config -> configs.train_kitti_velsup.yaml
### Name: hearty-shadow-5 -> https://app.wandb.ai/surfii3z/packnet_sfm_kitti_thesis/runs/3ldjj669
########################################################################################################################
Epoch 43 | Avg.Loss 0.0716: 100%|████████████████████████████████████████████| 79648/79648 [57:57<00:00, 22.91 images/s]
KITTI_raw-eigen_val_files-velodyne: 100%|████████████████████████████████████████| 888/888 [00:25<00:00, 35.25 images/s]
KITTI_raw-eigen_test_files-velodyne: 100%|███████████████████████████████████████| 700/700 [00:19<00:00, 35.07 images/s]
[2021-05-07 17:08:23. 18965: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
|*********************************************************************************************|
| E: 44 BS: 8 - VelSupModel                           LR (Adam): Depth 1.00e-04 Pose 1.00e-04 |
|*********************************************************************************************|
|     METRIC     | abs_rel  | sqr_rel  |   rmse   | rmse_log |    a1    |    a2    |    a3    |
|*********************************************************************************************|
| *** /data/datasets/KITTI_raw/data_splits/eigen_val_files.txt                                |
|*********************************************************************************************|
| DEPTH          |  0.087   |  0.886   |  4.198   |  0.171   |  0.918   |  0.965   |  0.981   |
| DEPTH_PP       |  0.085   |  0.866   |  4.131   |  0.169   |  0.920   |  0.966   |  0.982   |
| DEPTH_GT       |  0.091   |  0.879   |  4.195   |  0.166   |  0.921   |  0.967   |  0.983   |
| DEPTH_PP_GT    |  0.089   |  0.855   |  4.120   |  0.164   |  0.923   |  0.968   |  0.983   |
|*********************************************************************************************|
| *** /data/datasets/KITTI_raw/data_splits/eigen_test_files.txt                               |
|*********************************************************************************************|
| DEPTH          |  0.125   |  0.971   |  5.124   |  0.214   |  0.839   |  0.945   |  0.976   |
| DEPTH_PP       |  0.123   |  0.942   |  5.051   |  0.212   |  0.841   |  0.946   |  0.977   |
| DEPTH_GT       |  0.124   |  0.931   |  4.946   |  0.201   |  0.859   |  0.953   |  0.980   |
| DEPTH_PP_GT    |  0.123   |  0.899   |  4.866   |  0.199   |  0.861   |  0.954   |  0.980   |
|*********************************************************************************************|
| https://app.wandb.ai/surfii3z/packnet_sfm_kitti_thesis/runs/3ldjj669                hearty-shadow-5 |
|*********************************************************************************************|
Stalled ranks: 

My system is a DGX Station with 4x Tesla V100 32 GB GPUs running Ubuntu 18.04.

Any help would be appreciated.

surfii3z commented 3 years ago

Strangely, it gets stuck even when continuing from the checkpoint. I have to load the checkpoint into a new config and continue the training from there.
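
For anyone unfamiliar with the stall warning in the log: it fires when some ranks have submitted a tensor for a collective operation and the remaining ranks never do. The snippet below is only a toy model of that condition in plain Python. It does not use Horovod, and it is not a diagnosis of this run. The function `find_stalled_ranks`, the tensor names, and the assumption of four ranks (one per GPU) are all hypothetical, made up for illustration.

```python
from collections import defaultdict


def find_stalled_ranks(submissions, world_size):
    """Toy model of the stall condition (hypothetical helper, not a Horovod API).

    submissions: dict mapping rank -> list of tensor names that rank submitted.
    Returns a dict mapping tensor name -> sorted list of ranks that have NOT
    submitted it. An empty dict means every tensor was submitted by all ranks.
    """
    ranks_by_tensor = defaultdict(set)
    for rank, names in submissions.items():
        for name in names:
            ranks_by_tensor[name].add(rank)

    all_ranks = set(range(world_size))
    stalled = {}
    for name, ranks in ranks_by_tensor.items():
        missing = all_ranks - ranks
        if missing:
            stalled[name] = sorted(missing)
    return stalled


# Assumed scenario: rank 0 enters a rank-dependent branch and submits an
# extra tensor that ranks 1-3 never submit, so the collective cannot complete.
submissions = {
    0: ["loss", "val_metric"],
    1: ["loss"],
    2: ["loss"],
    3: ["loss"],
}
print(find_stalled_ranks(submissions, world_size=4))
# → {'val_metric': [1, 2, 3]}
```

In a real job, a mismatch like this usually comes from code paths that differ between ranks (for example, a branch taken only on rank 0), which is why the warning asks whether all ranks are submitting the same tensors.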