TRI-ML / packnet-sfm

TRI-ML Monocular Depth Estimation Repository
https://tri-ml.github.io/packnet-sfm/
MIT License

How to perform multi-GPU training? #34

Closed. Mrils closed this issue 4 years ago

Mrils commented 4 years ago

Hello, I'm trying to train a model with VelSupModel, and I found that only a single GPU is working. How do I perform multi-GPU training? Thanks.

VitorGuizilini-TRI commented 4 years ago

When you say that it doesn't work, do you mean it doesn't converge or training doesn't even start? Our training pipeline by default uses multi-GPU, with Horovod, so there is no "single-GPU" training. If your problem is that it doesn't converge, you might want to look into changing learning rates, velocity supervision loss weight, batch sizes and other training parameters.

Mrils commented 4 years ago

I'm afraid not. When I use 'nvidia-smi' to monitor GPU memory usage, I find that only one GPU (id 0) is working, and when I set batch_size to 16, the GPU runs out of memory. I'm new to Horovod, so I am not sure if something is wrong with my configuration file. Also, I am not running in a Docker environment, but in an Anaconda virtual env. Here is my train.yaml:

    model:
        name: 'VelSupModel'
        optimizer:
            name: 'Adam'
            depth:
                lr: 0.0002
            pose:
                lr: 0.0002
        scheduler:
            name: 'StepLR'
            step_size: 30
            gamma: 0.5
        depth_net:
            name: 'DepthResNet'
            version: '50pt'
        pose_net:
            name: 'PoseNet'
            version: ''
        params:
            crop: 'garg'
            min_depth: 0.0
            max_depth: 80.0
    datasets:
        augmentation:
            image_shape: (192, 640)
        train:
            batch_size: 16
            dataset: ['KITTI']
            path: ['/datassd/datasets/KITTI_raw']
            split: ['data_splits/eigen_zhou_files.txt']
            depth_type: ['velodyne']
            repeat: [2]
        validation:
            dataset: ['KITTI']
            path: ['/datassd/datasets/KITTI_raw']
            split: ['data_splits/eigen_val_files.txt', 'data_splits/eigen_test_files.txt']
            depth_type: ['velodyne']
        test:
            dataset: ['KITTI']
            path: ['/datassd/datasets/KITTI_raw']
            split: ['data_splits/eigen_test_files.txt']
            depth_type: ['velodyne']

Mrils commented 4 years ago

Sorry, I have figured it out. Thank you.

Mrils commented 4 years ago

Another question: did you turn off data augmentation during training? I ask because prob=1.0 in the colorjitter_sample function. https://github.com/TRI-ML/packnet-sfm/blob/f824ffceba46ae1c621e1bf22a35634d8b39207c/packnet_sfm/datasets/augmentations.py#L197
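
A note on reading prob=1.0: if the function gates the jitter with a check of the form `random.random() < prob` (an assumption here; the linked line is the authority), then prob=1.0 would mean the augmentation is always applied rather than turned off. The sketch below illustrates that pattern with a hypothetical helper, `apply_with_prob`, which is not part of packnet-sfm.

```python
import random


def apply_with_prob(transform, sample, prob=1.0, rng=random):
    """Apply `transform` to `sample` with probability `prob`.

    Hypothetical helper for illustration only, not the repository's code.
    `rng.random()` returns a float in [0.0, 1.0), so:
      - prob=1.0 -> the check always passes (augmentation always on)
      - prob=0.0 -> the check never passes (augmentation off)
    """
    if rng.random() < prob:
        return transform(sample)
    return sample


def fake_jitter(x):
    """Stand-in for a color-jitter transform: just tags the sample."""
    return ("jittered", x)


if __name__ == "__main__":
    always = [apply_with_prob(fake_jitter, i, prob=1.0) for i in range(5)]
    never = [apply_with_prob(fake_jitter, i, prob=0.0) for i in range(5)]
    print(always)
    print(never)
```

Under this reading, disabling the augmentation would require prob=0.0 (or skipping the call), not prob=1.0.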

krishna-esrlabs commented 4 years ago

@Mrils Hi, could you please tell me how to run multi-GPU training? I haven't figured it out yet.

DRAEYE commented 4 years ago

Excuse me, may I ask how you dealt with this problem? On my computer, only one GPU is working.

jdriscoll319 commented 3 years ago

Was anyone able to figure this out?
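
The thread never records the fix, so what follows is only a hedged diagnostic, not the confirmed resolution. Horovod runs one process per GPU, and those processes are normally started by a launcher such as `horovodrun` or `mpirun`. A script started with plain `python` is therefore a single process and would be expected to use a single GPU, which matches the symptom described above. The sketch below checks the environment variables that Open MPI (`OMPI_COMM_WORLD_*`) and, to the best of my knowledge, Horovod's Gloo launcher (`HOROVOD_*`) set for each process. The function name `detect_launch` is hypothetical.

```python
import os

# (label, world-size var, rank var, local-rank var) per launcher.
# The OMPI_* names are set by Open MPI's mpirun; the HOROVOD_* names are
# assumed to be set by horovodrun's Gloo backend. Verify for your version.
LAUNCHER_VARS = [
    ("openmpi", "OMPI_COMM_WORLD_SIZE", "OMPI_COMM_WORLD_RANK",
     "OMPI_COMM_WORLD_LOCAL_RANK"),
    ("horovod-gloo", "HOROVOD_SIZE", "HOROVOD_RANK", "HOROVOD_LOCAL_RANK"),
]


def detect_launch(env=None):
    """Report how this process was launched, based on environment variables.

    Returns a dict with the launcher label (None if no launcher variables
    are found), the world size, the global rank, and the local rank.
    """
    env = os.environ if env is None else env
    for label, size_key, rank_key, local_key in LAUNCHER_VARS:
        if size_key in env:
            return {
                "launcher": label,
                "world_size": int(env[size_key]),
                "rank": int(env.get(rank_key, 0)),
                "local_rank": int(env.get(local_key, 0)),
            }
    # No launcher variables: most likely started with plain `python`.
    return {"launcher": None, "world_size": 1, "rank": 0, "local_rank": 0}


if __name__ == "__main__":
    info = detect_launch()
    print(info)
    if info["world_size"] == 1:
        print("Single process detected; only one GPU is likely to be used.")
```

If this reports a world size of 1, one thing to try is starting training through a launcher, for example `horovodrun -np 4 python scripts/train.py <config.yaml>`, where 4 is a placeholder for the number of GPUs. Note also that with one process per GPU, a configured batch size usually applies to each process separately, so the out-of-memory error at batch_size 16 may persist regardless of GPU count.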