TRI-ML / packnet-sfm

TRI-ML Monocular Depth Estimation Repository

https://tri-ml.github.io/packnet-sfm/

MIT License

1.24k stars, 243 forks

Distributed training #134

Closed DFLyan closed 3 years ago

DFLyan commented 3 years ago

I want to train the network on several GPUs. I have seen the horovod module in the code, but it does not work for me. Do I need to set another parameter to enable it, or do I need to write the distributed training code myself?

VitorGuizilini-TRI commented 3 years ago

The horovod distributed training should work. How are you running it?

DFLyan commented 3 years ago

Thank you for your response. I have solved the problem. I had not used the horovod module before, so I did not know that I had to put "horovodrun -np 4 -H localhost:4" in front of "python train.py".
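
For anyone landing here later, the fix above amounts to prefixing the training command with the horovodrun launcher. Below is a minimal sketch that assembles that command for a single machine. The helper name `build_horovodrun_command` and its parameters are hypothetical and not part of packnet-sfm; only the `-np` and `-H` flags and the `python train.py` call come from this thread. Any arguments your own training script needs (such as a config file) are assumptions you would pass through `script_args`.

```python
import shlex


def build_horovodrun_command(num_gpus, script="train.py", script_args=(), host="localhost"):
    """Return the argument list for a single-host horovodrun launch.

    Hypothetical helper for illustration; it only builds the command and
    does not start any training process.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "horovodrun",
        "-np", str(num_gpus),        # total number of processes to start
        "-H", f"{host}:{num_gpus}",  # host name and number of slots on it
        "python", script, *script_args,
    ]


if __name__ == "__main__":
    cmd = build_horovodrun_command(4)
    # Print the shell command so it can be inspected or copied.
    print(shlex.join(cmd))
    # To actually launch it (requires Horovod to be installed):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

With `num_gpus=4` this prints the same command as in the comment above: `horovodrun -np 4 -H localhost:4 python train.py`. Keeping the process count and the slot count tied to one value avoids the two getting out of sync on a single machine.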