apple2373 opened this issue 5 years ago
What operating system are you using? Most Linux distributions have OpenMPI available from the package manager. For example, on Ubuntu it's as easy as apt-get install openmpi-bin and pip install mpi4py. Then you can run mpirun -np 2 python train_multi.py --model faster_rcnn_fpn_resnet50 if you have two GPUs in your machine. I'd like to know what makes installing ChainerMN seem difficult to you.
I use a university server and don't have root, so I simply can't run apt-get.
Thanks! I'll try later.
FYI, I actually tried conda install openmpi before, but it didn't work.
Well, to be clear, I am NOT asking for help with installing MPI or setting up ChainerMN. I would do that in a ChainerMN issue or on the Chainer Slack if I ever wanted to. This issue is to suggest removing the ChainerMN requirement, because it is not essential for FPN training.
Also, the reason I don't want to use MPI is not only that I can't set it up; I know I could if I spent more time and compiled it from source. It's that I won't be able to use multiple GPUs most of the time anyway, due to the limited number of GPUs in my lab, so MPI would just introduce unnecessary overhead when used with a single GPU.
Anyway, if the ChainerCV team decides to keep the ChainerMN dependency, that's fine with me, and you can close the issue. This is just a suggestion from one user's point of view.
For some examples, we provide both a version without ChainerMN and a version with ChainerMN (e.g. examples/ssd/train.py vs. examples/ssd/train_multi.py). In the case of FPN, my concern is that we cannot get a large enough batch size with a single GPU, so the performance will be worse.
I think we have two options.
Note that we face the same problem even if we provide a unified script that supports both w/o ChainerMN and w/ ChainerMN.
Thanks for the comment! I am in favor of option 1. Not everyone has the same environment, and to me, it's acceptable that users have to adjust command line arguments (but not the code) depending on their own situation. Also, I think GPU memory will increase in the future, so the problem will be solved in the long run.
I'd like to suggest another option; maybe you can call it option 3.
How about using gradient accumulation to emulate a large batch size in the single-GPU case? I asked on the Chainer Slack and confirmed that it's possible (see the sketch below).
https://chainer.slack.com/archives/C0LC5A6C9/p1555395952007300 https://chainer.slack.com/archives/C0LC5A6C9/p1555396176007900
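To illustrate, here is a minimal, runnable toy sketch of the accumulation pattern I have in mind. A plain linear link stands in for the FPN train chain, and accum_steps and the toy dataset are made up for the example:

```python
import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

# Toy stand-ins; in the real example the FPN train chain and its dataset
# iterator would take their place.
model = L.Linear(10, 1)
data = [(np.random.rand(10).astype(np.float32),
         np.random.rand(1).astype(np.float32)) for _ in range(64)]
train_iter = chainer.iterators.SerialIterator(data, batch_size=2, repeat=False)

accum_steps = 8  # emulated batch size = accum_steps * per-iteration batch size
optimizer = chainer.optimizers.MomentumSGD(lr=0.02)
optimizer.setup(model)

model.cleargrads()
for step, batch in enumerate(train_iter):
    x, t = chainer.dataset.concat_examples(batch)
    loss = F.mean_squared_error(model(x), t) / accum_steps
    loss.backward()            # gradients keep accumulating until cleargrads()
    if (step + 1) % accum_steps == 0:
        optimizer.update()     # apply the accumulated gradients
        model.cleargrads()
```

With this pattern the effective batch size is accum_steps times the per-iteration batch size, at the cost of proportionally more forward/backward passes per parameter update.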
I mentioned here https://github.com/chainer/chainercv/issues/735#issuecomment-479616802 before that the FPN detector currently depends on ChainerMN. Unfortunately, ChainerMN is not easy to install for those (including me) who are not familiar with server administration, so I had to manually remove the dependency...
How about making ChainerMN optional?
I've attached my ad-hoc code below, but I think you could provide something like it officially.
My ad-hoc code:
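(A rough sketch of the idea, not the exact code I attached; the helper setup_comm, the 'pure_nccl' communicator choice, and the use_mpi flag are only illustrative, not existing ChainerCV APIs.)

```python
# Make the chainermn import optional and fall back to plain single-GPU
# training when the package is not available.
try:
    import chainermn
    _CHAINERMN_AVAILABLE = True
except ImportError:
    chainermn = None
    _CHAINERMN_AVAILABLE = False


def setup_comm(use_mpi, gpu=0):
    """Return (comm, device); comm is None in the single-GPU fallback."""
    if use_mpi and _CHAINERMN_AVAILABLE:
        comm = chainermn.create_communicator('pure_nccl')
        return comm, comm.intra_rank
    return None, gpu
```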
...and then the comm-dependent parts of the training script branch on whether the communicator exists:
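(Again only a sketch: args, model, and train are assumed to come from the usual train_multi.py setup, args.use_mpi is a hypothetical flag, and details of the real script are omitted.)

```python
# Continues the sketch above; everything ChainerMN-specific is guarded.
import chainer
from chainer import training
from chainer.training import extensions

comm, device = setup_comm(args.use_mpi)

optimizer = chainer.optimizers.MomentumSGD(lr=0.02)
if comm is not None:
    optimizer = chainermn.create_multi_node_optimizer(optimizer, comm)
    train = chainermn.scatter_dataset(train, comm, shuffle=True)
optimizer.setup(model)

train_iter = chainer.iterators.MultiprocessIterator(train, args.batchsize)
updater = training.updaters.StandardUpdater(train_iter, optimizer, device=device)
trainer = training.Trainer(updater, (args.iteration, 'iteration'), out=args.out)

# Reporting runs only on rank 0 with ChainerMN, and always in the
# single-GPU fallback.
if comm is None or comm.rank == 0:
    trainer.extend(extensions.LogReport())
    trainer.extend(extensions.ProgressBar())

trainer.run()
```

The single-GPU fallback could then be launched with plain python instead of mpirun, while the multi-GPU path stays as it is today.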