Closed kurita236 closed 6 years ago
How about using examples/ssd/train_multi.py?
https://github.com/chainer/chainercv/blob/master/examples/ssd/train_multi.py
chainer.training.updaters.MultiprocessParallelUpdater requires NCCL. I want to run on Windows using chainer.training.updaters.ParallelUpdater.
First,
Sorry, I got confused. chainer.training.updaters.MultiprocessParallelUpdater is not multi-GPU training but single-GPU training. chainer.training.updaters.MultiprocessParallelUpdater is multi-GPU training, but I have never tried it before.
But in the documentation, it says
This is an implementation of Updater that uses multiple GPUs with multi-process data parallelism. It uses Nvidia NCCL for communication between multiple GPUs.
(https://docs.chainer.org/en/stable/reference/generated/chainer.training.updaters.MultiprocessParallelUpdater.html)
Second, you also need NCCL for multi-GPU training with ChainerMN (https://chainermn.readthedocs.io/en/stable/installation/guide.html#requirements).
Since ChainerMN and MultiprocessParallelUpdater both use NCCL, they do not work on Windows. I'd like to use ParallelUpdater because I want multi-GPU training on Windows.
Please ask about it on the chainer repo.
System information
Describe the problem
I want to modify chainercv/examples/ssd/train.py to use multiple GPUs in the same way as chainer/examples/mnist/train_mnist_data_parallel.py, which uses chainer.training.ParallelUpdater.
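For reference, the data-parallel pattern described here can be sketched roughly as below. This is only a minimal sketch, not the actual change made to train.py: the helper names `build_devices` and `make_parallel_updater` are hypothetical, and the check that the two GPU ids differ is an added assumption. The `devices` mapping with a `'main'` key is what `ParallelUpdater` expects; the name `'second'` is arbitrary.

```python
def build_devices(gpu0, gpu1):
    """Build the devices mapping passed to ParallelUpdater.

    'main' is the master device; the other key name is arbitrary.
    (Hypothetical helper; the distinct-id check is an added assumption.)
    """
    if gpu0 == gpu1:
        raise ValueError('gpu0 and gpu1 must be different devices')
    return {'main': gpu0, 'second': gpu1}


def make_parallel_updater(train_iter, optimizer, gpu0, gpu1):
    """Create a ParallelUpdater that splits each batch across two GPUs."""
    # Imported lazily so build_devices can be used without Chainer installed.
    from chainer import training

    return training.updaters.ParallelUpdater(
        train_iter, optimizer, devices=build_devices(gpu0, gpu1))


devices = build_devices(0, 1)
```

With this pattern the updater copies the model to each device itself, so the training script would not also move the model to a GPU by hand before constructing the updater.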
I modified chainercv/examples/ssd/train.py as follows.
111c111,114
115a119,120
127,129d131
154,155c156,157
When I ran it as follows, an error occurred.
$ python chainercv/examples/ssd/train.py --model ssd300 --batchsize 32 --gpu0 0 --gpu1 1 --out result
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: an illegal memory access was encountered
/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
epoch       iteration   lr          main/loss   main/loss/loc  main/loss/conf  validation/main/map
0           10          0.001       17.3065     2.95776        14.3487
0           20          0.001       16.746      2.94444        13.8015
0           30          0.001       28.569      4.0737         24.4953
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: an illegal memory access was encountered
(core dumped) python ssd_modify/train_multi_gpu.py --model ssd300 --batchsize 32 --gpu0 0 --gpu0 1 --out result
Process ForkPoolWorker-8:imated time to finish: 2 days, 11:11:29.183312.
Process ForkPoolWorker-4:
Process ForkPoolWorker-12:
Process ForkPoolWorker-10:
Traceback (most recent call last):
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/pool.py", line 125, in worker
    put((job, i, result))
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/queues.py", line 347, in put
    self._writer.send_bytes(obj)
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
    self.run()
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/pool.py", line 130, in worker
    put((job, i, (False, wrapped)))
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/queues.py", line 347, in put
    self._writer.send_bytes(obj)
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
...
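One detail worth checking in the log: the second command line passes `--gpu0` twice (`--gpu0 0 --gpu0 1`) and never passes `--gpu1`. This may or may not be related to the crash. If the script parses its options with argparse in the usual way (an assumption; the parser and defaults below are hypothetical, modeled on the MNIST example), the last `--gpu0` wins and `--gpu1` silently falls back to its default. A small sketch of that behavior:

```python
import argparse


def build_parser():
    # Hypothetical parser; the real script's options and defaults may differ.
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpu0', type=int, default=0)
    parser.add_argument('--gpu1', type=int, default=1)
    return parser


# Same GPU flags as in the second command line of the log.
args = build_parser().parse_args(['--gpu0', '0', '--gpu0', '1'])
print('gpu0 =', args.gpu0, 'gpu1 =', args.gpu1)
```

Under these assumed defaults both options end up pointing at device 1, so printing the parsed device ids at startup is a cheap sanity check before training.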