chainer / chainercv

ChainerCV: a Library for Deep Learning in Computer Vision
MIT License

modify the chainercv/examples/ssd/train.py to use Multi-GPUs. #691

Closed: kurita236 closed this issue 6 years ago

kurita236 commented 6 years ago

System information

Describe the problem

I want to modify chainercv/examples/ssd/train.py to use multiple GPUs, the same way chainer/examples/mnist/train_mnist_data_parallel.py does. That example uses chainer.training.ParallelUpdater.
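For context, single-process data parallelism divides each minibatch among the GPUs, so a minibatch of 32 on two GPUs gives each device 16 samples. The sketch below illustrates that idea in plain Python. `split_batch` is a hypothetical helper, not a Chainer function, and the interleaved split is an assumption about how the work is divided, not a copy of Chainer's code.

```python
def split_batch(batch, n_devices):
    """Hypothetical helper: split one minibatch into interleaved sub-batches.

    Sub-batch i receives elements i, i + n_devices, i + 2 * n_devices, ...
    """
    if n_devices < 1:
        raise ValueError('n_devices must be at least 1')
    return [batch[i::n_devices] for i in range(n_devices)]


# A stand-in minibatch of 32 sample indices, split across two devices.
batch = list(range(32))
sub_batches = split_batch(batch, 2)
sizes = [len(sub) for sub in sub_batches]
```

Each device then computes gradients on its own sub-batch, and the gradients are combined on the main device before the optimizer step.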

I modified chainercv/examples/ssd/train.py as follows.

111c111,114
< parser.add_argument('--gpu', type=int, default=-1)
---
> parser.add_argument('--gpu0', '-g', type=int, default=0,
>                     help='First GPU ID')
> parser.add_argument('--gpu1', '-G', type=int, default=1,
>                     help='Second GPU ID')
115a119,120
< if args.model == 'ssd300':
<     model = SSD300(
<         n_fg_class=len(voc_bbox_label_names),
<         pretrained_model='imagenet')
< elif args.model == 'ssd512':
<     model = SSD512(
<         n_fg_class=len(voc_bbox_label_names),
<         pretrained_model='imagenet')
<
< model.use_preset('evaluate')
< train_chain = MultiboxTrainChain(model)
< if args.gpu >= 0:
<     chainer.cuda.get_device_from_id(args.gpu).use()
<     model.to_gpu()
127,129d131
> chainer.cuda.get_device_from_id(args.gpu0).use()
>
> if args.model == 'ssd300':
>     model = SSD300(
>         n_fg_class=len(voc_bbox_label_names),
>         pretrained_model='imagenet')
> elif args.model == 'ssd512':
>     model = SSD512(
>         n_fg_class=len(voc_bbox_label_names),
>         pretrained_model='imagenet')
>
> model.use_preset('evaluate')
> train_chain = MultiboxTrainChain(model)
154,155c156,157
< updater = training.updaters.StandardUpdater(
<     train_iter, optimizer, device=args.gpu)
---
> updater = training.updaters.ParallelUpdater(
>     train_iter, optimizer, devices={'main': args.gpu0, 'second': args.gpu1})
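The `devices` argument in the change above maps names to GPU IDs, with the `'main'` key marking the master device. A minimal sketch of building that mapping from the two flags follows. `build_devices` and `parse_args` are hypothetical helpers (neither is part of Chainer or ChainerCV), and the check that the two IDs differ is an added assumption, not something the original script does.

```python
import argparse


def parse_args(argv=None):
    """Parse the two GPU flags used in the modified train.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpu0', '-g', type=int, default=0,
                        help='First GPU ID')
    parser.add_argument('--gpu1', '-G', type=int, default=1,
                        help='Second GPU ID')
    return parser.parse_args(argv)


def build_devices(gpu0, gpu1):
    """Hypothetical helper: build the devices dict, rejecting duplicate IDs."""
    if gpu0 == gpu1:
        raise ValueError(
            'gpu0 and gpu1 must differ, got {} for both'.format(gpu0))
    return {'main': gpu0, 'second': gpu1}


args = parse_args(['--gpu0', '0', '--gpu1', '1'])
devices = build_devices(args.gpu0, args.gpu1)

# argparse keeps the last value when a flag is repeated, so passing
# --gpu0 twice leaves gpu1 at its default of 1.
repeated = parse_args(['--gpu0', '0', '--gpu0', '1'])
```

The resulting `devices` dict can be passed as `devices=devices` to the updater. The guard is there because a repeated or mistyped flag can silently assign both names to the same GPU.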

When I run it as follows, an error occurs.

$ python chainercv/examples/ssd/train.py --model ssd300 --batchsize 32 --gpu0 0 --gpu1 1 --out result
terminate called after throwing an instance of 'thrust::system::system_error'
  what(): merge_sort: failed to synchronize: an illegal memory access was encountered

/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
  from ._conv import register_converters as _register_converters
epoch  iteration  lr     main/loss  main/loss/loc  main/loss/conf  validation/main/map
0      10         0.001  17.3065    2.95776        14.3487
0      20         0.001  16.746     2.94444        13.8015
0      30         0.001  28.569     4.0737         24.4953
terminate called after throwing an instance of 'thrust::system::system_error'
  what(): merge_sort: failed to synchronize: an illegal memory access was encountered
(core dumped) python ssd_modify/train_multi_gpu.py --model ssd300 --batchsize 32 --gpu0 0 --gpu0 1 --out result
Process ForkPoolWorker-8:imated time to finish: 2 days, 11:11:29.183312.
Process ForkPoolWorker-4:
Process ForkPoolWorker-12:
Process ForkPoolWorker-10:
Traceback (most recent call last):
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/pool.py", line 125, in worker
    put((job, i, result))
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/queues.py", line 347, in put
    self._writer.send_bytes(obj)
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
    self.run()
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/pool.py", line 130, in worker
    put((job, i, (False, wrapped)))
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/queues.py", line 347, in put
    self._writer.send_bytes(obj)
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/opt/pyenv/versions/anaconda3-5.2.0/envs/chainer/lib/python3.5/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
...

knorth55 commented 6 years ago

How about using examples/ssd/train_multi.py? https://github.com/chainer/chainercv/blob/master/examples/ssd/train_multi.py

kurita236 commented 6 years ago

chainer.training.updaters.MultiprocessParallelUpdater requires NCCL. I want to run on Windows using chainer.training.updaters.ParallelUpdater.

knorth55 commented 6 years ago

First, I initially wrote that chainer.training.updaters.MultiprocessParallelUpdater is single-GPU training, not multi-GPU training. Sorry, I got confused. chainer.training.updaters.MultiprocessParallelUpdater does perform multi-GPU training, but I have never tried it. The documentation says:

This is an implementation of Updater that uses multiple GPUs with multi-process data parallelism. It uses Nvidia NCCL for communication between multiple GPUs.

(https://docs.chainer.org/en/stable/reference/generated/chainer.training.updaters.MultiprocessParallelUpdater.html)

Second, you also need NCCL for multi-GPU training with ChainerMN. (https://chainermn.readthedocs.io/en/stable/installation/guide.html#requirements)

kurita236 commented 6 years ago

Since ChainerMN and MultiprocessParallelUpdater both use NCCL, they do not work on Windows. I'd like to use ParallelUpdater because I want multi-GPU training on Windows.
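The constraint described here can be written down as a small selection rule: use the NCCL-based updater only where NCCL is usable, and fall back to ParallelUpdater otherwise. The sketch below is illustrative only. `pick_updater_name` is a hypothetical helper, and treating Windows as always lacking NCCL is an assumption taken from this thread, not something checked against a particular Chainer build.

```python
import platform


def pick_updater_name(system_name, nccl_available):
    """Hypothetical helper: name the updater class that fits the platform.

    Assumes, as stated in this thread, that NCCL-based updaters are not an
    option on Windows.
    """
    if system_name != 'Windows' and nccl_available:
        return 'MultiprocessParallelUpdater'
    return 'ParallelUpdater'


# platform.system() returns e.g. 'Windows', 'Linux' or 'Darwin'.
# nccl_available is passed explicitly here; a real script would have to
# detect it for the installed build.
current_choice = pick_updater_name(platform.system(), nccl_available=False)
```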

knorth55 commented 6 years ago

Please ask this on the chainer repo.