dmlc / gluon-cv

Gluon CV Toolkit
http://gluon-cv.mxnet.io
Apache License 2.0

socket.error in gluoncv ssd #197

Closed. Angzz closed this issue 6 years ago.

Angzz commented 6 years ago

1. When I run SSD on a single GPU, I encounter the following error (a workaround sketch is at the end of this comment):

INFO:root:Namespace(batch_size=15, data_shape=512, dataset='voc', epochs=240, gpus='0', log_interval=100, lr=0.001, lr_decay=0.1, lr_decay_epoch='160,200', momentum=0.9, network='resnet50_v1', num_workers=32, resume='', save_interval=10, save_prefix='ssd_512_resnet50_v1_voc', seed=233, start_epoch=0, val_interval=1, wd=0.0005)
INFO:root:Start training from [Epoch 0]
[02:21:20] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:[Epoch 0][Batch 99], Speed: 37.221 samples/sec, CrossEntropy=7.690, SmoothL1=3.245
INFO:root:[Epoch 0][Batch 199], Speed: 36.276 samples/sec, CrossEntropy=6.458, SmoothL1=3.105
INFO:root:[Epoch 0][Batch 299], Speed: 38.109 samples/sec, CrossEntropy=5.929, SmoothL1=2.991
Traceback (most recent call last):
  File "scripts/ssd/train_ssd.py", line 259, in <module>
    train(net, train_data, val_data, eval_metric, args)
  File "scripts/ssd/train_ssd.py", line 192, in train
    for i, batch in enumerate(train_data):
  File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 222, in __next__
    return self.next()
  File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 218, in next
    idx, batch = self._data_queue.get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 117, in get
    res = self._recv()
  File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 88, in recv
    return pickle.loads(buf)
  File "/usr/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 53, in rebuild_ndarray
    fd = multiprocessing.reduction.rebuild_handle(fd)
  File "/usr/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
    conn = Client(address, authkey=current_process().authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 169, in Client
    c = SocketClient(address)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 308, in SocketClient
    s.connect(address)
  File "/usr/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 111] Connection refused

2. When I run it on multiple GPUs (e.g. 2), the error above still occurs. I also noticed that although the batch size is doubled compared with a single GPU, the throughput is not doubled; it is only about 43~48 samples/sec. This seems unreasonable to me and suggests the training process is not stable.

3. Resume still does not work in gluoncv, and it is very inconvenient to train from epoch 0 every time (a minimal sketch of the resume behavior I expect follows).
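
For reference, this is roughly the resume behavior I am expecting. It is only a minimal sketch under the assumption that --resume points at a saved .params file and --start-epoch is the epoch to continue from; the names and structure are illustrative, not the actual train_ssd.py code:

```python
# Hypothetical sketch (not the actual train_ssd.py code): how I would expect
# resume / start_epoch to behave, assuming `args` is the argparse Namespace
# shown in the log above.
import mxnet as mx
from gluoncv import model_zoo

def build_net(args):
    net = model_zoo.get_model('ssd_512_resnet50_v1_voc', pretrained_base=True)
    if args.resume.strip():
        # Resume: load previously saved weights instead of re-initializing.
        # (Older MXNet versions call this method load_params.)
        net.load_parameters(args.resume.strip())
    else:
        net.initialize(mx.init.Xavier())
    return net

# Training would then continue from args.start_epoch instead of epoch 0:
#     for epoch in range(args.start_epoch, args.epochs):
#         ...
```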

Can you give me some suggestions? Thanks a lot!
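
For context, the socket.error in the traceback above is raised inside gluon.data.DataLoader's worker processes, which hand batches back to the main process over multiprocessing connections; the failure happens on that socket under Python 2. A workaround I would try (my own assumption, not an official fix) is to lower num_workers, or set it to 0 so data is loaded in the main process:

```python
# Hypothetical workaround sketch: avoid the multiprocessing socket path that
# raised "socket.error: [Errno 111] Connection refused" by lowering
# num_workers (0 means batches are produced in the main process).
# The dataset below is only a stand-in for the real SSD training dataset.
import numpy as np
from mxnet import gluon

dummy_dataset = gluon.data.ArrayDataset(
    np.random.uniform(size=(32, 3, 512, 512)).astype('float32'),
    np.random.uniform(size=(32, 1)).astype('float32'),
)

train_data = gluon.data.DataLoader(
    dummy_dataset,
    batch_size=15,
    shuffle=True,
    last_batch='rollover',
    num_workers=0,   # was 32 in the Namespace above; 0 disables worker processes
)

for batch_data, batch_label in train_data:
    pass  # the training step would go here
```

The Namespace in the log shows num_workers=32, so the same change can presumably be made through the script's corresponding command-line option; fewer workers trades some data-loading throughput for avoiding the inter-process socket entirely.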

zhreshold commented 6 years ago

@Angzz I have found that Python 2 does have some stability issues during training. I personally always use Python 3 for training and do not see the issues you reported.

zhreshold commented 6 years ago

Let me know if it still exists after https://github.com/apache/incubator-mxnet/pull/11908