Closed: Angzz closed this issue 6 years ago.
@Angzz I found Python 2 does have some stability issues during training. I personally always use Python 3 for training and have not seen the issues you reported.
Let me know if it still exists after https://github.com/apache/incubator-mxnet/pull/11908
1. When I run SSD on a single GPU, I encounter a problem like this:
```
INFO:root:Namespace(batch_size=15, data_shape=512, dataset='voc', epochs=240, gpus='0', log_interval=100, lr=0.001, lr_decay=0.1, lr_decay_epoch='160,200', momentum=0.9, network='resnet50_v1', num_workers=32, resume='', save_interval=10, save_prefix='ssd_512_resnet50_v1_voc', seed=233, start_epoch=0, val_interval=1, wd=0.0005)
INFO:root:Start training from [Epoch 0]
[02:21:20] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:[Epoch 0][Batch 99], Speed: 37.221 samples/sec, CrossEntropy=7.690, SmoothL1=3.245
INFO:root:[Epoch 0][Batch 199], Speed: 36.276 samples/sec, CrossEntropy=6.458, SmoothL1=3.105
INFO:root:[Epoch 0][Batch 299], Speed: 38.109 samples/sec, CrossEntropy=5.929, SmoothL1=2.991
Traceback (most recent call last):
  File "scripts/ssd/train_ssd.py", line 259, in <module>
    train(net, train_data, val_data, eval_metric, args)
  File "scripts/ssd/train_ssd.py", line 192, in train
    for i, batch in enumerate(train_data):
  File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 222, in __next__
    return self.next()
  File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 218, in next
    idx, batch = self._data_queue.get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 117, in get
    res = self._recv()
  File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 88, in recv
    return pickle.loads(buf)
  File "/usr/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 53, in rebuild_ndarray
    fd = multiprocessing.reduction.rebuild_handle(fd)
  File "/usr/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
    conn = Client(address, authkey=current_process().authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 169, in Client
    c = SocketClient(address)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 308, in SocketClient
    s.connect(address)
  File "/usr/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 111] Connection refused
```
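For context on the last frames: in Python 2's `multiprocessing`, a worker rebuilding a shared object connects back to the parent over a socket authenticated with the process `authkey`; if the listening end has already gone away, the connect fails with `[Errno 111] Connection refused`. A minimal stdlib sketch of that connect-back handshake and its failure mode (illustrative only — not GluonCV code, and the key value is made up):

```python
import threading
from multiprocessing.connection import Listener, Client

AUTHKEY = b"secret"  # both ends must share this key (hypothetical value)

def demo():
    # "Parent" side: listen on an ephemeral localhost port.
    listener = Listener(("127.0.0.1", 0), authkey=AUTHKEY)
    addr = listener.address

    def serve():
        conn = listener.accept()          # the accept handshake verifies the authkey
        conn.send({"batch": [1, 2, 3]})   # stand-in for a shared-memory handle
        conn.close()

    t = threading.Thread(target=serve)
    t.start()

    # "Worker" side: connect back with the matching authkey and receive the payload.
    client = Client(addr, authkey=AUTHKEY)
    payload = client.recv()
    client.close()
    t.join()
    listener.close()

    # Once the listener is gone, reconnecting fails just like the traceback above.
    refused = False
    try:
        Client(addr, authkey=AUTHKEY)
    except ConnectionError:               # [Errno 111] Connection refused on Linux
        refused = True
    return payload, refused
```

A common workaround in the meantime is `num_workers=0` on the DataLoader, which keeps loading in the main process and avoids this connect-back entirely (at the cost of throughput).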
2. When I run it on multiple GPUs (e.g. 2), the above error still occurs. Also, although the batch size is doubled compared with a single GPU, the throughput is not doubled; it is only about 43–48 samples/sec. I think this is unreasonable and suggests the training process is not stable.
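The numbers reported can be checked directly: going from ~37 samples/sec on one GPU to ~43–48 on two is only about a 1.2x speedup, far from the ideal 2x, which usually points to a data-loading or synchronization bottleneck rather than GPU compute. A quick scaling-efficiency check (figures taken from the logs above; the helper name is mine):

```python
def scaling_efficiency(single_gpu_rate, multi_gpu_rate, num_gpus):
    """Fraction of ideal linear speedup actually achieved."""
    return multi_gpu_rate / (single_gpu_rate * num_gpus)

single = 37.2          # samples/sec on 1 GPU (from the log)
multi = (43 + 48) / 2  # midpoint of the 43-48 samples/sec reported on 2 GPUs
eff = scaling_efficiency(single, multi, 2)
print("speedup: %.2fx, efficiency: %.0f%%" % (multi / single, eff * 100))
# -> speedup: 1.22x, efficiency: 61%
```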
3. The `resume` option still does not work in gluoncv, and I think it is very inconvenient to train from epoch 0 each time. Can you give me some suggestions? Thanks a lot!
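Until `--resume` is fixed, one possible workaround is to pick the newest saved checkpoint by filename and restart with the matching `--start-epoch` (the Namespace in the log above shows both flags exist). A sketch of locating the latest checkpoint, assuming files are named `<save_prefix>_NNNN.params`; this helper is hypothetical, not a GluonCV API:

```python
import os
import re

def latest_checkpoint(prefix, directory="."):
    """Return (path, epoch) of the newest '<prefix>_NNNN.params' file, or (None, 0)."""
    pattern = re.compile(re.escape(prefix) + r"_(\d{4})\.params$")
    best = (None, 0)
    for name in os.listdir(directory):
        m = pattern.match(name)
        if m and int(m.group(1)) >= best[1]:
            best = (os.path.join(directory, name), int(m.group(1)))
    return best
```

The returned path could then be passed as `--resume` (or loaded via `net.load_parameters`) together with `--start-epoch epoch + 1`, assuming the script's checkpoint naming matches.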