ThomasDelteil / VisualSearch_MXNet

Visual Search using Apache MXNet and gluon

Failure when Running Training Model on AWS Batch #8

Closed OElesin closed 5 years ago

OElesin commented 5 years ago

I was able to follow the tutorial and reproduce the model and results for my use case. However, when I schedule the model training on AWS Batch (EC2 instance m4.4xlarge), it fails with the error below while extracting features. See the loop below:

for i, (data, label) in enumerate(data_loader):
    # Move the batch to the compute context
    data = data.as_in_context(ctx)
    # Periodically report throughput
    if i % n_print == 0 and i > 0:
        print(
            "{0} batches, {1} images, {2:.3f} img/sec".format(
                i, i*BATCH_SIZE, BATCH_SIZE*n_print/(time.time()-tick)
            )
        )
        tick = time.time()
    # Forward pass through the network and store the extracted features
    output = net(data)
    features[i * BATCH_SIZE:(i+1)*max(BATCH_SIZE, len(output)), :] = output.asnumpy().squeeze()

Error message

save(x)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/multiprocessing/forking.py", line 66, in dispatcher
rv = reduce(obj)
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 43, in reduce_ndarray
return rebuild_ndarray, data._to_shared_mem()
File "/usr/local/lib/python2.7/dist-packages/mxnet/ndarray/ndarray.py", line 200, in _to_shared_mem
self.handle, ctypes.byref(shared_pid), ctypes.byref(shared_id)))
File "/usr/local/lib/python2.7/dist-packages/mxnet/base.py", line 149, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [14:48:14] src/operator/tensor/../tensor/elemwise_unary_op.h:301: Check failed: inputs[0].dptr_ == outputs[0].dptr_ (0x7fe0beffc040 vs. 0x7fe0bf001600)
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x17ec9d) [0x7fe11ec74c9d]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x17f068) [0x7fe11ec75068]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x8f7034) [0x7fe11f3ed034]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x2825020) [0x7fe12131b020]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x27a3ad8) [0x7fe121299ad8]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x27a3b13) [0x7fe121299b13]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x27ab954) [0x7fe1212a1954]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x27af461) [0x7fe1212a5461]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x27ac01b) [0x7fe1212a201b]
[bt] (9) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7fe130197c80]

I have tried to figure this out, but no luck so far.

Please help if you have any ideas.

ThomasDelteil commented 5 years ago

This seems to be an issue with shared memory. Try setting num_workers=0 for your data loader?
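
For reference, a minimal sketch of what that could look like, assuming a dataset and BATCH_SIZE as in the tutorial (the names here are placeholders, not the exact tutorial code):

    from mxnet.gluon.data import DataLoader

    # num_workers=0 keeps data loading in the main process,
    # avoiding the multiprocessing shared-memory path that fails here
    data_loader = DataLoader(dataset, batch_size=BATCH_SIZE,
                             last_batch='keep', num_workers=0)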

OElesin commented 5 years ago

It works when I set num_workers=0. However, training seemed a bit slow. Is there a way to improve this even with the shared memory constraint?

Thanks

ThomasDelteil commented 5 years ago

Try setting thread_pool=True and num_workers > 0, and see how much faster you get.
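
Something along these lines (thread_pool is only available in more recent MXNet releases, and the worker count here is just illustrative):

    from mxnet.gluon.data import DataLoader

    # thread_pool=True uses a thread pool instead of worker processes,
    # so batches are not passed through multiprocessing shared memory
    data_loader = DataLoader(dataset, batch_size=BATCH_SIZE,
                             num_workers=4, thread_pool=True)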

ThomasDelteil commented 5 years ago

@OElesin any update on this? How did it go? Please reopen if you need further assistance.

OElesin commented 5 years ago

Sorry for the late update. The thread_pool keyword was not available in the version of MXNet I was using, which is kind of surprising.
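
In case it helps anyone else hitting this, you can check which version is installed with something like the snippet below; thread_pool was only added in a later MXNet release, so older versions will reject the keyword argument:

    import mxnet as mx

    # If this prints an older release, DataLoader(..., thread_pool=True)
    # will raise a TypeError for the unexpected keyword argument
    print(mx.__version__)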