awslabs / handwritten-text-recognition-for-apache-mxnet

This repository lets you train neural network models for end-to-end full-page handwriting recognition on the IAM Dataset using the Apache MXNet deep learning framework.
Apache License 2.0

ConnectionResetError at word segmentation #40

Closed jbuehler1337 closed 4 years ago

jbuehler1337 commented 4 years ago

Hi, I already mentioned my problem, but I didn't find an existing issue describing what I am experiencing. When I run 2_line_word_segmentation.ipynb I get the following error:

ConnectionResetErrorTraceback (most recent call last)
<ipython-input-13-fbd64d2ad138> in <module>
      3     cls_metric = mx.metric.Accuracy()
      4     box_metric = mx.metric.MAE()
----> 5     train_loss = run_epoch(e, net, train_data, trainer, log_dir, print_name="train", is_train=True, update_metric=False)
      6     test_loss = run_epoch(e, net, test_data, trainer, log_dir, print_name="test", is_train=False, update_metric=True)
      7     if test_loss < best_test_loss:

<ipython-input-6-6b90c6f2ae19> in run_epoch(e, network, dataloader, trainer, log_dir, print_name, is_train, update_metric)
     32 
     33     total_losses = [0 for ctx_i in ctx]
---> 34     for i, (X, Y) in enumerate(dataloader):
     35         X = gluon.utils.split_and_load(X, ctx)
     36         Y = gluon.utils.split_and_load(Y, ctx)

/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py in __next__(self)
    503         try:
    504             if self._dataset is None:
--> 505                 batch = pickle.loads(ret.get(self._timeout))
    506             else:
    507                 batch = ret.get(self._timeout)

/usr/local/lib/python3.6/dist-packages/mxnet/gluon/data/dataloader.py in rebuild_ndarray(pid, fd, shape, dtype)
     59             fd = multiprocessing.reduction.rebuild_handle(fd)
     60         else:
---> 61             fd = fd.detach()
     62         return nd.NDArray(nd.ndarray._new_from_shared_mem(pid, fd, shape, dtype))
     63 

/usr/lib/python3.6/multiprocessing/resource_sharer.py in detach(self)
     55         def detach(self):
     56             '''Get the fd.  This should only be called once.'''
---> 57             with _resource_sharer.get_connection(self._id) as conn:
     58                 return reduction.recv_handle(conn)
     59 

/usr/lib/python3.6/multiprocessing/resource_sharer.py in get_connection(ident)
     85         from .connection import Client
     86         address, key = ident
---> 87         c = Client(address, authkey=process.current_process().authkey)
     88         c.send((key, os.getpid()))
     89         return c

/usr/lib/python3.6/multiprocessing/connection.py in Client(address, family, authkey)
    491 
    492     if authkey is not None:
--> 493         answer_challenge(c, authkey)
    494         deliver_challenge(c, authkey)
    495 

/usr/lib/python3.6/multiprocessing/connection.py in answer_challenge(connection, authkey)
    730     import hmac
    731     assert isinstance(authkey, bytes)
--> 732     message = connection.recv_bytes(256)         # reject large message
    733     assert message[:len(CHALLENGE)] == CHALLENGE, 'message = %r' % message
    734     message = message[len(CHALLENGE):]

/usr/lib/python3.6/multiprocessing/connection.py in recv_bytes(self, maxlength)
    214         if maxlength is not None and maxlength < 0:
    215             raise ValueError("negative maxlength")
--> 216         buf = self._recv_bytes(maxlength)
    217         if buf is None:
    218             self._bad_message_length()

/usr/lib/python3.6/multiprocessing/connection.py in _recv_bytes(self, maxsize)
    405 
    406     def _recv_bytes(self, maxsize=None):
--> 407         buf = self._recv(4)
    408         size, = struct.unpack("!i", buf.getvalue())
    409         if maxsize is not None and size > maxsize:

/usr/lib/python3.6/multiprocessing/connection.py in _recv(self, size, read)
    377         remaining = size
    378         while remaining > 0:
--> 379             chunk = read(handle, remaining)
    380             n = len(chunk)
    381             if n == 0:

ConnectionResetError: [Errno 104] Connection reset by peer

I am using a Docker image on a Linux system. Can you help me get the notebook to run?

jbuehler1337 commented 4 years ago

Hey, I just ran through all the notebooks without an error. The pickle files are generated properly, but I am still getting this error: ConnectionResetError: [Errno 104] Connection reset by peer

jbuehler1337 commented 4 years ago

Hey again. I solved the problem: a num_workers value of 2 was too high. I set num_workers to 1 and it works fine.
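For reference, a minimal sketch of how the worker count can be lowered when the Gluon DataLoaders are built (the dataset objects and batch size below are placeholders, not the notebook's exact values):

```python
from mxnet import gluon

# Assumption: train_ds / test_ds are the segmentation datasets created earlier
# in the notebook; batch_size is illustrative.
num_workers = 1  # num_workers=2 triggered the ConnectionResetError in this setup

train_data = gluon.data.DataLoader(
    train_ds, batch_size=32, shuffle=True, num_workers=num_workers)
test_data = gluon.data.DataLoader(
    test_ds, batch_size=32, shuffle=False, num_workers=num_workers)
```

With num_workers=1 the DataLoader still uses a background worker process, but avoids the multi-worker shared-memory handoff that was failing here.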

jonomon commented 4 years ago

Great!