Calamari-OCR / calamari

Line based ATR Engine based on OCRopy
Apache License 2.0

Got exception during training #328

Closed: stefanCCS closed this issue 2 years ago

stefanCCS commented 2 years ago

I tried to run calamari-train like this: calamari-train --device.gpus 0 --network deep3 --trainer.epochs 2 --train.batch_size 64 --trainer.output_dir model-01/ --early_stopping.n_to_go 0 --train.images "train/*.png" --val.images "eval/*.png"

During data preparation I got the exception shown below. Note: my training set is pretty big.

My machine has 16 GB of CPU RAM and a GeForce RTX 3050 GPU with 8 GB of RAM.

Any idea?

INFO     2022-08-10 10:11:07,660 calamari_ocr.ocr.dataset.datar: Resolving input files
INFO     2022-08-10 10:13:19,166 calamari_ocr.ocr.dataset.datar: Resolving input files
INFO     2022-08-10 10:13:26,400 tfaip.data.pipeline.datapipeli: Preloading: Converting training to raw pipeline.
Loading samples: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 164453/164453 [06:59<00:00, 392.06it/s]
Applying data processor CenterNormalizerProcessor: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 164453/164453 [01:31<00:00, 1796.59it/s]
Applying data processor FinalPreparation: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 164453/164453 [00:24<00:00, 6785.92it/s]
Applying data processor BidiTextProcessor: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 164453/164453 [00:24<00:00, 6635.26it/s]
Applying data processor StripTextProcessor: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 164453/164453 [00:23<00:00, 6936.43it/s]
Applying data processor TextNormalizerProcessor:  81%|███████████████████████████████████████████████████████████████████████████████████████▍                    | 133225/164453 [00:36<00:23, 1311.53it/s]Process ForkPoolWorker-34:
Process ForkPoolWorker-37:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 127, in worker
    put((job, i, result))
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 127, in worker
    put((job, i, result))
  File "/usr/lib/python3.7/multiprocessing/queues.py", line 364, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.7/multiprocessing/queues.py", line 364, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 397, in _send_bytes
    self._send(header)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 398, in _send_bytes
    self._send(buf)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 132, in worker
    put((job, i, (False, wrapped)))
  File "/usr/lib/python3.7/multiprocessing/queues.py", line 364, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
BrokenPipeError: [Errno 32] Broken pipe
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 132, in worker
    put((job, i, (False, wrapped)))
  File "/usr/lib/python3.7/multiprocessing/queues.py", line 364, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Killed
andbue commented 2 years ago

Hi Stefan, it's really hard to guess what's wrong here. When a GPU runs out of memory, it usually complains with an explicit error message; I have only seen this when trying to train large models in parallel on smaller GPUs. It could be that you are running out of RAM (which should be visible in a system monitor); that can be avoided by streaming the dataset from disk instead of preloading it (--train.preload False). Especially when you are using augmentation, more than 100k samples can become problematic.
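
For reference, adding that flag to the command from above would look roughly like this (everything else unchanged; the flag position should not matter):

calamari-train --device.gpus 0 --network deep3 --trainer.epochs 2 --train.batch_size 64 --trainer.output_dir model-01/ --early_stopping.n_to_go 0 --train.images "train/*.png" --val.images "eval/*.png" --train.preload False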

stefanCCS commented 2 years ago

It works much better with --train.preload False. Now I see this warning at the beginning of each epoch: WARNING 2022-08-10 13:27:35,278 tfaip.util.multiprocessing.dat: Invalid data. Skipping for the 1. time. What does it mean? (It looks like training still runs fine...)

By the way, I have noticed that GPU memory usage is pretty high [screenshot of GPU memory usage], so maybe this really was the reason. The host's RAM usage is fine (about 6 GB out of 16 GB).

andbue commented 2 years ago

"Invalid data" could be an image file that is broken/empty/too short for the given text. If the counter does not grow > 100, I would simply ignore that. I don't know if the output of nvidia_smi is helpful here: as far as I remember, tensorflow grabs and reserves what it can get hold of.

stefanCCS commented 2 years ago

"Ignoring" is one the the best things I can do ;-)

Otherwise, I will close this issue now, as training runs fine and it does not look like we will find a real root cause.