Calamari-OCR / calamari

Line based ATR Engine based on OCRopy
GNU General Public License v3.0

Issue on CTC loss when training on new data #66

Closed realjoenguyen closed 4 years ago

realjoenguyen commented 5 years ago

Hi,

When training Calamari on my dataset, I get this error: tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.

Can you help me? Thank you

ChWick commented 5 years ago

Usually, that means that something is wrong with your data. Can you share your data and the exact command you used?

realjoenguyen commented 5 years ago

Hi there,

  1. This is the output when I run calamari:
#0.000000: loss=475.23452759 ler=1.92407715 dt=80.18787289s
 PRED: '‪ÉWÉ.ọồSỮgỢọẰ&ĨọWọụọừọĨọSỢọĨụSứỰòọỢỘẠọWĨòĨẰÙọừỘỪụẰĨừẰụỢỘĨŨgEọiĨVẰừẰỎẲŨọẰọĨV58ừVỄụŨ‬'
 TRUE: '‪Địa chỉ: Trần Hưng Đạo, Phường Lê Bình, Quận Cái Răng, Cần Thơ‬'
2019-02-13 03:30:22.041879: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
WARNING: Infinite loss. Skipping batch.

So there are two different samples involved here.

  2. My training data contains 60,000 samples. I also built a small dataset (1,000 random samples from those 60,000), and it still produces the same error. Here is the link ("files" contains the images, "labels" the OCR text labels): https://drive.google.com/drive/folders/1E2L8D7ZtrGQi7zOLeTLZSRbfOMdNbVF1?usp=sharing

  3. Here is the command:

#!/usr/bin/env bash

IMAGE_DIR=./data/small/files/
LABEL_DIR=./data/small/labels/

python3 ./calamari_ocr/scripts/train.py \
    --files "${IMAGE_DIR}" \
    --text_files "${LABEL_DIR}" \
    --num_threads 8 \
    --batch_size 10 \
    --display 1 \
    --output_dir ./out \
    --checkpoint_frequency 100 \
    --train_data_on_the_fly



Thank you so much for your help!!!
ChWick commented 5 years ago

Thanks for the provided information!

I haven't found the reason for the warning yet, but I was able to successfully train a model on your provided data (ignoring the warning).

The --display parameter probably does not do what you expect: when set to a value in [0, 1], output is shown relative to an epoch. I guess what you want is to see every iteration (that is not possible), but you can set it to a number greater than one to follow the learning progress, e.g. --display 10. Moreover, check whether you really need the --train_data_on_the_fly parameter, since it slows down training considerably (60K examples should fit completely into RAM).

realjoenguyen commented 5 years ago

But isn't this the error?

2019-02-13 03:30:22.041879: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.

I often see "Skipping batch", so I think those batches are skipped and the training process runs into errors?

Thank you for your help!

Edited: did you see the warning and "no valid path found" in CTC when using my dataset?

realjoenguyen commented 5 years ago

Also when I use --train_data_on_the_fly, I got this error:


Resolving input files
Found 60000 files in the dataset
datset = <calamari_ocr.ocr.datasets.file_dataset.FileDataSet object at 0x7fd6046751d0>
Preloading dataset type DataSetMode.TRAIN with size 60000
Loading Dataset: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 60000/60000 [07:59<00:00, 125.26it/s]
Traceback (most recent call last):
  File "./calamari_ocr/scripts/train.py", line 315, in <module>
    main()
  File "./calamari_ocr/scripts/train.py", line 311, in main
    run(args)
  File "./calamari_ocr/scripts/train.py", line 299, in run
    progress_bar=not args.no_progress_bars
  File "/root/TA/calamari/calamari_ocr/ocr/trainer.py", line 112, in train
    self.dataset.preload(processes=checkpoint_params.processes, progress_bar=progress_bar)
  File "/root/TA/calamari/calamari_ocr/ocr/datasets/input_dataset.py", line 60, in preload
    texts = self.text_processor.apply(txts, processes=processes, progress_bar=progress_bar)
  File "/root/TA/calamari/calamari_ocr/ocr/text_processing/text_processor.py", line 17, in apply
    return parallel_map(self._apply_single, txts, desc="Text Preprocessing", processes=processes, progress_bar=progress_bar)
  File "/root/TA/calamari/calamari_ocr/utils/multiprocessing.py", line 40, in parallel_map
    with multiprocessing.Pool(processes=processes, maxtasksperchild=max_tasks_per_child) as pool:
  File "/usr/lib/python3.5/multiprocessing/context.py", line 118, in Pool
    context=self.get_context())
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 168, in __init__
    self._repopulate_pool()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 233, in _repopulate_pool
    w.start()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 67, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
ChWick commented 5 years ago

Edited: did you see the warning and "no valid path found" in CTC when using my dataset?

Yes, here is my output (using --display 10):

#00000000: loss=516.87536621 ler=1.88759577 dt=44.58602881s
 PRED: '‪É&Ợ5Ù95ọ5ẠọẰừọSọỢỰẰĩgầVỢĨMọSIẰụẰŨĨ&ỚSĨòỢừỘẰọĨừọSỢọĨừọỢọSĨỘọừòSậWĨŨỘọẠọĨẰĨỐĨ9Ũ‬'
 TRUE: '‪Địa chỉ: Trần Hưng Đạo, Phường Lê Bình, Quận Cái Răng, Cần Thơ‬'
#00000010: loss=336.84377636 ler=1.15732210 dt=4.88046891s
 PRED: '‪‬'
 TRUE: '‪Địa chỉ: ấp Long Bình, Xã Long Điền A, Huyện Chợ Mới, An Giang‬'
2019-02-13 09:49:08.596539: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
WARNING: Infinite loss. Skipping batch.
#00000020: loss=260.75162315 ler=1.04867650 dt=3.11120788s
 PRED: '‪T H H‬'
 TRUE: '‪Mã số thuế: CÔNG TY TNHH DỊCH VỤ VÀ ĐÀO TẠO TRÍ ĐỨC‬'
#00000030: loss=229.56856613 ler=1.00270691 dt=2.37745589s
 PRED: '‪‬'
 TRUE: '‪Mã số thuế: DOANH NGHIỆP TƯ NHÂN THÚY TÀI‬'
#00000040: loss=213.85240898 ler=0.98507622 dt=2.02556430s
 PRED: '‪‬'
 TRUE: '‪Địa chỉ: 22/3/2 Phú Mộng, Phường Kim Long, Thành phố Huế, Thừa Thiên Huế.‬'

The warning means that an error occurred while computing the loss/gradients in a single iteration, so that batch is ignored, i.e. the weights are not updated for it. It appears roughly every 40-50 iterations, which means that about 1 out of 400-500 files (batch size 10) is probably corrupted and therefore ignored.
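
A frequent cause of "No valid path found" is a line whose transcription needs more CTC time steps than its image provides after preprocessing (CTC needs at least one frame per character plus a blank between repeated characters). The following is only a rough heuristic sketch, not part of Calamari: it assumes PNG line images with matching .txt transcriptions, a processing height of 48 px, and a width reduction factor of 4 from pooling; adjust these to your setup.

from pathlib import Path

from PIL import Image

IMAGE_DIR = Path("./data/files")   # placeholder paths, adjust to your layout
LABEL_DIR = Path("./data/labels")
TARGET_HEIGHT = 48                 # assumed line height used during preprocessing
WIDTH_REDUCTION = 4                # assumed downscaling of the width before the CTC layer

for img_path in sorted(IMAGE_DIR.glob("*.png")):
    label_path = LABEL_DIR / (img_path.stem + ".txt")
    if not label_path.exists():
        print("missing label:", img_path.name)
        continue
    text = label_path.read_text(encoding="utf-8").strip()
    with Image.open(img_path) as im:
        w, h = im.size
    # Approximate number of time steps after rescaling to TARGET_HEIGHT and pooling.
    time_steps = int(w * TARGET_HEIGHT / max(h, 1)) // WIDTH_REDUCTION
    # Minimal CTC path length: one frame per character plus one blank per repeated pair.
    repeats = sum(1 for a, b in zip(text, text[1:]) if a == b)
    if time_steps < len(text) + repeats:
        print("suspicious:", img_path.name, f"({time_steps} frames for {len(text)} chars)")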

ChWick commented 5 years ago

Also when I use --train_data_on_the_fly, I got this error:

For the smaller dataset, training needs approximately 3 GB of RAM when everything is loaded into memory. The full dataset is probably too large to fit completely into RAM, which is why you have to use --train_data_on_the_fly here; with it, no more than about 4 GB of RAM should be required.

Please test whether the number of files is the reason for this out-of-memory error.
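
One way to test this is to train on a randomly sampled subset of the data and increase its size step by step until the error reappears. A minimal sketch for building such a subset (it assumes PNG line images and transcriptions in .txt files with the same base name; all paths are placeholders):

import random
import shutil
from pathlib import Path

IMAGE_DIR = Path("./data/files")            # placeholder paths, adjust to your layout
LABEL_DIR = Path("./data/labels")
OUT_IMAGE_DIR = Path("./data/subset/files")
OUT_LABEL_DIR = Path("./data/subset/labels")
SUBSET_SIZE = 1000                          # increase step by step to find where the OOM starts

OUT_IMAGE_DIR.mkdir(parents=True, exist_ok=True)
OUT_LABEL_DIR.mkdir(parents=True, exist_ok=True)

images = sorted(IMAGE_DIR.glob("*.png"))    # assumption: PNG line images
for img in random.sample(images, min(SUBSET_SIZE, len(images))):
    label = LABEL_DIR / (img.stem + ".txt")  # assumption: matching .txt transcription
    if label.exists():
        shutil.copy(img, OUT_IMAGE_DIR / img.name)
        shutil.copy(label, OUT_LABEL_DIR / label.name)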

realjoenguyen commented 5 years ago

Thank you. Can you suggest how I can avoid "No valid path found" in the CTC loss?

ChWick commented 5 years ago

As I said, usually this is a mismatch within a single GT (text, image) pair, most probably a line that is corrupt (e.g. an image that is rotated by 90 degrees, or completely white, ...). Unfortunately, I did not find such a line when scrolling through your data.

An idea to find the 'bad' files:
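
For example (only a rough sketch, not an exhaustive check), you could flag images that are nearly white or taller than they are wide, assuming PNG line images in a single directory:

from pathlib import Path

import numpy as np
from PIL import Image

IMAGE_DIR = Path("./data/files")  # placeholder path, adjust to your layout

for img_path in sorted(IMAGE_DIR.glob("*.png")):
    with Image.open(img_path) as im:
        gray = np.asarray(im.convert("L"), dtype=np.float32) / 255.0
    h, w = gray.shape
    if gray.mean() > 0.99:  # almost no dark pixels: probably an empty (white) line
        print("nearly white:", img_path.name)
    if h > w:               # taller than wide: possibly rotated by 90 degrees
        print("odd aspect ratio:", img_path.name, f"({w}x{h})")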

thak123 commented 5 years ago

@gofortargets how did you generate the dataset for training?

realjoenguyen commented 5 years ago

From here: https://github.com/Belval/TextRecognitionDataGenerator

srikanthsampathi commented 4 years ago

How do I stop training? The iteration count is now 8790.

srikanthsampathi commented 4 years ago

loss=0.55423099 ler=0.15204727 dt=4.05446383s is reached at iteration 8790. When will training stop, or should I stop it manually and use that model checkpoint for prediction?

ChWick commented 4 years ago

@srikanthsampathi Three options: