Usually, that means that something is wrong with your data. Can you share your data and the command you used?
Hi there,
```
#0.000000: loss=475.23452759 ler=1.92407715 dt=80.18787289s
 PRED: 'ÉWÉ.ọồSỮgỢọẰ&ĨọWọụọừọĨọSỢọĨụSứỰòọỢỘẠọWĨòĨẰÙọừỘỪụẰĨừẰụỢỘĨŨgEọiĨVẰừẰỎẲŨọẰọĨV58ừVỄụŨ'
 TRUE: 'Địa chỉ: Trần Hưng Đạo, Phường Lê Bình, Quận Cái Răng, Cần Thơ'
2019-02-13 03:30:22.041879: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
WARNING: Infinite loss. Skipping batch.
```
So there are 2 different samples here.
My training data contains 60,000 samples, and I also created a small dataset (1,000 random samples drawn from those 60,000). It still produces the same error. Here is the link ("files" holds the images, "labels" holds the OCR text labels): https://drive.google.com/drive/folders/1E2L8D7ZtrGQi7zOLeTLZSRbfOMdNbVF1?usp=sharing
Here is the command:
```
IMAGE_DIR=./data/small/files/
LABEL_DIR=./data/small/labels/
python3 ./calamari_ocr/scripts/train.py \
    --files "${IMAGE_DIR}" \
    --text_files "${LABEL_DIR}" \
    --num_threads 8 \
    --batch_size 10 \
    --display 1 \
    --output_dir ./out \
    --checkpoint_frequency 100 \
    --train_data_on_the_fly
```
Thank you so much for your help!!!
Thanks for the provided information!
I haven't found the reason for the warning yet; however, I was able to successfully train a model on your provided data (ignoring the warning). The `--display` parameter probably does not do what you expect: when set to a value in [0, 1], the output is shown relative to an epoch. I guess you want to display every single iteration (this is not possible), but you can set it to a number greater than one to see the learning progress, e.g. `--display 10`.
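To illustrate what that means in practice, here is a rough sketch with hypothetical numbers (not Calamari's actual code):

```python
# Hypothetical illustration of the --display semantics described above.
# Assumes 1000 training samples and batch size 10, i.e. 100 iterations
# per epoch; this is a sketch, not Calamari's implementation.
samples, batch_size, display = 1000, 10, 0.5
iters_per_epoch = samples // batch_size
interval = int(display * iters_per_epoch) if display <= 1 else int(display)
print(interval)  # 50 -> progress printed every 50 iterations (twice per epoch)
```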
Moreover, check whether you really need the `--train_data_on_the_fly` parameter, since it slows down the computation considerably (60k examples should fit completely in RAM).
But isn't this the error?

```
2019-02-13 03:30:22.041879: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
```

Because I often see "Skipping batch", I think those batches are skipped and the training process runs into errors?
Thank you for your help!
Edit: did you see the warning "No valid path found" from the CTC loss when using my dataset?
Also, when I use `--train_data_on_the_fly`, I get this error:
```
Resolving input files
Found 60000 files in the dataset
datset = <calamari_ocr.ocr.datasets.file_dataset.FileDataSet object at 0x7fd6046751d0>
Preloading dataset type DataSetMode.TRAIN with size 60000
Loading Dataset: 100%|███████████████████████████████████████████████| 60000/60000 [07:59<00:00, 125.26it/s]
Traceback (most recent call last):
  File "./calamari_ocr/scripts/train.py", line 315, in <module>
    main()
  File "./calamari_ocr/scripts/train.py", line 311, in main
    run(args)
  File "./calamari_ocr/scripts/train.py", line 299, in run
    progress_bar=not args.no_progress_bars
  File "/root/TA/calamari/calamari_ocr/ocr/trainer.py", line 112, in train
    self.dataset.preload(processes=checkpoint_params.processes, progress_bar=progress_bar)
  File "/root/TA/calamari/calamari_ocr/ocr/datasets/input_dataset.py", line 60, in preload
    texts = self.text_processor.apply(txts, processes=processes, progress_bar=progress_bar)
  File "/root/TA/calamari/calamari_ocr/ocr/text_processing/text_processor.py", line 17, in apply
    return parallel_map(self._apply_single, txts, desc="Text Preprocessing", processes=processes, progress_bar=progress_bar)
  File "/root/TA/calamari/calamari_ocr/utils/multiprocessing.py", line 40, in parallel_map
    with multiprocessing.Pool(processes=processes, maxtasksperchild=max_tasks_per_child) as pool:
  File "/usr/lib/python3.5/multiprocessing/context.py", line 118, in Pool
    context=self.get_context())
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 168, in __init__
    self._repopulate_pool()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 233, in _repopulate_pool
    w.start()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 67, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
```
> Edit: did you see the warning "No valid path found" from the CTC loss when using my dataset?
Yes. My output (using `--display 10`):
```
#00000000: loss=516.87536621 ler=1.88759577 dt=44.58602881s
 PRED: 'É&Ợ5Ù95ọ5ẠọẰừọSọỢỰẰĩgầVỢĨMọSIẰụẰŨĨ&ỚSĨòỢừỘẰọĨừọSỢọĨừọỢọSĨỘọừòSậWĨŨỘọẠọĨẰĨỐĨ9Ũ'
 TRUE: 'Địa chỉ: Trần Hưng Đạo, Phường Lê Bình, Quận Cái Răng, Cần Thơ'
#00000010: loss=336.84377636 ler=1.15732210 dt=4.88046891s
 PRED: ''
 TRUE: 'Địa chỉ: ấp Long Bình, Xã Long Điền A, Huyện Chợ Mới, An Giang'
2019-02-13 09:49:08.596539: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.
WARNING: Infinite loss. Skipping batch.
#00000020: loss=260.75162315 ler=1.04867650 dt=3.11120788s
 PRED: 'T H H'
 TRUE: 'Mã số thuế: CÔNG TY TNHH DỊCH VỤ VÀ ĐÀO TẠO TRÍ ĐỨC'
#00000030: loss=229.56856613 ler=1.00270691 dt=2.37745589s
 PRED: ''
 TRUE: 'Mã số thuế: DOANH NGHIỆP TƯ NHÂN THÚY TÀI'
#00000040: loss=213.85240898 ler=0.98507622 dt=2.02556430s
 PRED: ''
 TRUE: 'Địa chỉ: 22/3/2 Phú Mộng, Phường Kim Long, Thành phố Huế, Thừa Thiên Huế.'
```
The warning means that in a single iteration an error occurred during the computation of the loss/gradients, which is why that single batch is ignored, i.e. the weights are not updated. It appears approximately every 40-50 iterations, which means that roughly 1 out of 400-500 files (at batch size 10) is probably corrupted and is therefore skipped.
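The warning itself comes from TensorFlow's CTC loss: it is raised whenever no valid alignment between the label and the output sequence exists, which makes the loss infinite. A minimal reproduction sketch (assuming TensorFlow 2.x; the exact log output depends on the TF version and device):

```python
import tensorflow as tf  # assumes TensorFlow 2.x

# 3 time steps cannot carry a 5-character label: no CTC alignment exists,
# so the loss is infinite; on CPU, TF logs "No valid path found".
logits = tf.random.normal([3, 1, 6])                 # [time, batch, classes]
labels = tf.sparse.from_dense(tf.constant([[1, 2, 3, 4, 5]]))
loss = tf.nn.ctc_loss(labels=labels, logits=logits,
                      label_length=None, logit_length=[3],
                      blank_index=0)
print(loss.numpy())  # [inf]
```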
> Also, when I use `--train_data_on_the_fly`, I get this error:
Training on the smaller dataset requires approximately 3 GB of RAM when the data is loaded into memory. The full dataset is probably too large to fit completely in RAM, which is why you have to use `--train_data_on_the_fly` here; with that flag, no more than 4 GB of RAM should be required. Please test whether the number of files is the reason for this out-of-memory error.
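To gauge whether preloading will fit into memory, a quick back-of-the-envelope estimate can help (all numbers below are assumptions, not measurements of your dataset):

```python
# Rough preload-memory estimate; adjust the assumed line-image size
# to match your data.
n_images = 60_000
avg_width, height = 1_000, 48   # hypothetical average line-image size
bytes_per_px = 4                # float32 after preprocessing
print(f"~{n_images * avg_width * height * bytes_per_px / 1024**3:.1f} GB")
# ~10.7 GB with these assumptions -> if that exceeds your available RAM,
# --train_data_on_the_fly (or more memory) is needed.
```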
Thank you! Can you suggest how I can avoid "No valid path found" in the CTC loss?
As I said, usually this is a mismatch in a single (text, image) ground-truth pair, most probably a corrupt line (e.g. an image that is rotated by 90 degrees, or completely white, ...). Unfortunately, I did not find such a line when scrolling through your data.
An idea to find the 'bad' files:
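For example, a small script along these lines (a sketch, not from the original thread; it assumes PNG line images, one UTF-8 label file per image with a matching basename, and that the default Calamari network downsamples the image width by a factor of 4) flags pairs whose label cannot be aligned by CTC:

```python
import os
from PIL import Image  # pip install pillow

IMAGE_DIR = "./data/small/files"
LABEL_DIR = "./data/small/labels"
DOWNSAMPLE = 4  # assumption: the default net halves the width twice

for label_name in sorted(os.listdir(LABEL_DIR)):
    base = os.path.splitext(label_name)[0]
    with open(os.path.join(LABEL_DIR, label_name), encoding="utf-8") as f:
        label = f.read().strip()
    image_path = os.path.join(IMAGE_DIR, base + ".png")
    if not os.path.exists(image_path):
        print(f"{base}: image missing")
        continue
    width, _ = Image.open(image_path).size
    time_steps = width // DOWNSAMPLE
    # CTC needs one output frame per character, plus a blank frame
    # between every pair of repeated characters.
    required = len(label) + sum(a == b for a, b in zip(label, label[1:]))
    if time_steps < required:
        print(f"{base}: only {time_steps} time steps for a label that "
              f"needs at least {required} ({label!r})")
```

Any file reported here is a good candidate for the batches that trigger the warning.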
@gofortargets How did you generate the dataset for training?
From here: https://github.com/Belval/TextRecognitionDataGenerator
How do I stop training? At iteration 8790, loss=0.55423099 ler=0.15204727 dt=4.05446383s has been reached. When will training stop, or should I stop it manually and use that checkpoint for prediction?
@srikanthsampathi Three options:

- `--max_iters 10000` stops training after a fixed number of iterations.
- `--validation VALIDATION_DATASET` enables early stopping based on a validation set.
- Stop training manually and use the latest checkpoint for prediction.
Hi,
When training Calamari on my dataset, I get this error: `tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.`
Can you help me? Thank you!