artdevgame / tesseract-trainer

Containerised version of tesseract v4 tools required for training a new font

Failing to train for Japanese #8

Open teo-benavides opened 1 year ago

teo-benavides commented 1 year ago

I'm trying to use this program to train a model for a Japanese font. I had to download the .traineddata files manually from here and copy them from my Windows file system into WSL; I don't know why they weren't downloaded automatically. The program apparently recognized them, but I still end up with an error. Here's the program log:

[+] Running 1/0
 ✔ Container train-ocr  Created                                                                                    0.0s
Attaching to train-ocr
train-ocr  | Fetching updates from the langdata_lstm repo (be patient, you'll see a message when done)
train-ocr  | From https://github.com/tesseract-ocr/langdata_lstm
train-ocr  |  * branch            HEAD       -> FETCH_HEAD
train-ocr  | Already up to date.
train-ocr  | Fetch complete
train-ocr  |
train-ocr  | === Starting training for language 'jpn'
train-ocr  |
train-ocr  | [Mon 28 Aug 16:58:16 UTC 2023] /usr/local/bin/text2image --fonts_dir=/app/src/fonts --ptsize 12 --font=Silver Medium --outputbase=/tmp/font_tmp.DIHRtH0rjk/sample_text.txt --text=/tmp/font_tmp.DIHRtH0rjk/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.DIHRtH0rjk
train-ocr  | Rendered page 0 to file /tmp/font_tmp.DIHRtH0rjk/sample_text.txt.tif
train-ocr  |
train-ocr  | === Phase I: Generating training images ===
train-ocr  |
train-ocr  | Rendering using Silver Medium
train-ocr  | [Mon 28 Aug 16:58:18 UTC 2023] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.DIHRtH0rjk --fonts_dir=/app/src/fonts --strip_unrenderable_words --leading=32 --xsize=3600 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0 --max_pages=10 --font=Silver Medium --ptsize 12 --text=/app/src/langdata_lstm/jpn/jpn.training_text
train-ocr  | Rendered page 0 to file /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.tif
train-ocr  | Rendered page 1 to file /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.tif
train-ocr  | Rendered page 2 to file /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.tif
train-ocr  | Rendered page 3 to file /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.tif
train-ocr  | Rendered page 4 to file /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.tif
train-ocr  | Rendered page 5 to file /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.tif
train-ocr  | Rendered page 6 to file /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.tif
train-ocr  | Rendered page 7 to file /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.tif
train-ocr  | Rendered page 8 to file /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.tif
train-ocr  | Rendered page 9 to file /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.tif
train-ocr  |
train-ocr  | === Phase UP: Generating unicharset and unichar properties files ===
train-ocr  |
train-ocr  | [Mon 28 Aug 16:58:23 UTC 2023] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/jpn-2023-08-28.m3L/jpn.unicharset --norm_mode 1 /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.box
train-ocr  | Extracting unicharset from box file /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.box
train-ocr  | Mirror 《 of 》 is not in unicharset
train-ocr  | Wrote unicharset file /tmp/jpn-2023-08-28.m3L/jpn.unicharset
train-ocr  | [Mon 28 Aug 16:58:24 UTC 2023] /usr/local/bin/set_unicharset_properties -U /tmp/jpn-2023-08-28.m3L/jpn.unicharset -O /tmp/jpn-2023-08-28.m3L/jpn.unicharset -X /tmp/jpn-2023-08-28.m3L/jpn.xheights --script_dir=/app/src/langdata_lstm
train-ocr  | Loaded unicharset of size 2414 from file /tmp/jpn-2023-08-28.m3L/jpn.unicharset
train-ocr  | Setting unichar properties
train-ocr  | Mirror 《 of 》 is not in unicharset
train-ocr  | Setting script properties
train-ocr  | Writing unicharset to file /tmp/jpn-2023-08-28.m3L/jpn.unicharset
train-ocr  |
train-ocr  | === Phase E: Generating lstmf files ===
train-ocr  |
train-ocr  | Using TESSDATA_PREFIX=/usr/local/share/tessdata
train-ocr  | [Mon 28 Aug 16:58:24 UTC 2023] /usr/local/bin/tesseract /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.tif /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0 --psm 6 lstm.train /app/src/langdata_lstm/jpn/jpn.config
train-ocr  | Error opening data file /usr/local/share/tessdata/jpn_vert.traineddata
train-ocr  | Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
train-ocr  | Failed loading language 'jpn_vert'
train-ocr  | Tesseract Open Source OCR Engine v4.1.3 with Leptonica
train-ocr  | Page 1
train-ocr  | Page 2
train-ocr  | Loaded 52/52 lines (1-52) of document /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.lstmf
train-ocr  | Page 3
train-ocr  | Loaded 104/104 lines (1-104) of document /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.lstmf
train-ocr  | Page 4
train-ocr  | Loaded 156/156 lines (1-156) of document /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.lstmf
train-ocr  | Page 5
train-ocr  | Loaded 208/208 lines (1-208) of document /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.lstmf
train-ocr  | Page 6
train-ocr  | Loaded 260/260 lines (1-260) of document /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.lstmf
train-ocr  | Page 7
train-ocr  | Loaded 312/312 lines (1-312) of document /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.lstmf
train-ocr  | Page 8
train-ocr  | Loaded 364/364 lines (1-364) of document /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.lstmf
train-ocr  | Page 9
train-ocr  | Loaded 416/416 lines (1-416) of document /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.lstmf
train-ocr  | Page 10
train-ocr  | Loaded 468/468 lines (1-468) of document /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.lstmf
train-ocr  |
train-ocr  | === Constructing LSTM training data ===
train-ocr  |
train-ocr  | [Mon 28 Aug 16:58:31 UTC 2023] /usr/local/bin/combine_lang_model --input_unicharset /tmp/jpn-2023-08-28.m3L/jpn.unicharset --script_dir /app/src/langdata_lstm --words /app/src/langdata_lstm/jpn/jpn.wordlist --numbers /app/src/langdata_lstm/jpn/jpn.numbers --puncs /app/src/langdata_lstm/jpn/jpn.punc --output_dir /app/src/train --lang jpn
train-ocr  | Loaded unicharset of size 2414 from file /tmp/jpn-2023-08-28.m3L/jpn.unicharset
train-ocr  | Setting unichar properties
train-ocr  | Mirror 《 of 》 is not in unicharset
train-ocr  | Setting script properties
train-ocr  | Config file is optional, continuing...
train-ocr  | Null char=2
train-ocr  | Reducing Trie to SquishedDawg
train-ocr  | Reducing Trie to SquishedDawg
train-ocr  | Reducing Trie to SquishedDawg
train-ocr  |
train-ocr  | === Saving box/tiff pairs for training data ===
train-ocr  |
train-ocr  | Moving /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.box to /app/src/train
train-ocr  | Moving /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.tif to /app/src/train
train-ocr  |
train-ocr  | === Moving lstmf files for training data ===
train-ocr  |
train-ocr  | Moving /tmp/jpn-2023-08-28.m3L/jpn.Silver_Medium.exp0.lstmf to /app/src/train
train-ocr  |
train-ocr  | Created starter traineddata for LSTM training of language 'jpn'
train-ocr  |
train-ocr  |
train-ocr  |
train-ocr  | Run 'lstmtraining' command to continue LSTM training for language 'jpn'
train-ocr  |
train-ocr  |
train-ocr  | Failed to read /usr/local/share/tessdata/jpn.traineddata
train-ocr  | /app/src/train/jpn.lstm is not a recognition model, trying training checkpoint...
train-ocr  | Failed to load language model from /usr/local/share/tessdata/jpn.traineddata!
train-ocr  | mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110
train-ocr  | Illegal instruction
train-ocr  | mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110
train-ocr  | Illegal instruction
train-ocr  | /app/src/output/_checkpoint is not a recognition model, trying training checkpoint...
train-ocr  | Failed to load language model from /usr/local/share/tessdata/jpn.traineddata!
train-ocr exited with code 1
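
Both load failures in the log above name files under /usr/local/share/tessdata (jpn_vert.traineddata and jpn.traineddata). Below is a minimal sketch for checking whether those files are present, non-empty, and readable at that path. The function name `check_traineddata` and the status strings are hypothetical, and the sketch assumes it is run inside the container, where `TESSDATA_PREFIX` points at the directory shown in the log. It does not validate file contents, so an "ok" result only rules out a missing, empty, or unreadable file.

```python
import os
from pathlib import Path


def check_traineddata(tessdata_dir, langs):
    """Return {lang: status}; status is 'ok', 'missing', 'empty', or 'unreadable'.

    Hypothetical helper: it only inspects the file system and does not
    verify that a file is a valid Tesseract model.
    """
    results = {}
    for lang in langs:
        path = Path(tessdata_dir) / f"{lang}.traineddata"
        if not path.is_file():
            results[lang] = "missing"
        elif path.stat().st_size == 0:
            results[lang] = "empty"
        elif not os.access(path, os.R_OK):
            results[lang] = "unreadable"
        else:
            results[lang] = "ok"
    return results


if __name__ == "__main__":
    # Assumption: the default mirrors the TESSDATA_PREFIX printed in the log.
    prefix = os.environ.get("TESSDATA_PREFIX", "/usr/local/share/tessdata")
    for lang, status in check_traineddata(prefix, ["jpn", "jpn_vert"]).items():
        print(f"{lang}.traineddata: {status}")
```

If a file is reported as "ok" but Tesseract still fails to read it, the file itself may be incomplete or not a real model file, which this check cannot detect.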
AhmadHakami commented 10 months ago

Same problem here.

@artdevgame @Malkiz223 Could you please look into what causes this error and how it can be fixed?

teo-benavides commented 10 months ago

> Same problem here.
>
> @artdevgame @Malkiz223 Could you please look into what causes this error and how it can be fixed?

For what it's worth, I took some notes on how to train a .traineddata file for a specific font. I've since forgotten how it all works and what everything means, so I can't answer any questions; you'll probably have to figure some things out yourself. Link: https://clear-freighter-3eb.notion.site/How-to-create-a-traineddata-file-for-a-specific-Japanese-font-40b7be9822d0421dad0c79cf031ab40c?pvs=4