amitdo / text2tif-2016

A fork of Tesseract's text2image program

Testing text2tif on Windows with Cygwin #2

Open amitdo opened 8 years ago

amitdo commented 8 years ago

Someone needs to test it...

amitdo commented 8 years ago

It is necessary to build Leptonica and also to install the development packages for Pango, Cairo, etc., with C++ bindings.

The dependencies are the same as Tesseract's training tools (Tesseract itself is not needed).

When you post output, please select the output block with the mouse or keyboard and then press the 'insert code' button above the comment's text editing area.

Shreeshrii commented 8 years ago

Thanks Amit for the tip regarding 'insert code'.

There is one error.

In file included from /usr/include/stdlib.h:11:0,
                 from ./training/pango_font_info.cpp:30:
/usr/include/string.h:76:7: error: conflicting declaration of ‘char* strcasestr(const char*, const char*)’ with ‘C’ linkage
 char *_EXFUN(strcasestr,(const char *, const char *));
       ^

Shreeshrii commented 8 years ago

Compiled ok.

ra@Shree ~/tesseract-ocr/text2tif
$ ./text2tif
Text file missing!
!FLAGS_text.empty():Error:Assert failed:in file ./training/text2image.cpp, line 427
Segmentation fault (core dumped)

amitdo commented 8 years ago

I pushed a new commit, please check that it did not break anything.

Shreeshrii commented 8 years ago

compiled ok.

Please see Issue https://github.com/amitdo/text2tif/issues/5

ra@Shree ~/tesseract-ocr/text2tif
$ ./text2tif --fonts_dir= --list_available_fonts

(process:8744): Pango-CRITICAL **: pango_font_description_set_size: assertion 'size >= 0' failed
  0: 8514fix

(process:8744): Pango-CRITICAL **: pango_font_description_set_size: assertion 'size >= 0' failed
  1: 8514fix Bold

It does list the fonts, but the Pango messages are interleaved with the list.
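Since the Pango messages and the numbered font entries arrive mixed together, one way to get a clean list is to keep only the lines shaped like `N: Font Name`. This is a minimal sketch that assumes the listing format shown above; `filter_font_list` is a hypothetical helper name, not part of text2tif.

```shell
# filter_font_list: hypothetical helper, not part of text2tif.
# Reads a mixed listing on stdin and prints only the font names
# taken from lines of the form "  0: 8514fix".
filter_font_list() {
  grep -E '^[[:space:]]*[0-9]+: ' | sed -E 's/^[[:space:]]*[0-9]+: //'
}
```

It could be used as `./text2tif --fonts_dir= --list_available_fonts 2>&1 | filter_font_list`. Merging stderr into stdout with `2>&1` is deliberate, so the filter works whichever stream the font entries are written to.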

amitdo commented 8 years ago

1. Did these messages appear with the previous commit?
2. Do these messages appear with Tesseract?

Shreeshrii commented 8 years ago

  1. I have not tested this with the previous commit. If you let me know the commands to roll back and compile again, I can test that.
  2. I tested with text2image (Tesseract) just now. Yes, these messages appear there too. Both text2image and text2tif show the same number of fonts.

I think I also installed the Pango debug info on Cygwin; that may be what produces the extra messages.

Shreeshrii commented 8 years ago

ra@Shree ~/tesseract-ocr/text2tif
$ ./text2tif --list_available_fonts
FcInitiReinitialize failed!!
Segmentation fault (core dumped)

Shreeshrii commented 8 years ago

$ ./text2tif --list_available_fonts --fonts_dir=

fonts-list.txt

amitdo commented 8 years ago

Because these messages also appear when you run Tesseract, retesting the previous commit is not needed.

Shreeshrii commented 8 years ago

ra@Shree ~/tesseract-ocr/text2tif
$ ./text2tif --fonts_dir= --text ../langdata/ara/ara.training_text --font FreeSerif --outputbase ara.FreeSerif.exp0
Could not find font named FreeSerif. Pango suggested font DejaVu Serif
Please correct --font arg.:Error:Assert failed:in file ./training/text2image.cpp, line 437
Segmentation fault (core dumped)

Shreeshrii commented 8 years ago

It works if all the arguments are given correctly.

ra@Shree ~/tesseract-ocr/text2tif
$ ./text2tif --fonts_dir= --text ../langdata/san/san.training_text --font Kokila  --outputbase san.Kokila.exp0
Rendered page 0 to file san.Kokila.exp0.tif
Rendered page 1 to file san.Kokila.exp0.tif
Rendered page 2 to file san.Kokila.exp0.tif
Rendered page 3 to file san.Kokila.exp0.tif
Rendered page 4 to file san.Kokila.exp0.tif
Rendered page 5 to file san.Kokila.exp0.tif
Rendered page 6 to file san.Kokila.exp0.tif
Rendered page 7 to file san.Kokila.exp0.tif
Rendered page 8 to file san.Kokila.exp0.tif
Rendered page 9 to file san.Kokila.exp0.tif
Rtl = 0 ,vertical=0

ra@Shree ~/tesseract-ocr/text2tif
$ ./text2tif --fonts_dir= --text ../langdata/eng/eng.training_text --font Arial --outputbase eng.Arial.exp0
Rendered page 0 to file eng.Arial.exp0.tif
Rendered page 1 to file eng.Arial.exp0.tif
Rtl = 0 ,vertical=0

ra@Shree ~/tesseract-ocr/text2tif
$ ./text2tif --fonts_dir= --text ../langdata/ara/ara.training_text --font Arial  --outputbase ara.Arial.exp0
Rendered page 0 to file ara.Arial.exp0.tif
Rendered page 1 to file ara.Arial.exp0.tif
Rtl = 1 ,vertical=0

Shreeshrii commented 8 years ago

ra@Shree ~/tesseract-ocr/text2tif
$ ./text2tif --fontconfig_refresh_cache
Text file missing!
!FLAGS_text.empty():Error:Assert failed:in file ./training/text2image.cpp, line 427
Segmentation fault (core dumped)

ra@Shree ~/tesseract-ocr/text2tif
$ ./text2tif --fonts_dir= --text ../langdata/san/san.training_text --fontconfig_refresh_cache
Output file missing!
!FLAGS_outputbase.empty():Error:Assert failed:in file ./training/text2image.cpp, line 428
Segmentation fault (core dumped)
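
Both runs above stop on a failed assertion because a required flag is missing. A small pre-flight check in the calling script could report the problem before text2tif is started. This is only a sketch: `check_render_args` is a hypothetical helper, and the test that the text file exists is an extra assumption, not something the tool is known to check this way. It applies to rendering runs only, not to `--list_available_fonts`.

```shell
# check_render_args: hypothetical pre-flight check, not part of text2tif.
# Succeeds only if --text and --outputbase are both given, either as
# "--flag value" or "--flag=value", and the text file exists.
check_render_args() {
  local text="" outputbase="" prev="" arg
  for arg in "$@"; do
    # Value following a bare "--text" or "--outputbase".
    case "$prev" in
      --text) text="$arg" ;;
      --outputbase) outputbase="$arg" ;;
    esac
    # Value attached with "=".
    case "$arg" in
      --text=*) text="${arg#--text=}" ;;
      --outputbase=*) outputbase="${arg#--outputbase=}" ;;
    esac
    prev="$arg"
  done
  if [ -z "$text" ]; then echo "missing --text" >&2; return 1; fi
  if [ ! -f "$text" ]; then echo "text file not found: $text" >&2; return 1; fi
  if [ -z "$outputbase" ]; then echo "missing --outputbase" >&2; return 1; fi
  return 0
}
```

A script could then run `check_render_args "$@" && ./text2tif "$@"`, so a missing flag produces a short message instead of a core dump.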
Shreeshrii commented 8 years ago

Please see Issue https://github.com/amitdo/text2tif/issues/6

Shreeshrii commented 8 years ago

ra@Shree ~/tesseract-ocr/text2tif
$ ./text2tif --fonts_dir= --text ../langdata/san/san.training_text --fontconfig_refresh_cache --outputbase san.Kokila.exp0
Stripped 2226 unrenderable words
Error in boxaGetExtent: boxa not defined
Error in boxaAddBox: box not defined
Rendered page 0 to file san.Kokila.exp0.tif
Stripped 2148 unrenderable words
Error in boxaGetExtent: boxa not defined
Error in boxaAddBox: box not defined
Rendered page 1 to file san.Kokila.exp0.tif
Stripped 2173 unrenderable words
Error in boxaGetExtent: boxa not defined
Error in boxaAddBox: box not defined
Rendered page 2 to file san.Kokila.exp0.tif
Stripped 1844 unrenderable words
Rendered page 3 to file san.Kokila.exp0.tif
Stripped 2603 unrenderable words
Rendered page 4 to file san.Kokila.exp0.tif
Stripped 1760 unrenderable words
Error in boxaGetExtent: boxa not defined
Error in boxaAddBox: box not defined
Rendered page 5 to file san.Kokila.exp0.tif
Rtl = 0 ,vertical=0

If no font is specified, the default font, Arial, is used. If that font does not cover the script, the tif file will be blank.
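
Because a font without coverage yields a blank image rather than an error, it may help to check coverage before rendering. The sketch below assumes fontconfig's `fc-list` is installed and prints one entry per line, possibly with several comma-separated family names; `font_in_list` is a hypothetical helper, not part of text2tif or fontconfig.

```shell
# font_in_list: hypothetical helper, not part of text2tif or fontconfig.
# $1 is a font family name. stdin is a list of families, one entry per
# line, where an entry may hold several comma-separated names.
# Succeeds if the name equals one family exactly (case-insensitive).
font_in_list() {
  tr ',' '\n' | grep -ixF -- "$1" >/dev/null
}
```

For example, `fc-list :lang=sa family | font_in_list "Arial" || echo "Arial does not report Sanskrit coverage"`. Coverage reported by fontconfig does not guarantee that the script is shaped correctly, so a visual check of the output is still needed.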

Shreeshrii commented 8 years ago

To find the available fonts for a particular script or language, e.g. ta for Tamil:

$ fc-list :lang=ta -f "%{file}\n%{family}\n%{style}\n\n"

/usr/share/fonts/win-fonts/Nirmala.ttf
Nirmala UI
Regular,Normal,obyčejné,Standard,Κανονικά,Normaali,Normál,Normale,Standaard,Normalny,Обычный,Normálne,Navadno,Arrunta

/usr/share/fonts/unifont/unifont.ttf
Unifont
Medium

/usr/share/fonts/win-fonts/NirmalaS.ttf
Nirmala UI,Nirmala UI Semilight
Semilight,Normal,obyčejné,Standard,Κανονικά,Regular,Normaali,Normál,Normale,Standaard,Normalny,Обычный,Normálne,Navadno,Arrunta

/usr/share/fonts/win-fonts/NirmalaB.ttf
Nirmala UI
Bold,Negreta,tučné,fed,Fett,Έντονα,Negrita,Lihavoitu,Gras,Félkövér,Grassetto,Vet,Halvfet,Pogrubiony,Negrito,Полужирный,Fet,Kalın,Krepko,Lodia

/usr/share/fonts/lohit-tamil/Lohit-Tamil.ttf
Lohit Tamil
Regular

/usr/share/fonts/lohit-tamil-classical/Lohit-Tamil-Classical.ttf
Lohit Tamil Classical
Regular

amitdo commented 8 years ago

fc-list :lang=en -f "%{family[0]} %{style[0]}\n" | sort -u > en-fonts-list

Shreeshrii commented 8 years ago

We cannot use ALL fonts for a particular language, as some of them may not render correctly, especially for Devanagari and similar scripts.

However, such a list can be useful for fixing the language-specific.sh file so that it lists only available fonts.
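
One way to use such a list is to filter a set of candidate font names down to those present on the machine. This is a sketch under stated assumptions: `keep_available_fonts` is a hypothetical helper, the list file holds one font name per line, and the candidate names are written in the same form as the lines in that file.

```shell
# keep_available_fonts: hypothetical helper, not part of Tesseract.
# $1 is a file with one available font name per line (for example a
# list produced with fc-list). The remaining arguments are candidate
# font names. Prints each candidate that matches a whole line of the
# file, ignoring case.
keep_available_fonts() {
  local list_file="$1"
  shift
  local font
  for font in "$@"; do
    if grep -ixF -- "$font" "$list_file" >/dev/null; then
      printf '%s\n' "$font"
    fi
  done
}
```

For example, `keep_available_fonts en-fonts-list "Arial Regular" "FreeSerif Regular"` would print only the candidates found in the list. The surviving names still need a rendering check before they are used for training.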

Sample of incorrect rendering for devanagari:

[Image: san exp-1 unifont_medium]