OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

recognize: use lstm_choice_mode=2 for textequiv_level=glyph #110

Closed: bertsky closed this pull request 4 years ago

bertsky commented 4 years ago

Fixes #7.

This does appear to work and not cause any more segfaults with recent Tesseract. At last!

codecov[bot] commented 4 years ago

Codecov Report

Merging #110 into master will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #110   +/-   ##
=======================================
  Coverage   39.35%   39.35%           
=======================================
  Files           9        9           
  Lines         897      897           
  Branches      191      191           
=======================================
  Hits          353      353           
  Misses        492      492           
  Partials       52       52

Last update 95ba324...21f7386.

bertsky commented 4 years ago

> This does appear to work and not cause any more segfaults with recent Tesseract.

Sorry, I was wrong: it still crashes!

bertsky commented 4 years ago

> Sorry, I was wrong: it still crashes!

@stweil, here is a backtrace:

02:49:05.188 DEBUG processor.TesserocrRecognize - Decoding text in glyph 'region0008_line0000_word0002_glyph0000'

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007ffff3c0f6b6 in tesseract::ChoiceIterator::ChoiceIterator (this=0x179b6290, result_it=...) at ../src/ccmain/ltrresultiterator.cpp:389
389     if (strcmp(word_res_->CTC_symbol_choices[0][0].first, " ")) {
(gdb) fr 0
(gdb) p word_res_->CTC_symbol_choices 
$2 = std::vector of length 3, capacity 4 = {std::vector of length 0, capacity 0, std::vector of length 2, capacity 2 = {{
      first = 0x12f9ad70 "S", second = 3.5559907}, {first = 0x12f9bdf8 "Y", second = 6.70712423}}, 
  std::vector of length 16, capacity 16 = {{first = 0x12f9bdf8 "Y", second = 1.94020641}, {first = 0x12f9ad70 "S", second = 4.29620028}, {
      first = 0x12f9ae28 "O", second = 19.4803467}, {first = 0x12f9a9d8 "E", second = 21.8223343}, {first = 0x12f9b838 "P", 
      second = 22.4020309}, {first = 0x12f9b610 "T", second = 21.7317104}, {first = 0x12f9aa90 "F", second = 22.6467323}, {
      first = 0x12f9b108 "G", second = 22.7008057}, {first = 0x12f9beb0 "Z", second = 23.6807079}, {first = 0x12f9a7b0 "N", 
      second = 23.8459435}, {first = 0x12f9b6c8 "I", second = 24.8425007}, {first = 0x12f9a868 "D", second = 24.6850224}, {
      first = 0x12f9b558 "M", second = 27.1605015}, {first = 0x12f9cf38 "]", second = 28.5629539}, {first = 0x12f9a6f8 "A", 
      second = 28.6672821}, {first = 0x12f9a640 "W", second = 29.0376415}}}

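So it appears that the choice list for the first glyph is empty (the first inner vector has length 0), which makes the `[0][0]` access at ltrresultiterator.cpp:389 invalid. Does that suffice for you?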
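To make the failure mode concrete, here is a minimal standalone sketch of the access pattern from the backtrace. It is not Tesseract's code and not the actual fix: the alias `SymbolChoices` and the function `first_choice_is_non_space` are hypothetical names, and the container type is only an assumption based on what gdb printed (a vector of vectors of symbol/score pairs). It shows why indexing `[0][0]` needs a guard when the first inner vector can be empty.

```cpp
#include <cassert>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical stand-in for the structure shown in the gdb dump:
// one inner vector of (symbol, score) pairs per position.
using SymbolChoices = std::vector<std::vector<std::pair<const char*, float>>>;

// Guarded version of the check `strcmp(choices[0][0].first, " ")`.
// Without the emptiness test, choices[0][0] is undefined behaviour
// whenever choices[0] has length 0, as in the dump above.
bool first_choice_is_non_space(const SymbolChoices& choices) {
  if (choices.empty() || choices[0].empty()) {
    return false;  // no first choice to inspect
  }
  // Precondition of strcmp: the symbol pointer must not be null.
  assert(choices[0][0].first != nullptr);
  return std::strcmp(choices[0][0].first, " ") != 0;
}
```

With data shaped like the dump (an empty first entry followed by the "S"/"Y" alternatives), the guarded function returns `false` instead of reading past the end of an empty vector. How the caller should treat that case is a design decision this sketch leaves open.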

I think it will be difficult to share my whole workspace. This happened with frk+Fraktur in the first glyph of the third word of the following line: DEWARPED8-CLIP-LINEIMG_0005_region0008_region0008_line0000 bin

bertsky commented 4 years ago

I have a fix.

So I think this can be merged independently. (We just have to make sure that the PyPI release happens no earlier than the rebuild of the merged fix on alex-ppa, and that ocrd_all does not update ocrd_tesserocr before it updates tesseract.)

stweil commented 4 years ago

Thanks. The Tesseract fix is merged now.

kba commented 4 years ago

Perfect, thanks for the quick fix. I'll release a new version once the current nightly is in the PPA.