JohnWang0512 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

pdfrenderer.cpp int word_x1, word_y1, word_x2, word_y2; #1224

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. tesseract phototest.tif phototest pdf

[Windows platform (compiled using Visual Studio 2012)]

What is the expected output? 
OCR Text lined up with phototest.tif original image.

What do you see instead?
OCR Text is not available at all.

Please use labels and text to provide additional information.

Issue relates to pdfrenderer.cpp code

https://code.google.com/p/tesseract-ocr/source/browse/trunk/api/pdfrenderer.cpp?r=1042

Some improvement if line 113 is updated from:

int word_x1, word_y1, word_x2, word_y2;

to:

int word_x1 = 0;
int word_y1 = 0;
int word_x2 = 0;
int word_y2 = 0;

Original issue reported on code.google.com by supp...@darkblueduck.com on 4 Jun 2014 at 1:18

GoogleCodeExporter commented 9 years ago
If the line_x1, line_y1, line_x2 and line_y2 variable declarations are moved 
above the while (!res_it->Empty(RIL_BLOCK)) { line (83) the code seems to work 
well.

  int line_x1 = 0;//-858993460;
  int line_y1 = 0;//-858993460;
  int line_x2 = 0;//-858993460;
  int line_y2 = 0;//-858993460;

  while (!res_it->Empty(RIL_BLOCK)) {
    if (res_it->IsAtBeginningOf(RIL_BLOCK)) {
      pdf_str += "BT\n3 Tr\n";  // Begin text object, use invisible ink
      old_pointsize = 0.0;      // Every block will declare its font
    }

Original comment by supp...@darkblueduck.com on 4 Jun 2014 at 1:32
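
A note on the commented-out value in the snippet above: -858993460 is the bit 
pattern 0xCCCCCCCC read as a signed 32-bit integer. MSVC debug builds with 
runtime checks fill uninitialized stack memory with 0xCC bytes, so seeing that 
value in the debugger suggests the variable was read before anything was 
written to it. A minimal sketch of that check follows; the helper names are 
hypothetical and not part of Tesseract.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of Tesseract): returns true if `value` has
// the 0xCCCCCCCC bit pattern that MSVC debug builds write into
// uninitialized stack variables.
constexpr bool LooksLikeMsvcDebugFill(std::int32_t value) {
  // Signed-to-unsigned conversion is well defined (modulo 2^32).
  return static_cast<std::uint32_t>(value) == 0xCCCCCCCCu;
}

// Small self-check using the value from the comment above.
inline void CheckDebugFillExample() {
  assert(LooksLikeMsvcDebugFill(-858993460));
  assert(!LooksLikeMsvcDebugFill(0));
}
```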

GoogleCodeExporter commented 9 years ago
I haven't tested, but that sounds much more reasonable to me.

I'm more of a C programmer than a C++ programmer, so perhaps this is some part 
of the standard I'm not familiar with, but redeclaring variables inside loops 
on each iteration looks really weird to me. Is there some reason for it? If not 
I'd vote for moving them out near the top of the function, and explicitly 
zeroing them on each iteration if that's what's needed.

Original comment by nick.wh...@durham.ac.uk on 4 Jun 2014 at 4:31
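
On the question above: declaring a variable inside a loop body is ordinary C++ 
(and C99). Each iteration gets a fresh object, and its value is indeterminate 
unless it is initialized. Moving the declaration outside the loop changes the 
behaviour in one way: a value written in an earlier iteration survives into 
the next one. The sketch below illustrates the difference with a hypothetical 
out-parameter function, TryDouble, that stands in for a call which may fail 
without writing its output; none of these names come from pdfrenderer.cpp.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an out-parameter API: writes *out only for
// non-negative inputs and reports whether it did so.
bool TryDouble(int in, int* out) {
  assert(out != nullptr);
  if (in < 0) return false;
  *out = 2 * in;
  return true;
}

// Variable declared outside the loop: a failed call leaves the value from
// the previous iteration in place.
std::vector<int> CollectWithOuterVariable(const std::vector<int>& inputs) {
  std::vector<int> results;
  int value = 0;
  for (int in : inputs) {
    TryDouble(in, &value);  // return value ignored on purpose
    results.push_back(value);
  }
  return results;
}

// Variable declared and zeroed inside the loop: every iteration starts
// from a known state, so a failed call yields 0.
std::vector<int> CollectWithLoopLocal(const std::vector<int>& inputs) {
  std::vector<int> results;
  for (int in : inputs) {
    int value = 0;
    TryDouble(in, &value);  // return value ignored on purpose
    results.push_back(value);
  }
  return results;
}
```

With inputs {3, -1}, the first function records 6 twice (the stale value), 
while the second records 6 and then 0. Neither version detects the failure.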

GoogleCodeExporter commented 9 years ago
I compiled the current tesseract code (r1117) with VS2010 and cannot reproduce 
the issue. I created phototest.txt by copying and pasting from phototest.pdf 
(opened in Adobe Reader XI version 11.0.07.79).

IMO the current code can cause a problem only if the call on the next line:
 res_it->Baseline(RIL_WORD, &word_x1, &word_y1, &word_x2, &word_y2);
does not work correctly. Initializing the local variables with zeros will only 
hide the problem, and moving the declarations outside the loop will not help 
either: the values from the previous iteration will be used for the next word.

So the proposed solutions do not fix the real problem (if there is any), 
because the problem would have to be in the output of res_it->Baseline under 
VS2012.

Original comment by zde...@gmail.com on 22 Jun 2014 at 8:20
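
One way to make the point above concrete: rather than zero-initializing or 
hoisting the variables, the caller can check whether the baseline call 
succeeded at all. As far as I recall, Baseline in the Tesseract iterator API 
returns a bool, but the sketch below does not depend on that. It uses a 
hypothetical FakeWordIterator in place of the real iterator, and the 
coordinates it returns are made up for illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for res_it->Baseline(RIL_WORD, ...): fills the four
// coordinates and returns true, or returns false and leaves them untouched.
struct FakeWordIterator {
  bool has_baseline;

  bool Baseline(int* x1, int* y1, int* x2, int* y2) const {
    assert(x1 && y1 && x2 && y2);
    if (!has_baseline) return false;
    *x1 = 10;   // illustrative values only
    *y1 = 20;
    *x2 = 110;
    *y2 = 22;
    return true;
  }
};

// Result that carries the success flag along with the coordinates.
struct WordBaseline {
  bool valid;
  int x1, y1, x2, y2;
};

// Zero-initializes the coordinates AND records whether the call succeeded,
// so a failure is visible to the caller instead of being masked by zeros.
WordBaseline ReadBaseline(const FakeWordIterator& it) {
  WordBaseline b = {false, 0, 0, 0, 0};
  b.valid = it.Baseline(&b.x1, &b.y1, &b.x2, &b.y2);
  return b;
}
```

A caller can then skip or log words for which valid is false, which would 
also show whether the baseline call is what misbehaves in a given build.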

GoogleCodeExporter commented 9 years ago
Yesterday, I was caught here by the (VS 2013) debugger, which would not let me 
continue (!), because of uninitialized variables. Zeroing them before first use 
solved this problem.

PS.
What a great project!

Original comment by povprec...@hotmail.com on 19 May 2015 at 9:57

GoogleCodeExporter commented 9 years ago
@povprecnez@hotmail.com: Are you able to reproduce the error with the current 
code (from the repository)? Can you provide a test case (input images, and 
maybe code if you use tesseract as a library)?

Original comment by zde...@gmail.com on 19 May 2015 at 11:06

GoogleCodeExporter commented 9 years ago
zde...@Hotmail.com;

I just checked the latest source of the file "pdfrenderer.cpp" (the file is 
from May 12th 2015). I see there is much newer code in it, which makes my 
comment above useless and out of line. I had an older version. Sorry for the 
confusion!

Original comment by povprec...@hotmail.com on 19 May 2015 at 9:09

GoogleCodeExporter commented 9 years ago
I do not believe that the real problem is uninitialized variables in 
pdfrenderer.cpp (see comment #3). That's why we need a test case that is able 
to reproduce the error...

Original comment by zde...@gmail.com on 20 May 2015 at 6:58