AiPacino / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Invocation such as: tesseract stdin stdout hocr < file.tif > file.html produces HTML file without BeginDocument/EndDocument #1196

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. Pass the input file by 'stdin' and ask for hocr output.

What is the expected output? What do you see instead?

The file starts with:
  <div class='ocr_page' id='page_1' title='image ""; bbox 0 0 2463 3565; ppageno 0'>

rather than the XML preamble. It seems that the Renderer's 
BeginDocument/EndDocument invocations are missing in this case.

What version of the product are you using? On what operating system?

The latest SVN build on OS X.

Please provide any additional information below.

Based on a quick perusal of the code, the issue is that BeginDocument is only 
called on the renderer on the code path that uses ProcessPages(), which requires 
a filename as input. However, when the image is provided via stdin, the method 
called is ProcessPage(), which is given an image that has already been 
read.
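
To illustrate the two code paths described above, here is a minimal sketch. It does not use Tesseract's real classes: SketchRenderer, RenderFromFile, RenderFromStdinBuggy and RenderFromStdinFixed are hypothetical names invented for this example, and the fix shown is only one possible approach, not necessarily what the project did.

```cpp
#include <string>

// Hypothetical stand-in for an hOCR renderer; not Tesseract's actual class.
class SketchRenderer {
 public:
  // Emits the document preamble (shortened here to a single line).
  void BeginDocument() { out_ += "<html><body>\n"; }
  // Emits one page element that carries the image name.
  void AddPage(const std::string& image_name) {
    out_ += "<div class='ocr_page' title='image \"" + image_name + "\"'></div>\n";
  }
  // Emits the closing tags.
  void EndDocument() { out_ += "</body></html>\n"; }
  const std::string& output() const { return out_; }

 private:
  std::string out_;
};

// Filename path: the page is wrapped in BeginDocument/EndDocument.
std::string RenderFromFile(const std::string& filename) {
  SketchRenderer r;
  r.BeginDocument();
  r.AddPage(filename);
  r.EndDocument();
  return r.output();
}

// stdin path as described in this report: only the page is rendered,
// so the output starts directly with the ocr_page div.
std::string RenderFromStdinBuggy() {
  SketchRenderer r;
  r.AddPage("");
  return r.output();
}

// One possible fix (an assumption): wrap the single-page path as well.
std::string RenderFromStdinFixed() {
  SketchRenderer r;
  r.BeginDocument();
  r.AddPage("");
  r.EndDocument();
  return r.output();
}
```

In this sketch the buggy stdin path yields output that begins with the ocr_page div, which matches the symptom reported above, while the other two paths begin with the preamble.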

In addition to this issue, I sometimes get a malformed HTML document when 
passing a filename on the command line. The output appears to be cut off 
at the image name:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>
</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <meta name='ocr-system' content='tesseract 3.03' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
</head>
<body>
  <div class='ocr_page' id='page_1' title='image "??R8? </body>
</html>

I'm getting this kind of output with a file called "R1VRYhtymä_Oy.tif". The 
name matters, though probably because of its length more than 
anything else. My guess is that the HOcrEscape() function returns a pointer to 
memory that has already been freed, since the string() method on STRING seems 
to simply return the underlying pointer, and the STRING instance goes out of 
scope at the end of the function.
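
The suspected lifetime problem can be sketched as follows. This is not Tesseract's code: EscapeForHocr is a hypothetical helper, std::string stands in for Tesseract's STRING, and the set of escaped characters is an assumption. The buggy shape is shown only in a comment, because reading through the returned pointer would be undefined behavior.

```cpp
#include <string>

// Buggy shape, as guessed in this report (kept as a comment on purpose):
//
//   const char* EscapeBuggy(const char* text) {
//     std::string ret = /* build the escaped text */;
//     return ret.c_str();  // ret is destroyed on return; the pointer dangles
//   }

// Safer shape: return the string object by value so the caller owns the
// storage. The escaped characters here are an assumption for illustration.
std::string EscapeForHocr(const std::string& text) {
  std::string ret;
  for (char c : text) {
    switch (c) {
      case '<':  ret += "&lt;";   break;
      case '>':  ret += "&gt;";   break;
      case '&':  ret += "&amp;";  break;
      case '"':  ret += "&quot;"; break;
      case '\'': ret += "&#39;";  break;
      default:   ret += c;        break;
    }
  }
  return ret;
}
```

With the by-value version, the caller can keep the result in a local variable for as long as the title attribute is being written, so the image name no longer depends on memory owned by a destroyed temporary.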

Original issue reported on code.google.com by alank...@bel.fi on 11 May 2014 at 10:56

GoogleCodeExporter commented 9 years ago
Thanks. Fixed in r1099.

Original comment by zde...@gmail.com on 11 May 2014 at 4:00