Extra space added in the Table data when Converting from Image to Text

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Convert the image file to text with the Tesseract OCR conversion tool
2. The table data is converted as shown in the attached text file
3. Open the text file in EditPlus to see the issue
4. Compare the lines highlighted in red in the image file with the
corresponding lines in the output text file

What is the expected output? What do you see instead?
The data should be properly aligned under each column heading. It should not
have extra spaces that shift the data away from its column.

What version of the product are you using? On what operating system?
Java wrapper (J4L OCR) for the Tesseract OCR engine with Apache version 2.0 on 
Windows OS

Please provide any additional information below.
Because of this misalignment in the output text file, the XML conversion does
not come out correctly. This behaviour occurs when one or more columns are
empty or contain special characters (*, #). What algorithm or procedure does
the OCR use to set this extra white space? What is the logic to calculate and
remove it? How does the OCR handle the conversion when the data is in table
format and the table is not completely filled?
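
For the downstream XML step, one possible workaround is to post-process the text output instead of relying on its exact spacing. The following is a minimal sketch in Python, assuming that columns in the OCR output are separated by runs of two or more spaces; that is an assumption about the output file, not documented Tesseract behaviour. `split_columns` and `parse_table` are hypothetical helper names, not part of Tesseract or J4L OCR. Note that this cannot recover an empty cell, because a missing value leaves no token behind, so rows with empty columns will simply come out shorter.

```python
import re

# Hypothetical helpers, not part of Tesseract or J4L OCR.
# Assumption: two or more consecutive spaces mark a column boundary,
# while a single space belongs to the cell text itself.
COLUMN_GAP = re.compile(r" {2,}")


def split_columns(line):
    """Split one line of OCR text into a list of cell strings."""
    stripped = line.strip()
    if not stripped:
        return []
    return COLUMN_GAP.split(stripped)


def parse_table(text):
    """Return a list of rows (lists of cells), skipping blank lines."""
    rows = []
    for line in text.splitlines():
        cells = split_columns(line)
        if cells:
            rows.append(cells)
    return rows
```

A row with fewer cells than the header row can then be flagged for manual review before it is written to XML.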

Original issue reported on code.google.com by gokul...@gmail.com on 12 Mar 2012 at 5:28

Attachments:

Image.tif
[Tessaract-OCR Converted File.txt](https://storage.googleapis.com/google-code-attachments/tesseract-ocr/issue-648/comment-0/Tessaract-OCR Converted File.txt)

GoogleCodeExporter commented 9 years ago

Hi
I am facing the same problem while converting table data in an image to text:
I am getting extra spaces from Tesseract OCR. Is there any solution for this?
Please help me.

Original comment by allam.ro...@gmail.com on 13 Mar 2012 at 4:28

GoogleCodeExporter commented 9 years ago

It would be helpful if you indicated which version of tesseract-ocr was downloaded from svn and is being used, and which Windows OS (e.g. WinXP or Win7) is being used for testing. These particulars are essential to enable project members to give their comments, as deemed fit.

Original comment by withbles...@gmail.com on 14 Mar 2012 at 5:34

GoogleCodeExporter commented 9 years ago

We are using the J4L Java wrapper for the Tesseract OCR engine, version 3.0,
and the testing is done on Windows 7. I just tested this on Windows XP as well
and it shows the same issue. It would be great if I could get a solution for
this.

Original comment by gokul...@gmail.com on 14 Mar 2012 at 10:10

GoogleCodeExporter commented 9 years ago

I have the same problem too. I tried the CLI with the -psm switch set to 6. I
get the fewest errors in this mode, but all the extra spaces are stripped off.
Please let me (us) know if there is a solution (or a hack in the code) for this.

Regards

Original comment by MathewJo...@gmail.com on 20 Mar 2012 at 12:08

GoogleCodeExporter commented 9 years ago

Issue 560 has been merged into this issue.

Original comment by zde...@gmail.com on 24 Jul 2012 at 8:22

GoogleCodeExporter commented 9 years ago

Does anyone know the solution for this error message: "PR0302 - REQUESTED
TESSERACT TABLE DATA WAS NOT FOUND"?

Thanks!

Original comment by christin...@gmail.com on 1 Oct 2014 at 8:58
