baopham1340 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

psm 6 not used for text file when outputting pdf #1365

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run tesseract with the -psm 6 and pdf options.

What is the expected output? What do you see instead?
I expect the text in the txt output and the text layer in the pdf to be the same.
The txt file does not seem to use the psm option.

What version of the product are you using? On what operating system?
Latest version from git, Windows 8, MSYS2.

Please provide any additional information below.

With some Devanagari files, the output from psm 3 and 4 is not accurate, while 
psm 6 works better. 

A txt file is also generated when using the pdf option. However, it does not 
seem to apply psm 6.

Original issue reported on code.google.com by shreeshrii on 31 Oct 2014 at 4:12

GoogleCodeExporter commented 9 years ago
Please provide a test case.

Original comment by zde...@gmail.com on 7 Feb 2015 at 7:17

GoogleCodeExporter commented 9 years ago
now=$(date +"%y%m%d-%H%M")
cd testing
for f in page-019.tif
do
  # Run OCR once per language: Hindi (hin), then Sanskrit (san).
  # A separate variable is used so the LANG locale variable is left alone.
  for lang in hin san
  do
    echo "OCR at $(date) with -l $lang for $f file, please wait..."
    tesseract --tessdata-dir C:/Home/UserShree/tesseract-ocr/testing "$f" "$f-$lang" -l "$lang" -psm 6 pdf
  done
done
------------
The txt and pdf output as well as the input file are attached. When I copy the 
text from the pdf, it is formatted differently from the text in the txt files.
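
A rough way to check whether the two outputs differ only in layout or also in content is sketched below. It is not part of the original test run: normalize_text and same_words are hypothetical helper names, and the usage lines assume that pdftotext (from poppler-utils) is available to extract the text layer from the pdf.

```shell
# Hypothetical helper: collapse every run of whitespace (spaces, tabs,
# line breaks) into a single space, so layout-only differences are ignored.
normalize_text() {
  tr -s '[:space:]' ' ' < "$1" | sed 's/^ //; s/ $//'
}

# Hypothetical helper: succeeds when both files contain the same
# whitespace-separated tokens in the same order.
same_words() {
  [ "$(normalize_text "$1")" = "$(normalize_text "$2")" ]
}

# Usage sketch (assumption: pdftotext is installed; file names follow the
# "$f-$LANG" pattern used in the script above):
#   pdftotext page-019.tif-hin.pdf from-pdf.txt
#   same_words page-019.tif-hin.txt from-pdf.txt && echo "same text" || echo "text differs"
```

If same_words succeeds, the txt file and the pdf text layer differ only in line breaks and spacing. If it fails, the recognized text itself differs, which would point to the two outputs being produced with different settings.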

Original comment by shreeshrii on 17 Feb 2015 at 1:30
