fritz-hh / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

Spell check with aspell #106

Closed witchi closed 8 years ago

witchi commented 9 years ago

Hi,

Nice script! I use it together with another script from http://www.konradvoelkel.com/2013/03/scan-to-pdfa/ and would like to ask: can you enhance your script with a call to aspell?

I have tried it within src/ocrPage.sh on line 198:

# perform spell check (interactive, so aspell needs the terminal on stdin)
[ $VERBOSITY -ge $LOG_DEBUG ] && echo "Page $page: Performing spell check"
! aspell --dont-backup --lang=de_DE --mode=sgml -c "${curHocr}" < /dev/tty   \
        && echo "Could not spell check file \"${curHocr}\". Exiting..." && exit $EXIT_OTHER_ERROR

but it doesn't work with the Gnu-Parallel tool.

Thank you Andre

jbarlow83 commented 9 years ago

I believe the script you mention is an older version of OCRmyPDF. At the least, OCRmyPDF contains all of the ideas in that script and many additions as well.

I can't tell in what sense "it doesn't work with GNU parallel" without more information. Perhaps try keeping the temporary files around (argument -k) and then testing your command on the .hocr file to get it to work standalone. My guess is that something is wrong with redirecting stdin < /dev/tty.

As far as spell check as a feature, Tesseract already does spell check internally. When it produces gibberish, Tesseract could not decide. For example "wrd" could be "word" or more rarely "ward", and if the letter is corrupted then it has no business deciding. It does not use NLP to try to figure out which word would fit grammatically. Spell check will help you filter out gibberish to generate a better list of keywords but it will not extract more information from a bad OCR result. Removes noise, but doesn't add signal. Make sense?
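The "removes noise, but doesn't add signal" point can be illustrated with a minimal sketch. It does not use aspell or Tesseract; the function name `split_known_unknown` and the tiny word set are assumptions made up for this example, standing in for a real dictionary lookup.

```python
import re

def split_known_unknown(text, dictionary):
    """Split OCR words into (known, unknown) by plain dictionary lookup."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    known = [w for w in words if w.lower() in dictionary]
    unknown = [w for w in words if w.lower() not in dictionary]
    return known, unknown

# Tiny stand-in dictionary (hypothetical); a real run would load a word list.
dictionary = {"the", "word", "ward", "is", "here"}

known, unknown = split_known_unknown("The wrd is here", dictionary)
print(known)    # words that can safely be used as keywords
print(unknown)  # "wrd" is flagged, but nothing says whether it was "word" or "ward"
```

The lookup can drop "wrd" from a keyword list, but choosing between "word" and "ward" still needs either context or a human.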

witchi commented 9 years ago

Yes, the script on http://www.konradvoelkel.com/2013/03/scan-to-pdfa/ is an older version of OCRmyPDF, but OCRmyPDF doesn't use a scanner to create an initial PDF. So I have combined both scripts: I get a PNM file from scanimage, convert it to TIFF with scantailor and tiffcp, and then to a PDF with tiff2pdf, which I feed into OCRmyPDF to get a PDF/A.

The call to parallel within OCRmyPDF.sh processes every page of the provided PDF in its own job (and the jobs can run in parallel), but no job has access to a terminal. You can use --tmux to get a terminal, but it is closed before you can use it. aspell needs a terminal to display its internal editor, which lets you correct the words produced by tesseract. The output of tesseract is an SGML-style file (hocr), which aspell can parse and compare against language-specific dictionaries. If aspell finds an unknown word, it suggests similar words from the dictionary, and the user can correct the word manually or replace it with one of the suggestions. The OCR output improves because aspell lets the user decide between "word" and "ward" and stores the decision in the hocr file.

To use aspell, I removed the call to parallel from OCRmyPDF.sh and replaced it with a for-loop that processes all pages of the provided PDF sequentially. With this trick, src/ocrPage.sh is called as a normal shell script (within the loop) and I can use aspell as described above, because the shell script is bound to a terminal instead of a background job queue. The redirection of stdin to /dev/tty was necessary to display aspell's internal editor (without the redirection, aspell returns the error code 255).
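As a rough sketch of that sequential approach, here it is in Python rather than in the actual shell scripts. The aspell flags are the ones from the snippet earlier in this thread; the function names and the `runner` parameter are hypothetical and exist only so the loop can be exercised without a terminal.

```python
import subprocess

def aspell_command(hocr_path, lang="de_DE"):
    """Build the aspell call, with the same flags as the shell snippet above."""
    return ["aspell", "--dont-backup", f"--lang={lang}",
            "--mode=sgml", "-c", str(hocr_path)]

def run_on_tty(cmd):
    """Run cmd with the controlling terminal as stdin (needed by aspell's editor)."""
    with open("/dev/tty") as tty:
        return subprocess.call(cmd, stdin=tty)

def spell_check_pages(hocr_paths, runner=run_on_tty):
    """Check pages strictly one after another; return the paths that failed."""
    failed = []
    for path in hocr_paths:
        # A non-zero exit status (for example 255 without a terminal) is a failure.
        if runner(aspell_command(path)) != 0:
            failed.append(path)
    return failed
```

Calling `spell_check_pages(["0001.hocr", "0002.hocr"])` from an interactive shell would open one aspell session per page, in order. Because nothing runs in the background, each session owns the terminal while it is active.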

I have never used parallel, so I don't know a way to use aspell and parallel together. Therefore I have started this issue.

jbarlow83 commented 9 years ago

I can't see that happening in the current, shell script version of this project (v2.x). It's a desirable feature, but the problem is that the script is currently set up to parallelize tesseract and you need serialized interactive input from /dev/tty. Even if it's technically possible to coordinate access to a shared resource in a shell script, I wouldn't want to go there.

There's a newer Python-based version in my fork that I'm in the process of merging into the mainline. That framework could accommodate interactive prompts a lot more easily. It represents the script as a pipeline instead, and you'd insert a stage into the pipeline that acquires a semaphore and prompts for input. It could even provide a GUI.

If you want to try that, as a very rough sketch you'd write a rule in ocrpage.py that transforms .hocr files:

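import shutil
import subprocess
from multiprocessing import Lock
from ruffus import suffix, transform  # pipeline decorators (assuming Ruffus)

tty_lock = Lock()  # serializes access to the terminal across page jobs

@transform(ocr_tesseract, suffix(".hocr"), ".hocr.checked")
def spell_check_hocr(input_file, output_file):
    # aspell -c edits the file in place, so always work on a copy
    shutil.copy2(input_file, output_file)
    if not spell_check_enabled:  # placeholder for however the option is exposed
        return
    with tty_lock:
        # aspell's editor needs a real terminal on stdin, not a pipe
        with open('/dev/tty') as tty:
            subprocess.call(['aspell', ...], stdin=tty)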

And then change the other dependent rules that involve ".hocr" to look for ".hocr.checked" instead.
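To make the locking idea concrete, here is a small self-contained sketch that runs without aspell or a pipeline framework. It is an illustration under stated assumptions, not the project's code: threads stand in for the per-page jobs, `time.sleep` stands in for both OCR and the interactive session, and the counters exist only to show that the guarded step never overlaps.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

tty_lock = threading.Lock()
active = 0       # workers currently inside the "interactive" step
max_active = 0   # highest value of `active` ever observed

def process_page(page):
    global active, max_active
    time.sleep(0.01)          # stand-in for OCR, free to run in parallel
    with tty_lock:            # only one page may own the terminal at a time
        active += 1
        max_active = max(max_active, active)
        time.sleep(0.01)      # stand-in for the interactive aspell session
        active -= 1
    return f"{page}.hocr.checked"

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_page, ["0001", "0002", "0003", "0004"]))

print(results)
print("max concurrent interactive sessions:", max_active)
```

The OCR stand-in overlaps freely across workers, while the lock forces the interactive part to run one page at a time, which is the property the spell check stage needs.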

hilsonp commented 9 years ago

Just a side note: you say you started a fork with a Python-based script. I'm currently writing ocrmypdf.py, so I hope you were referring to ocrpage.py.

(In reply to jbarlow83's comment of 24 March 2015, 00:47.)

jbarlow83 commented 9 years ago

@zorglups: ocrpage.py is in the "develop" branch of the main repository now. I haven't forgotten that you expressed interest in writing the Python version (issue #94). We should probably discuss and share some ideas; it is probably better to merge earlier rather than later. I wrote some comments in issue #94.