Unwrapping paragraphs - Githubissues

Golddouble commented 2 years ago

Thank you for creating this cool little app. But there is something important I really miss.

When I try to use dpScreenOCR with this picture k20221001-090259

Then I get this output:

Eh bien, je m’entraîne beaur-
coup, je me prépare pour les
championnats suisses, après je
voudrais partir pour les Etats-
Unis. Mais ma mère préfère
rester en Europe.

But what I would like to have is this: Eh bien, je m’entraîne beaurcoup, je me prépare pour les championnats suisses, après je voudrais partir pour les Etats-Unis. Mais ma mère préfère rester en Europe.

What is the difference?

A) I would prefer a mode that produces no line breaks.
B) Words split at the end of a line, like "beaur- coup", should be written together: "beaurcoup".
C) Hyphenated words in which the hyphen is not a line-break hyphen (like "Etats-Unis") should keep their hyphen.

If you cannot implement C), it would be good enough to have A) and B).

What do you think?

danpla commented 2 years ago

Tesseract will probably be able to do this in the future: https://github.com/tesseract-ocr/tesseract/issues/728. If Tesseract's recognition process can pick the right "dehyphenation" rules on a per-language basis, that's all we need.

Otherwise, removing hyphens on the dpScreenOCR side will require either a library for natural language processing or at least a spell checking library. In either case, the task is not trivial, since the recognized text can contain fragments in different languages. It will also require users to install extra data in addition to Tesseract languages.

Processing on the Tesseract side would definitely be the best solution, so I'd rather wait for https://github.com/tesseract-ocr/tesseract/issues/728 for a while (although the issue is more than 5.5 years old :)

Golddouble commented 2 years ago

Thank you for your interesting feedback.

Either of your two proposals would lead to a perfect solution. However, I think we would have to wait another 5 years for Tesseract, and your suggestion on the dpScreenOCR side is very complex (although it would be ideal).

Question: Couldn't we implement the whole thing experimentally and "imperfectly" on the dpScreenOCR side by simply replacing certain characters?

These are rules for Linux bash. Maybe you can do something similar in C:
------------------------------------------------------------------------------------------
Rule 1: Replace "{a..z}-\n{a..z}" with "{a..z}{a..z}" (drop the hyphen and the line break)
Rule 2: Replace "{a..z}\n{a..z}" with "{a..z} {a..z}" (turn the line break into a space)
(Rule 3: Do not replace "{a..z}-\n{A..Z}")

Then do the same with:

Hyphen U+2010
Non-Breaking Hyphen U+2011
Figure Dash U+2012
EN Dash U+2013
EM Dash U+2014
Horizontal Bar U+2015
Hyphen Bullet U+2043

k20221005-124433

I think this would be very easy to implement. Of course, it would not be perfect, but it would save me a lot of post-processing of the captured OCR text. And of course it would not work with every language (Arabic, Chinese, Russian), but it would be enough for European, American, Australian, African(?) languages.

Of course, there must be a way to switch this function on or off as an option in the GUI.

Feedback is appreciated. Thank you.

danpla commented 2 years ago

Unfortunately, the "naive" algorithm will not work in most cases, removing hyphens when they should be kept, e.g., "twentieth-century music".
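
To make the failure mode concrete, here is a minimal Python sketch of the character rules proposed above. The names `naive_unwrap` and `DASHES` are made up for illustration and are not part of dpScreenOCR. The rules are interpreted as "always join across a line break, and drop the dash only when it sits between two lower-case letters"; that interpretation is an assumption.

```python
# Dash characters listed in the proposal: ASCII hyphen plus
# U+2010..U+2015 and U+2043.
DASHES = "-\u2010\u2011\u2012\u2013\u2014\u2015\u2043"


def naive_unwrap(text):
    """Join wrapped lines using character rules only (no dictionary)."""
    out = []
    for line in text.splitlines():
        prev = out[-1] if out else ""
        if not line or not prev:
            # Empty lines are kept as paragraph separators.
            out.append(line)
        elif prev[-1] in DASHES:
            if len(prev) > 1 and prev[-2].islower() and line[0].islower():
                # Rule 1: lower-case letter, dash, line break,
                # lower-case letter: drop the dash.
                out[-1] = prev[:-1] + line
            else:
                # Rule 3: keep the dash (e.g. "Etats-" + "Unis").
                out[-1] = prev + line
        else:
            # Rule 2: an ordinary line break becomes a space.
            out[-1] = prev + " " + line
    return "\n".join(out)
```

With these rules, `naive_unwrap("beau-\ncoup")` gives `"beaucoup"` and `naive_unwrap("Etats-\nUnis")` gives `"Etats-Unis"`, but `naive_unwrap("twentieth-\ncentury music")` gives `"twentiethcentury music"`, which is the kind of wrong result described above.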

If you don't mind this kind of de-hyphenation, you can do it in a script executed via the "Run executable" action. In fact, this way it's easy to implement the proper algorithm, which will remove hyphens only if the de-hyphenated word is in the list of valid words in a file. For French, you can download such a list here:

https://salsa.debian.org/gpernot/wfrench/-/blob/master/french

On Unix-like systems, you can also install this file (as /usr/share/dict/french) via the package manager. For example, this is the "wfrench" package on Ubuntu.
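
A rough Python sketch of the word-list idea could look as follows. The helper names `load_words` and `join_across_break` are hypothetical, and the sketch assumes the list has one word per line and is UTF-8 encoded, which may not hold for every copy of the file.

```python
import string


def load_words(path="/usr/share/dict/french"):
    """Read a word list into a set of lower-cased words.

    Assumption: one word per line, UTF-8 encoded.
    """
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def join_across_break(last_word, first_word, words):
    """Join the last word of a line (ending with a hyphen) with the
    first word of the next line.

    The hyphen is removed only if the joined word is in the list.
    """
    candidate = last_word[:-1] + first_word
    # Ignore surrounding punctuation such as the comma in "coup,".
    if candidate.strip(string.punctuation).lower() in words:
        return candidate
    return last_word + first_word
```

For example, with a list that contains "beaucoup", `join_across_break("beau-", "coup,", words)` returns `"beaucoup,"`, while `"Etats-"` and `"Unis."` stay `"Etats-Unis."` because "etatsunis" is not a listed word.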

Golddouble commented 2 years ago

Thank you. Yes, I will try to make a script for the naive algorithm. (I am not skilled enough to write the version with "wfrench".)

But it looks like the argument from dpScreenOCR has no '\n' in $1. This means I can not replace "{a..z}\n{a..z}" .

This is what my script looks like:

#!/bin/bash
ersetzt="${1//'\n'/'eeeeeeee'}"
echo "$ersetzt" > "~/MyPath/ScreenOCR.txt"

My original picture: k20221005-222647

Content of the file ScreenOCR.txt:

célèbre = berühmt

à la campagne = auf dem Lande

des promenades au bord de la Seine = Spaziergänge am Seine-Ufer
lire un bon livre = ein gutes Buch lesen

le rôle principal = die Hauptrolle

--> Confused: It looks like none of the '\n' have been replaced. They are still there. (?) Very strange.

danpla commented 2 years ago

I'm not skilled enough in Bash, so here is a simple Python script that unwraps paragraphs using Aspell for spell checking. You will need to install the needed Aspell language (e.g. aspell-fr package for French on Ubuntu) and set ASPELL_LANG (will be passed as --lang option to Aspell).

The script works not only with the ASCII hyphen, but also with other kinds of dashes (en dash, em dash, etc.).

You may want to remove the second call to is_valid_word(), so that in case of ambiguity the script prefers the word without the hyphen. This is probably the right thing to do in the general case, e.g. you don't want "car-pet" instead of "carpet".

#!/usr/bin/env python3

import datetime
import os
import subprocess
import sys
import unicodedata

ASPELL_LANG = 'fr'
APPEND_TO_FILE = os.path.expanduser("~/ocr_history.txt")

def is_dash(c):
    return unicodedata.category(c) == 'Pd'

def is_valid_word(word):
    with subprocess.Popen(
            ('aspell',
                '-a',
                '--lang=' + ASPELL_LANG,
                '--dont-suggest'),
            stdout=subprocess.PIPE,
            stdin=subprocess.PIPE,
            universal_newlines=True) as p:
        # ! to enter the terse mode (don't print * for correct words).
        # ^ to spell check the rest of the line.
        aspell_out = p.communicate(input='!\n^' + word)[0]

    # We use this function to check words both with and without
    # dashes. When the word contains dashes, Aspell checks each
    # dash-separated part as an individual word.
    #
    # If all words are correct in the terse mode, the output will be
    # a version info and an empty line.
    return aspell_out.count('\n') == 2

def unwrap_paragraphs(text, out_f):
    para = ''

    for line in text.splitlines():
        if not line:
            # Empty line is a paragraph separator
            if para:
                out_f.write(para)
                out_f.write('\n')
                para = ''

            out_f.write('\n')
            continue

        if not para:
            para = line
            continue

        if not is_dash(para[-1]):
            para += ' '
            para += line
            continue

        para_rpartition = para.rpartition(' ')
        para_last_word = para_rpartition[2]

        line_lpartition = line.partition(' ')
        line_first_word = line_lpartition[0]

        word_with_dash = para_last_word + line_first_word
        word_without_dash = para_last_word[:-1] + line_first_word

        if (is_valid_word(word_without_dash)
                # If the word is valid both with and without the dash,
                # keep the dashed variant.
                and not is_valid_word(word_with_dash)):
            para = (para_rpartition[0]
                + para_rpartition[1]
                + word_without_dash
                + line_lpartition[1]
                + line_lpartition[2])
        else:
            para += line

    if para:
        out_f.write(para)

if __name__ == '__main__':
    with open(APPEND_TO_FILE, 'a', encoding='utf-8') as out_f:
        out_f.write(
            '=== {} ===\n\n'.format(
                datetime.datetime.now().strftime(
                    "%Y-%m-%d %H:%M:%S")))
        unwrap_paragraphs(sys.argv[1], out_f)
        out_f.write('\n\n')
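

The note before the script about dropping the second call to is_valid_word() can be sketched in isolation. In this sketch the validator is passed in as a function so that the decision can be tried without Aspell; `choose_joined_word`, `prefer_undashed`, and the toy validator are hypothetical names, not part of the script above.

```python
def choose_joined_word(last_word, first_word, is_valid,
                       prefer_undashed=False):
    """Decide how to join a line-final word ending with a dash with
    the first word of the next line.

    With prefer_undashed=False this mirrors the condition in the
    script: the dash is dropped only if the undashed word is valid
    and the dashed one is not. With prefer_undashed=True the second
    check is skipped, so a valid undashed word always wins.
    """
    with_dash = last_word + first_word
    without_dash = last_word[:-1] + first_word

    if is_valid(without_dash) and (
            prefer_undashed or not is_valid(with_dash)):
        return without_dash
    return with_dash


# Toy stand-in for Aspell: a dashed word is valid if every
# dash-separated part is a known word.
KNOWN = {'car', 'pet', 'carpet'}


def toy_is_valid(word):
    return all(part in KNOWN for part in word.split('-'))
```

With the toy validator, `choose_joined_word('car-', 'pet', toy_is_valid)` returns `'car-pet'`, and passing `prefer_undashed=True` returns `'carpet'`.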

Golddouble commented 2 years ago

Thank you very much for your script. :+1: :-) I appreciate it.

I have made the file "dpScreenOCRPython.py" with the content of this script and have added the path to it into the "action" tab. I have found the output in ~/ocr_history.txt. It works more or less.

Why only "more or less"? It looks like tesseract sometimes thinks that there are two line breaks ('\n\n') although there is only one.

Example: k20221007-101324

Result from tesseract:

Le centre commercial

Au centre commercial, on trouve sous un même
toit? beaucoup de magasins de détail et

de services (banque, poste, restaurant, etc.).
Les clients stressés ne doivent plus aller

d'un magasin à l'autre pour faire leurs courses.
Les centres commerciaux se trouvent à la péri-
phérie des villes. On y va donc en voiture et

on gare sa voiture dans les grands parkings.

Il y a des familles qui passent toute la journée
du samedi dans les centres commerciaux.

And of course your Python script converts this into:

Le centre commercial

Au centre commercial, on trouve sous un même toit? beaucoup de magasins de détail et

de services (banque, poste, restaurant, etc.). Les clients stressés ne doivent plus aller

d'un magasin à l'autre pour faire leurs courses. Les centres commerciaux se trouvent à la périphérie des villes. On y va donc en voiture et

on gare sa voiture dans les grands parkings.

Il y a des familles qui passent toute la journée du samedi dans les centres commerciaux.

So it looks like it is not enough for the Python script to handle only '\n'. It should also convert '\n\n'.

Second: I would prefer that the script does not create the "ocr_history.txt" file but puts the output directly into the clipboard instead.

I will not use the action options "copy text into clipboard" and "run a program" at the same time, so there will be no conflict.

danpla commented 2 years ago

To copy text to the clipboard, you can use xsel or xclip. If you're not familiar with Python, it would be easier for you to replace the last block in the script (starts with if __name__ == '__main__':) with the following:

if __name__ == '__main__':
    unwrap_paragraphs(sys.argv[1], sys.stdout)

This way, the script will print to standard output instead of to a file, so you will be able to invoke it in a Bash script and then call xsel/xclip, like:

#!/bin/bash

TEXT=$(~/dpScreenOCRPython.py "$1")

xsel --clipboard <<< "$TEXT"
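
If you would rather stay in Python, the same hand-off to xsel can be sketched with the subprocess module. The function name `copy_to_clipboard` and its `command` parameter are made up for this example; the sketch assumes that xsel is installed and that an X session is running.

```python
import subprocess


def copy_to_clipboard(text, command=('xsel', '--clipboard', '--input')):
    """Send text to a clipboard tool through its standard input.

    Raises subprocess.CalledProcessError if the tool exits with a
    non-zero status.
    """
    subprocess.run(command, input=text, universal_newlines=True,
                   check=True)
```

Calling `copy_to_clipboard(some_text)` at the end of the script would then replace the separate Bash wrapper.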

Unfortunately, removing empty lines will unconditionally join all paragraphs. This is something that should be done on the Tesseract side; they already have an issue on the tracker: https://github.com/tesseract-ocr/tesseract/issues/2155. If you don't mind removing all empty lines, you can do it with TEXT=$(sed '/^$/d' <<< "$1") before calling the Python script. Alternatively, here is a bit more sophisticated Python script that only removes an empty line if the next one starts with a lower-case character:

#!/usr/bin/env python3

import sys

lines = sys.argv[1].splitlines()

for i, line in enumerate(lines):
    if (not line
            and i + 1 < len(lines)
            and (not lines[i + 1]
                or lines[i + 1][0].islower())):
        continue

    print(line)

You can combine both scripts like:

#!/bin/bash

TEXT=$(~/remove_empty_lines.py "$1")
TEXT=$(~/dpScreenOCRPython.py "$TEXT")

xsel --clipboard <<< "$TEXT"
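
To see what the empty-line filter does to the shopping-centre example from earlier in the thread, the same condition can be wrapped in a function. This is only a restatement of the script above for experimenting; the name `remove_soft_empty_lines` is hypothetical.

```python
def remove_soft_empty_lines(text):
    """Drop an empty line when the line after it is also empty or
    starts with a lower-case character; keep every other line."""
    lines = text.splitlines()
    kept = []
    for i, line in enumerate(lines):
        next_line = lines[i + 1] if i + 1 < len(lines) else None
        if (not line
                and next_line is not None
                and (not next_line or next_line[0].islower())):
            continue
        kept.append(line)
    return '\n'.join(kept)


sample = (
    'Le centre commercial\n'
    '\n'
    'Au centre commercial, on trouve sous un même\n'
    'toit? beaucoup de magasins de détail et\n'
    '\n'
    'de services (banque, poste, restaurant, etc.).'
)
cleaned = remove_soft_empty_lines(sample)
```

Here the empty line after the heading is kept, because the next line starts with an upper-case "A", while the empty line before "de services" is removed. A genuine paragraph that happens to start with a lower-case letter would be merged as well, so this remains a heuristic.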

Golddouble commented 2 years ago

Thank you very much. That's great stuff.

I think this is good enough for my purpose (translating from French into German with DeepL).

Golddouble commented 2 years ago

Follow up:

Your original Python script (https://github.com/danpla/dpscreenocr/issues/23#issuecomment-1270978326) does two things:

1. It turns things like
```
beau-
coup
```
into beaucoup.
2. It turns things like
```
les
championnats
```
into les championnats.

Actually, in the meantime I would prefer a script that only does the

beau-
coup

replacement. Am I right that your original Python script has separate sections for these two tasks? If so, which section does what?

Thank you.

danpla commented 2 years ago

In the block that starts with if not is_dash(para[-1]):, replace para += ' ' with para += '\n'.
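
As a self-contained illustration of that behaviour, here is a sketch that keeps ordinary line breaks and only joins across a line that ends with a dash. It is not the original script: the validator is passed in as `is_valid_word` so it can be tried without Aspell, the name `join_hyphenated_only` is hypothetical, and the last word is found by splitting on any whitespace. The latter is an assumption made here so that a kept line break can never end up inside the word being checked.

```python
import unicodedata


def is_dash(c):
    return unicodedata.category(c) == 'Pd'


def join_hyphenated_only(text, is_valid_word):
    """Keep line breaks, except after a line that ends with a dash."""
    out = []
    for line in text.splitlines():
        prev = out[-1] if out else ''
        if not (prev and line and is_dash(prev[-1])):
            # Ordinary line (or empty line): keep the line break.
            out.append(line)
            continue

        # prev ends with a dash, so its last word is a suffix of prev.
        last_word = prev.split()[-1]
        first_word, sep, rest = line.partition(' ')

        with_dash = last_word + first_word
        without_dash = last_word[:-1] + first_word

        if is_valid_word(without_dash) and not is_valid_word(with_dash):
            # Line-break hyphen: drop the dash and pull the line up.
            out[-1] = (prev[:-len(last_word)] + without_dash
                       + sep + rest)
        else:
            # Real hyphen: keep the dash, still join the two lines.
            out[-1] = prev + line
    return '\n'.join(out)
```

To use it with the original script, pass that script's is_valid_word function as the second argument.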

Golddouble commented 2 years ago

Thank you.