ali1234 / vhs-teletext

Software to recover teletext data from VHS recordings.
GNU General Public License v3.0

Improve the spelling checker #70

Open ali1234 opened 2 years ago

ali1234 commented 2 years ago

The spellcheck tool automatically corrects errors using a dictionary-based lookup, similar to the one you would find in a word processor. This doesn't work very well because errors produced by the deconvolution algorithm are different to those that a human would make.

Deconvolution errors always take the form of one character being substituted for another, so the length of the string never changes. For example, "e" is often confused with "j". This type of error is measured by Hamming distance: the number of positions at which two equal-length strings differ, i.e. the number of substitutions required to transform one into the other.

Humans make this type of error too, but they also make errors where the correct letters appear in the wrong order, and errors where letters are left out or added, for example typing "teh", "te", or "thee" instead of "the". These errors are measured by Levenshtein distance: the minimum number of insertions, deletions, or substitutions needed to transform one string into another.
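
For illustration, here is a minimal dynamic-programming sketch of Levenshtein distance (not the engine's actual implementation):

def levenshtein_distance(s1, s2):
    # Classic dynamic-programming formulation: prev[j] holds the distance
    # between the first i-1 characters of s1 and the first j characters of s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (c1 != c2)))   # substitution
        prev = cur
    return prev[-1]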

The spell check engine currently in use measures error by Levenshtein distance, so it misestimates how far each suggestion really is from the input string, and the suggestions it returns are suboptimal. For example, given the input string "jdge" it is more likely to suggest "judge" than "edge". If the text had been typed by a human, "judge" would be the better suggestion. But given the way deconvolution works, "edge" is the best suggestion, and "judge" is impossible because it would require inserting a new character.
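
To make that concrete (reusing the levenshtein_distance sketch above; results shown in comments):

# Both candidates are a single edit away under Levenshtein,
# so that metric alone cannot prefer one over the other...
levenshtein_distance("jdge", "judge")  # 1 (insert "u")
levenshtein_distance("jdge", "edge")   # 1 (substitute "j" -> "e")

# ...but only "edge" is reachable by pure substitutions:
# "judge" has a different length, so no Hamming distance exists.
sum(a != b for a, b in zip("jdge", "edge"))  # 1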

The current code contains workarounds for cases like these: the suggestions are filtered to remove any word of a different length, and any word that can't be reached via a list of common substitutions ("e"/"j" being one such pair). This filtering can only happen after the spell check engine has returned its list of suggestions, which is capped at the best N it found. If the correct answer is ranked too low by the Levenshtein metric, it never makes that list, and the filtering then returns no suggestions at all.
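
A sketch of that kind of post-filtering (hypothetical names, and a deliberately tiny substitution table; the real tool's list is larger):

# Illustrative only: confusion pairs in both directions.
COMMON_SUBSTITUTIONS = {("e", "j"), ("j", "e")}

def plausible(word, suggestion):
    # Deconvolution never inserts or deletes, so lengths must match...
    if len(word) != len(suggestion):
        return False
    # ...and every differing position must be a known confusion pair.
    return all(a == b or (a, b) in COMMON_SUBSTITUTIONS
               for a, b in zip(word, suggestion))

def filter_suggestions(word, suggestions):
    return [s for s in suggestions if plausible(word, s)]

# filter_suggestions("jdge", ["judge", "edge"]) -> ["edge"]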

Calculating Levenshtein distance is also slower than calculating Hamming distance, so a more specialised algorithm based on Hamming distance would be both faster and more accurate. Unfortunately there doesn't seem to be any drop-in replacement spell checking engine that can be tuned to work this way.

MatthewTromp commented 2 years ago

What about simply counting the number of differences between two strings? For instance:

def hamming_distance(s1, s2):
    # Only defined for equal lengths; zip() would silently truncate otherwise.
    assert len(s1) == len(s2)
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))
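
Checked against the strings from the issue (results shown in comments):

hamming_distance("jdge", "edge")   # 1: the single j -> e substitution
hamming_distance("jdge", "judge")  # raises AssertionError: lengths differ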