bountonw / translate-tooling


Use dictionary-based spellchecking #14

Open mattleff opened 3 months ago

mattleff commented 3 months ago

Currently we use a forbidden-words list to define misspellings and words that should be avoided. This works, but it has some downsides:

  1. While in rare cases there may be more than one correct spelling of a word, there are infinitely many possible misspellings of any word.
  2. Because of that, even defining hundreds or thousands of common misspellings will never guarantee total correctness.
  3. Particularly for proper nouns, defining the correct spelling is the only way to ensure that all occurrences use the same spelling.

I propose that over time we replace the forbidden-words system with a custom dictionary-based spellchecking system. This depends on four pieces:

  1. Accurate word boundary identification (see algorithms and tools). This can be done with either Swath or pythainlp (tutorial), both of which support robust, dictionary-based Thai word segmentation (see the segmentation sketch after this list).
  2. A base Thai dictionary/word list. Some possible options: libthai, tex-hyphen (modified from libthai), hunspell-th, or, likely, a combination of multiple dictionaries.
  3. A Thai religious-terms seed list. If we parse the existing translations in bountonw/translate and strip all words known to the base Thai dictionary, we will be left with a set of candidate religious terms (see the extraction sketch after this list). We could also parse the modern (used?) Thai Bible translations to generate additional candidate terms for our extension list. The additional terms would need manual review to ensure we don't simply codify existing misspellings. This list would necessarily grow over time.
  4. The actual Markdown spellchecking process (see the checker sketch after this list). This could possibly use cspell, PySpelling, node-markdown-spellcheck, or https://github.com/prosebot/node-markdown-spellcheck, but it will more likely have to be custom-developed, since Thai word segmentation may not work (well 🙃) with existing spellcheckers.
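
Here is a minimal segmentation sketch, assuming pythainlp 2.x+, where `word_tokenize` accepts a `Trie` as its `custom_dict` (the sample sentence and extra term are placeholders, not project data):

```python
from pythainlp.corpus import thai_words
from pythainlp.tokenize import word_tokenize
from pythainlp.util import Trie

# Stock word list extended with a custom term (placeholder for the real seed list).
custom_dict = Trie(set(thai_words()) | {"วิญญาณ"})

text = "พระเจ้าทรงรักวิญญาณของมนุษย์"
print(word_tokenize(text, engine="newmm", custom_dict=custom_dict))
# e.g. ['พระเจ้า', 'ทรง', 'รัก', 'วิญญาณ', 'ของ', 'มนุษย์']
```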
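
The extraction sketch for the seed list might look like the following; the `translate/` path stands in for a local checkout of bountonw/translate, and the stock pythainlp word list stands in for whichever base dictionary we settle on:

```python
from collections import Counter
from pathlib import Path

from pythainlp.corpus import thai_words
from pythainlp.tokenize import word_tokenize

base_dict = set(thai_words())  # stand-in for the chosen base dictionary
candidates = Counter()

for path in Path("translate").rglob("*.md"):  # hypothetical checkout location
    for token in word_tokenize(path.read_text(encoding="utf-8")):
        token = token.strip()
        # Keep Thai-script tokens that the base dictionary doesn't know.
        if token and token not in base_dict and all("\u0e00" <= ch <= "\u0e7f" for ch in token):
            candidates[token] += 1

# High-frequency unknowns are likely religious terms (or common misspellings);
# both are exactly what the manual review step needs to see.
for word, count in candidates.most_common(50):
    print(f"{count:5d}  {word}")
```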
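
And a checker sketch: strip Markdown syntax line by line, segment, and flag unknown tokens. The regex and file paths here are assumptions for illustration, not a finished design:

```python
import re
from pathlib import Path

from pythainlp.corpus import thai_words
from pythainlp.tokenize import word_tokenize

# Base dictionary plus the custom terms file (path is hypothetical).
known = set(thai_words())
known |= set(Path("dictionaries/religious-terms.txt").read_text(encoding="utf-8").split())

MD_NOISE = re.compile(r"https?://\S+|[#*_`>\[\]()!]")  # crude Markdown stripper

def check_file(path: Path) -> None:
    for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
        for token in word_tokenize(MD_NOISE.sub(" ", line)):
            token = token.strip()
            if token and token not in known and all("\u0e00" <= ch <= "\u0e7f" for ch in token):
                print(f"{path}:{lineno}: unknown word {token!r}")

check_file(Path("manuscript/chapter-01.md"))  # hypothetical file
```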

Once this infrastructure is in place, we would have full knowledge of dictionary-based word boundaries in the translation text. This could enable near-perfect hyphenation/word-breaking markup. It may be necessary to define each custom term's allowed/preferred word-breaking positions.
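
As a sketch of what that markup step could look like, assuming the downstream typesetter accepts zero-width spaces (U+200B) as break opportunities (if it needs soft hyphens or `<wbr>` instead, only the join string changes):

```python
from pythainlp.tokenize import word_tokenize

ZWSP = "\u200b"  # zero-width space: an invisible line-break opportunity

def mark_breaks(text: str) -> str:
    """Insert a break opportunity at every dictionary word boundary."""
    return ZWSP.join(word_tokenize(text, keep_whitespace=True))

print(mark_breaks("พระเจ้าทรงรักวิญญาณของมนุษย์"))
```

Custom terms with preferred internal break points could carry that markup in the term list itself, so the joiner never splits them anywhere else.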

Also, with a full dictionary it would be possible to offer Levenshtein-distance-based misspelling suggestions (using one of these). Spellchecking could then automatically suggest, for example, วิญญาณ whenever a translator typed วิณณาณ. And rather than translators manually defining every plausible near miss, the dictionary itself would be the source of possible alternatives.
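
A minimal sketch of that suggestion step, using a plain dynamic-programming Levenshtein implementation rather than committing to any particular library (the three-word dictionary is obviously a placeholder):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance over code points (Thai combining marks count individually)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def suggest(word: str, dictionary: set[str], max_dist: int = 2) -> list[str]:
    scored = sorted((levenshtein(word, w), w) for w in dictionary)
    return [w for d, w in scored if d <= max_dist]

print(suggest("วิณณาณ", {"วิญญาณ", "วิหาร", "ศาสนา"}))  # -> ['วิญญาณ']
```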

bountonw commented 3 months ago

There are common mistakes that are unique to me. And the มากว่า bad word will never be caught by a dictionary, because it splits into valid words. I am not opposed to using a dictionary approach, but it will be a lot of work to implement. However, there still needs to be a negative list to catch the things a dictionary can't.
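
That negative list could stay as a simple substring pass run alongside the dictionary check; a possible sketch, with illustrative entries rather than the project's actual list:

```python
# Bad strings the dictionary can't catch, mapped to their intended corrections.
NEGATIVE_LIST = {
    "มากว่า": "มากกว่า",  # illustrative entry based on the example above
}

def check_negative(text: str) -> list[str]:
    return [f"found {bad!r}; did you mean {good!r}?"
            for bad, good in NEGATIVE_LIST.items() if bad in text]

print(check_negative("เขามีมากว่าเดิม"))
```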

There should also be some book-level dictionaries. The book that Amnart is working on has many unique words that will not occur in any other book but are frequent in this mission book about Borneo.
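
Book-level dictionaries could be plain extra word-list files merged in per build; a sketch of one possible layout (all paths hypothetical):

```python
from pathlib import Path

def load_words(path: Path) -> set[str]:
    # One word per line; skip blanks and comment lines.
    return {line.strip() for line in path.read_text(encoding="utf-8").splitlines()
            if line.strip() and not line.lstrip().startswith("#")}

# Shared terms plus one optional list per book.
known = load_words(Path("dictionaries/religious-terms.txt"))
known |= load_words(Path("dictionaries/books/borneo-mission.txt"))
```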

mattleff commented 2 months ago

@bountonw Quick update: I've built a proof of concept locally and run it against the LBF book. I only had to add 19 custom words, and I found 10 potential typos (see https://github.com/bountonw/translate/pull/276). The process was somewhat manual, so I'll need to think about how to make it more automatic/scriptable.

bountonw commented 2 months ago

Most of the files are updated correctly. Two of the files were already correct and shouldn't have been changed. For one of the files, it isn't clear what was changed. I have commented on the PR. This is very helpful.