Open mattleff opened 3 months ago
There are common mistakes that are unique to me. And the bad word มากว่า will never be caught by a dictionary. I am not opposed to using a dictionary approach, but implementing it will be a lot of work. However, there still needs to be a negative list to catch things that a dictionary cannot catch.
There should be some book-level dictionaries. The book that Amnart is working on has many unique words that will not occur in any other book but that are frequent in this mission book on Borneo.
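A rough sketch of how these layers could fit together, assuming some other step has already split the text into tokens: a shared dictionary, a per-book dictionary, and a negative list that is matched against the raw text rather than the tokens. The function name, the parameter names, and the sample words (apart from มากว่า) are made up for illustration; none of this exists in the repo.

```python
def find_problems(text, tokens, base_words, book_words, forbidden):
    """Return (kind, word) pairs for negative-list hits and unknown tokens.

    All names here are hypothetical. `tokens` is assumed to come from a
    separate word-segmentation step that is not shown.
    """
    problems = []

    # Negative list: scan the raw text, so a hit does not depend on how
    # the text happened to be segmented.
    for bad in sorted(forbidden):
        if bad in text:
            problems.append(("forbidden", bad))

    # Dictionary check: a token passes if either the shared dictionary
    # or the book-level dictionary contains it.
    known = base_words | book_words
    for tok in tokens:
        if tok not in known:
            problems.append(("unknown", tok))

    return problems


# Placeholder data: "abc" stands in for a shared-dictionary word and
# "kinabalu" for a book-specific term.
example = find_problems(
    text="abc มากว่า kinabalu xyz",
    tokens=["abc", "kinabalu", "xyz"],
    base_words={"abc"},
    book_words={"kinabalu"},
    forbidden={"มากว่า"},
)
```

Scanning the raw text for negative-list entries is a design choice in this sketch: it keeps the negative list working no matter how the segmentation step splits the words.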
@bountonw Quick update: I've made a proof-of-concept here locally and run it against the LBF book. I only had to add 19 custom words and I found 10 potential typos (see https://github.com/bountonw/translate/pull/276). The process was somewhat manual so I'll need to think about how to make it more automatic/scriptable.
Most of the files are updated correctly. Two of the files were correct already and shouldn't have been changed. For one of the files, it isn't clear what was changed. I have commented on the PR. This is very helpful.
Currently we use forbidden words to define misspellings or words that should be avoided. This works, but has some downsides:
I propose that over time we replace the forbidden words system with a custom dictionary-based spellchecking system. This depends on four pieces:
hunspell-th, or, likely, a combination of multiple dictionaries.

cspell, PySpelling, node-markdown-spellcheck, or https://github.com/prosebot/node-markdown-spellcheck but more likely will have to be custom developed since Thai word segmentation may not work (well 🙃) with existing spellcheckers.

Once this infrastructure was developed we would have full knowledge of dictionary-based word boundaries in the translation text. This could possibly enable more-perfect (near-perfect?) hyphenation/word-breaking markup. It may be necessary to define each custom term's allowed/preferred word-breaking positions.
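To make the word-boundary idea concrete, here is a minimal sketch of dictionary-based segmentation using greedy longest-match. This is a deliberately simple stand-in, not a proposal for the final algorithm: greedy matching can pick the wrong boundary, and real Thai segmentation is harder. The function name `segment` and the sample words are assumptions for illustration, and ASCII words are used only to keep the example easy to read.

```python
def segment(text, words):
    """Greedy longest-match segmentation against a set of known words.

    Returns a list of (token, known) pairs. Runs of characters that match
    no dictionary word are grouped into a single token with known=False.
    Hypothetical helper, not an existing tool.
    """
    max_len = max((len(w) for w in words), default=1)
    out = []
    unknown = ""
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in words:
                match = text[i:j]
                break
        if match:
            if unknown:
                out.append((unknown, False))
                unknown = ""
            out.append((match, True))
            i += len(match)
        else:
            # No dictionary word starts here; extend the unknown run.
            unknown += text[i]
            i += 1
    if unknown:
        out.append((unknown, False))
    return out


clean = segment("thecatsat", {"the", "cat", "sat"})
greedy_trap = segment("thecatsat", {"the", "cat", "cats", "sat"})
```

With the first dictionary every token is known. With the second, greedy matching takes "cats" and leaves "at" as an unknown run, which shows why a real implementation would probably need something smarter than longest-match.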
Also, with a full dictionary it would be possible to offer Levenshtein distance-based misspelling suggestions (using one of these). This could allow spellchecking to automatically suggest, for example, วิญญาณ whenever a translator used วิณณาณ, etc. And rather than translators manually defining all the near possible alternatives, the dictionary would be the source of what the possible alternatives would be.
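As an illustration of the suggestion step, here is a small self-contained sketch that computes Levenshtein distance with the standard dynamic-programming recurrence and ranks dictionary words by distance. An existing library would probably be used in practice; the `suggest` function, its `max_distance` parameter, and the default threshold of 2 are assumptions made for this example.

```python
def levenshtein(a, b):
    """Edit distance between two strings, counted in Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def suggest(word, dictionary, max_distance=2):
    """Return dictionary words within max_distance of word, closest first.

    Hypothetical helper; the threshold is an arbitrary choice.
    """
    scored = sorted((levenshtein(word, w), w) for w in dictionary)
    return [w for d, w in scored if d <= max_distance]


suggestions = suggest("วิณณาณ", {"วิญญาณ", "kinabalu"})
```

The two spellings in the example above differ by two substituted characters, so a threshold of 2 is enough to surface วิญญาณ as a suggestion for วิณณาณ. Comparing every word against the whole dictionary is slow for large dictionaries, so this only shows the idea.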