pokecheater closed this 6 months ago
I had a look at the German dictionary, and country names are indeed missing. The funny part is that türkisch (Turkish), for example, exists. I also tested Polen (Poland) and Indien (India). It is the same: country names are not in the dictionary.
Another error I noticed is that, for example, the word "Tieren" (animals) gets corrected to "tiefen" (deep). Also, the word "fuer" should become "für" in my opinion ("fuer" is not really wrong, since it is the fallback spelling used when the letter ü is not available).
Input: sag das mal den Milliarden Tieren die fuer Fleisch getötet werden
Corrected: sag das mal den Milliarden tiefen die fuer Fleisch getötet werden
Thank you for this information. I am not a German speaker, so this is very helpful. The data used to build the word frequencies comes from the OpenSubtitles project, so these issues originate there.
Any help maintaining and updating the languages would be much appreciated. The easiest method is to look in the scripts folder where you can see how the dictionaries are generated. There are a few files that can be used to ensure that certain words are removed, and others to ensure that missing words are added.
A Pull Request updating the excluded and included txt files would make the next build of the dictionaries reflect these changes. There may also be some code that could be written in scripts/build_dictionary.py (likely in the clean_german() function) that could make this more automatic. For example, if ü is often written as ue, perhaps finding all instances where that is the only difference between two words would make it easier to exclude those that have the ue spelling. Not sure if that is true or not, but something like that to find common errors could be useful.
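The idea of pairing ue and ü spellings could be sketched roughly as follows. This is only an illustration, not code from the repository: the function name find_digraph_variants is hypothetical, the word frequencies are assumed to be a plain dict of lowercase words to counts, and how the result would be fed into the excluded txt file or clean_german() is left open because that format is not described here.

```python
# Hypothetical helper, not part of scripts/build_dictionary.py.
# Assumes lowercase words; ae/oe are handled the same way as ue.
UMLAUT_DIGRAPHS = {"ae": "ä", "oe": "ö", "ue": "ü"}


def find_digraph_variants(word_frequency):
    """Map each digraph spelling to its umlaut spelling when both are present.

    Only one digraph occurrence is replaced at a time, so a word that
    contains two transliterated umlauts is not detected by this sketch.
    """
    variants = {}
    for word in word_frequency:
        for digraph, umlaut in UMLAUT_DIGRAPHS.items():
            start = word.find(digraph)
            while start != -1:
                # Swap this single occurrence of the digraph for the umlaut.
                candidate = word[:start] + umlaut + word[start + len(digraph):]
                if candidate in word_frequency:
                    variants[word] = candidate
                start = word.find(digraph, start + 1)
    return variants


# Made-up counts, for illustration only.
sample = {"für": 1000, "fuer": 12, "neue": 500, "müssen": 300, "muessen": 5}
print(find_digraph_variants(sample))
# {'fuer': 'für', 'muessen': 'müssen'}
```

Words such as "neue" are left alone because "neü" is not in the dictionary. The output should still be reviewed by a German speaker before anything is excluded, since a pair could in principle consist of two legitimate words.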
Thanks!
For example, Türkei (Turkey) gets corrected to Türme (towers). I think proper country names are missing.