barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
MIT License

Missing countries in german language #140

Closed pokecheater closed 6 months ago

pokecheater commented 1 year ago

For example, Türkei (Turkey) is corrected to Türme (towers). I think proper country names are missing.

pokecheater commented 1 year ago

I had a look at the German dictionary, and country names are indeed missing. The funny part is that türkisch (Turkish), for example, does exist. I also tested Polen (Poland) and Indien (India), with the same result: country names are not in the dictionary.

pokecheater commented 1 year ago

https://www.thoughtco.com/countries-of-the-world-index-4101906

pokecheater commented 1 year ago

Another error I noticed: the word "Tieren" (animals) is corrected to "tiefen" (deep). Also, in my opinion "fuer" should be corrected to "für" ("fuer" is not strictly wrong, since it is the fallback spelling used when the letter ü is unavailable).

Input: sag das mal den Milliarden Tieren die fuer Fleisch getötet werden
Output: sag das mal den Milliarden tiefen die fuer Fleisch getötet werden

barrust commented 1 year ago

Thank you for this information. I am not a German speaker, so this is very helpful. The word frequencies are built from OpenSubtitles project data, so these issues originate there.

Any help maintaining and updating the languages would be much appreciated. The easiest way to start is to look in the scripts folder, where you can see how the dictionaries are generated. Some files there list words that must be removed, and others list missing words that must be added.
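To illustrate the idea, here is a minimal sketch of how word lists like these could be applied to a word-frequency mapping. It is not the code from the scripts folder: the function name `apply_word_lists`, the one-word-per-line format, the lowercasing, and the `default_count` value are all assumptions made for the example.

```python
def apply_word_lists(word_frequency, include, exclude, default_count=1):
    """Return a copy of word_frequency with excluded words dropped and
    included words added.

    `include` and `exclude` are iterables of lines (lists or open text
    files). Hypothetical helper, not part of pyspellchecker.
    """
    # Normalize the exclusion list: strip whitespace, skip blank lines.
    excluded = {line.strip().lower() for line in exclude if line.strip()}

    # Drop every excluded word from the frequency mapping.
    result = {w: c for w, c in word_frequency.items() if w not in excluded}

    # Add missing words with a small default count; existing counts are kept.
    # Assumption: if a word is in both lists, the exclusion wins.
    for line in include:
        word = line.strip().lower()
        if word and word not in excluded:
            result.setdefault(word, default_count)
    return result


frequencies = {"türme": 150, "türkisch": 120, "fuer": 40, "für": 5000}
updated = apply_word_lists(
    frequencies,
    include=["Türkei", "Polen", "Indien"],
    exclude=["fuer"],
)
print(sorted(updated))
# ['für', 'indien', 'polen', 'türkei', 'türkisch', 'türme']
```

The sample counts are made up; only the example words come from this thread.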

A Pull Request updating the excluded and included txt files would carry these changes into the next build of the dictionaries. Code could also be added to scripts/build_dictionary.py (likely in the clean_german() function) to make this more automatic. For example, if ü is often written as ue, finding every pair of words where that is the only difference would make it easier to exclude the ue spellings. I am not sure whether that holds, but something along those lines for finding common errors could be useful.
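A rough sketch of that ue/ü check, assuming the dictionary is a plain mapping from lowercase words to counts. The helper names `umlaut_candidates` and `find_ue_variants` are hypothetical and do not exist in scripts/build_dictionary.py, and the sample counts are invented.

```python
def umlaut_candidates(word):
    """Yield spellings of `word` with exactly one "ue" replaced by "ü"."""
    start = word.find("ue")
    while start != -1:
        yield word[:start] + "ü" + word[start + 2:]
        start = word.find("ue", start + 1)


def find_ue_variants(word_frequency):
    """Map each ue spelling to the ü spelling that is also in the dictionary.

    A word is reported only if swapping a single "ue" for "ü" produces
    another known word, so words such as "steuer" are left alone.
    """
    variants = {}
    for word in word_frequency:
        for candidate in umlaut_candidates(word):
            if candidate in word_frequency:
                variants[word] = candidate
                break
    return variants


sample = {"für": 5000, "fuer": 40, "steuer": 300, "türkisch": 120, "tuerkisch": 3}
print(find_ue_variants(sample))
# {'fuer': 'für', 'tuerkisch': 'türkisch'}
```

The keys of the result are candidates for the exclusion list. They would still need review by a German speaker, since two valid words could in principle differ only by ue and ü.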

Thanks!