filyp / autocorrect

Spelling corrector in python
GNU Lesser General Public License v3.0

Error when adding Azerbaijani Language #45

Closed mshahcode closed 2 years ago

mshahcode commented 2 years ago

Hello, I am trying to add the Azerbaijani language, but I can't manage to do it.

I added this to word_regexes:

"az": r"[AaBbCcÇçDdEeƏəFfGgĞğHhXxIıİiJjKkQqLlMmNnOoÖöPpRrSsŞşTtUuÜüVvYyZz]+",

and this to alphabets:

"az": "abcdefghijklmnopqrstuvxyzəüöğşçı",

Then I downloaded azwiki-latest-pages-articles.xml.

When I run this code, it gives me an error:

from autocorrect.word_count import count_words
count_words('azwiki-latest-pages-articles.xml', 'salam')

PS: "salam" means "hello" in my language. The following error appears:

Traceback (most recent call last):
  File "C:/Users/User/AppData/Local/Programs/Python/Python39/spell.py", line 5, in <module>
    count_words('azwiki-latest-pages-articles.xml', 'salam')
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\autocorrect\word_count.py", line 19, in count_words
    counts = Counter(words)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\collections\__init__.py", line 593, in __init__
    self.update(iterable, **kwds)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\collections\__init__.py", line 679, in update
    _count_elements(self, iterable)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\autocorrect\word_count.py", line 9, in get_words
    word_regex = word_regexes[lang]
KeyError: 'salam'

Any way to fix this?

filyp commented 2 years ago

Ah, that's just a simple misunderstanding :) In the example count_words('hiwiki-latest-pages-articles.xml', 'hi'), 'hi' is not "hi" as in "hello" but the language code for Hindi.

So you should do:

count_words('azwiki-latest-pages-articles.xml', 'az')
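That second argument is used as a key into word_regexes (see the word_count.py frame in the traceback), which is why passing the word 'salam' raised KeyError: 'salam'. A minimal sketch of that lookup, using the entry you added (pattern abbreviated):

word_regexes = {
    "az": r"[AaBbCc...Zz]+",     # the full Azerbaijani pattern added above
}

lang = "salam"               # a word, not a language code
print(lang in word_regexes)  # False -> word_regexes[lang] raises KeyError: 'salam'

lang = "az"                  # the registered language code
print(lang in word_regexes)  # True  -> the lookup succeeds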
mshahcode commented 2 years ago

Thanks, let me try))

mshahcode commented 2 years ago

Now I get the following error:

Traceback (most recent call last):
  File "C:/Users/User/AppData/Local/Programs/Python/Python39/spell.py", line 4, in <module>
    count_words('azwiki-latest-pages-articles.xml', 'az')
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\autocorrect\word_count.py", line 19, in count_words
    counts = Counter(words)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\collections\__init__.py", line 593, in __init__
    self.update(iterable, **kwds)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\collections\__init__.py", line 679, in update
    _count_elements(self, iterable)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\autocorrect\word_count.py", line 12, in get_words
    for line in file:
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 7883: character maps to <undefined>

filyp commented 2 years ago

You can try count_words(..., encd='utf-8')
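That is, assuming the encd keyword is passed through to the file open call (the UnicodeDecodeError above comes from Windows defaulting to cp1252), a sketch of the full call:

from autocorrect.word_count import count_words

# Read the Wikipedia dump as UTF-8 instead of the Windows default cp1252,
# which cannot decode byte 0x8f.
count_words('azwiki-latest-pages-articles.xml', 'az', encd='utf-8')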

mshahcode commented 2 years ago

Sorry, but now I get the following error((:

I now have this file: word_count.json

from autocorrect import Speller

spell = Speller('az')
print(spell('selam'))

dictionary for this language not found, downloading...
Traceback (most recent call last):
  File "C:/Users/User/AppData/Local/Programs/Python/Python39/spell.py", line 3, in <module>
    spell = Speller('az')
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\autocorrect\__init__.py", line 83, in __init__
    self.nlp_data = load_from_tar(lang) if nlp_data is None else nlp_data
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\autocorrect\__init__.py", line 51, in load_from_tar
    urls = [
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\autocorrect\__init__.py", line 52, in <listcomp>
    gateway + path for gateway in ipfs_gateways for path in ipfs_paths[lang]
KeyError: 'az'

filyp commented 2 years ago

It tries to download this file because it didn't find it locally. You have to follow all the instructions from the readme. Do this:

tar -zcvf autocorrect/data/az.tar.gz word_count.json

Inside the directory where you cloned the repo.
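Putting the pieces together, the whole flow is roughly this (a sketch, assuming you install autocorrect from your local clone so that autocorrect/data/az.tar.gz is the file Speller('az') finds locally instead of trying to download one):

# 1. Build word_count.json from the dump (run in the repo root):
from autocorrect.word_count import count_words
count_words('azwiki-latest-pages-articles.xml', 'az', encd='utf-8')

# 2. Pack it where the library looks for it (shell, in the repo root):
#    tar -zcvf autocorrect/data/az.tar.gz word_count.json

# 3. Then the speller can load it:
from autocorrect import Speller
spell = Speller('az')
print(spell('salam'))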

mshahcode commented 2 years ago

I managed to add my language, thank you. But it seems like the letters ə ü ö ğ ş ç ı don't appear in azwiki-latest-pages-articles.xml. For instance, the word "Azerbaycan" should be "Azərbaycan", but because word_count.json only contains "Azerbaycan", it always corrects it to "Azerbaycan" instead of "Azərbaycan".

Is it possible to create my own .txt file with different sentences in the Azerbaijani language and add this txt to your code, so that it will correct words based on the words in that txt file?

filyp commented 2 years ago

Sure, just give count_words a different filename than this xml you downloaded.
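For example (a sketch; az_sentences.txt is a hypothetical plain-text file of Azerbaijani sentences containing the ə ü ö ğ ş ç ı spellings):

from autocorrect.word_count import count_words

# 'az_sentences.txt' is a hypothetical file of your own Azerbaijani sentences;
# count_words tokenizes it with the 'az' regex and writes a new word_count.json,
# which you then re-pack into autocorrect/data/az.tar.gz as before.
count_words('az_sentences.txt', 'az', encd='utf-8')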

mshahcode commented 2 years ago

In the word_count.json file, the word "hüquqların" looks like this: "h\u00fcquqlar\u0131n": 9,
Is this normal, and can it somehow decrease the chance of correcting a word?

And also, I repeated the word "azərbaycan" 8 times in my txt, but it didn't appear in the word_count.json file. Why?

filyp commented 2 years ago

Yup, it's normal, that's just how JSON escapes non-ASCII characters.
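(Concretely, that is the default JSON escaping of non-ASCII characters; the escaped form decodes back to exactly the same string, so lookups still match. A quick check:)

import json

# json.dumps escapes non-ASCII by default (ensure_ascii=True),
# but json.loads recovers the original characters exactly.
data = {"hüquqların": 9}
dumped = json.dumps(data)
print(dumped)                      # {"h\u00fcquqlar\u0131n": 9}
print(json.loads(dumped) == data)  # True, the escaping is lossless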

As for the custom word_count.json, I don't know. Unfortunately, I don't have time to support you further with that.

mshahcode commented 2 years ago

Okay, thanks for everything