Issues creating new dictionary

filyp / autocorrect

Spelling corrector in python

GNU Lesser General Public License v3.0


Closed fliuzzi02 closed 3 years ago

fliuzzi02 commented 3 years ago

While following the tutorial for the Hindi language (but using the Italian wiki dump instead), this error appeared:

'charmap' codec can't decode byte 0x9d in position 5050: character maps to <undefined>

This is the code used:

`from autocorrect.word_count import count_words`

`count_words('itwiki-latest-pages-articles.xml', 'it')`

filyp commented 3 years ago

Hmm, it looks like the Italian wiki has some characters that weren't present in the other wikis. I found a solution here https://stackoverflow.com/questions/49640513/unicodedecodeerror-charmap-codec-cant-decode-byte-0x9d-in-position-x-charac which suggests using the utf8 encoding. Try

count_words('itwiki-latest-pages-articles.xml', 'it', 'utf8')
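For context, here is a minimal sketch of why this kind of error shows up. It assumes the dump was being decoded with a Windows code page such as cp1252 (Python reports those as the 'charmap' codec), which is a guess based on the error message, not something confirmed in this thread. The helper `try_decode` and the sample string are made up for illustration and are not part of autocorrect.

```python
import os
import tempfile


def try_decode(raw: bytes, encoding: str) -> str:
    """Decode raw bytes, returning the error text instead of raising."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as err:
        return f"ERROR: {err}"


# U+201D (right double quotation mark) is encoded in UTF-8 as the bytes
# e2 80 9d. Byte 0x9d has no mapping in cp1252, so decoding fails there.
sample = "citazione \u201d fine"
data = sample.encode("utf-8")

print(try_decode(data, "cp1252"))  # reports a 'charmap' decode error
print(try_decode(data, "utf-8"))   # round-trips to the original text

# The same applies to files: passing encoding="utf-8" explicitly avoids
# depending on the platform's default encoding.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(data)
    with open(path, encoding="utf-8") as f:
        text_from_file = f.read()
```

Passing the encoding explicitly, as in the `count_words` call above, keeps the result the same across operating systems.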

fliuzzi02 commented 3 years ago

Thanks for the quick reply. I have just figured it out, and the problem was exactly that: I changed the count_words call to pass the utf-8 parameter.

fliuzzi02 commented 3 years ago

How long do you think generating the word JSON file will take?

P.S. Would you like me to send you the Italian JSON file?

filyp commented 3 years ago

It should take a few hours; it's hard to tell exactly. Sure, send it. Also, make a PR with the changes you have.