filyp / autocorrect

Spelling corrector in python
GNU Lesser General Public License v3.0

Improve model #36

Closed ByUnal closed 3 years ago

ByUnal commented 3 years ago

First of all, this is a great tool; thank you for your work. Although Turkish NLP is a difficult task, your model handles it successfully. However, it makes mistakes on some of the sentences I used for testing: it fails to fix certain words. I would like to improve the model. What can I do? If I collected a huge corpus and trained from scratch, would that be useful?

filyp commented 3 years ago

I'm glad you like it :) I doubt that more data would help; it was already trained on the whole of Wikipedia. It's possible that the threshold value for Turkish was set poorly, so one solution would be to repeat the steps in https://github.com/filyp/autocorrect#adding-new-languages but with more Turkish words in the tests. The threshold value found that way should be better.
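To illustrate why more test words can lead to a better threshold, here is a toy sketch. It assumes the threshold acts as a minimum corpus count below which words are dropped from the vocabulary; the thread does not confirm that this is how autocorrect uses it. The word counts, the test pairs, and the helper names (`edits1`, `correct`, `accuracy`, `best_threshold`) are all made up for illustration and are not the library's API.

```python
# Toy model only: hypothetical counts and helpers, not autocorrect's code.
ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(word, vocab):
    """Keep known words; otherwise pick the most frequent known neighbour."""
    if word in vocab:
        return word
    candidates = [w for w in edits1(word) if w in vocab]
    return max(candidates, key=vocab.get) if candidates else word

def accuracy(threshold, counts, pairs):
    """Share of (typo, expected) pairs fixed when rare words are dropped."""
    vocab = {w: c for w, c in counts.items() if c >= threshold}
    hits = sum(correct(typo, vocab) == expected for typo, expected in pairs)
    return hits / len(pairs)

def best_threshold(counts, pairs, candidates):
    """Pick the candidate threshold with the highest test accuracy."""
    return max(candidates, key=lambda t: accuracy(t, counts, pairs))

# Hypothetical corpus counts: typos such as "kitab" also occur in the corpus.
counts = {"kitap": 900, "kitab": 3, "kalem": 500, "kalm": 2}
# Hypothetical test pairs: (input word, expected output).
pairs = [("kitab", "kitap"), ("kalm", "kalem"), ("kitap", "kitap")]

chosen = best_threshold(counts, pairs, [0, 10, 1000])
```

In this toy setup a threshold that is too low keeps corpus typos as "known" words, and one that is too high removes the correct words as well. The more test pairs there are, the more reliably the search separates the two cases.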

Apart from that, the only thing I think could help is processing the context of each word, but that would be a major change and I don't have time to implement it. You could check out other tools that look at context, like this one: https://github.com/neuspell/neuspell If you do, let me know how well it worked.

ByUnal commented 3 years ago

I tried neuspell. It seems to be a very good tool for English, but it does not work well for Turkish. I fine-tuned it, and the results were very bad. I tried fine-tuning with different models and parameters, and at some point it threw a "Buy new RAM!" error. If I could complete those training runs, maybe it would work, but so far it is not useful for me. Now I'm going to try the threshold approach you mentioned.

ByUnal commented 3 years ago

By the way, I'm trying to add a new language (Turkish). I downloaded the wiki file, and when I entered count_words('trwiki-latest-pages-articles.xml', 'tr'), it threw this error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 4164: character maps to <undefined>

filyp commented 3 years ago

You could try count_words(..., encd='utf-8'); as I recall, someone had a similar issue in the past.
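For context, a 'charmap' error like this usually means a UTF-8 file is being read with a legacy single-byte code page such as cp1252, often the default on Windows. The sketch below reproduces that situation on a tiny temporary file. The sample text and file name are made up, and it does not call count_words itself.

```python
import os
import tempfile

# Hypothetical sample: "Á" is stored as bytes C3 81 in UTF-8, and byte
# 0x81 has no mapping in Python's cp1252 codec.
sample = "Ángel ışık"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Reading with a legacy code page fails on the undefined byte.
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        cp1252_error = None
    except UnicodeDecodeError as exc:
        cp1252_error = exc

    # Reading with the file's real encoding succeeds.
    with open(path, encoding="utf-8") as f:
        decoded = f.read()

print(cp1252_error)
print(decoded)
```

Passing the encoding explicitly, as encd='utf-8' appears to do for count_words, avoids depending on the platform default.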

ByUnal commented 3 years ago

OK, thanks.