filyp / autocorrect

Spelling corrector in python
GNU Lesser General Public License v3.0

Support Spanish language #1

Closed pr3ssh closed 4 years ago

pr3ssh commented 4 years ago

I added es.tar.gz as an optional language and also added unit test strings, but for some reason the Speller does not work properly. To create es.tar.gz, I followed the steps in the README file. Any idea what could be wrong?

filyp commented 4 years ago

I have one guess. Make sure you have es.tar.gz BOTH in optional_languages and autocorrect/data. Speller first looks for it in autocorrect/data, and if it's not there, it tries to download it from optional_languages on master. It's not merged to master yet, so that will fail. If that doesn't help, paste any output you have and a way to reproduce the problem.
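The lookup order described above can be sketched roughly as follows. This is only an illustration of "check the local data directory first, then fall back to a download"; the function name `resolve_language_archive` and the `download` callback are hypothetical and are not taken from the library's code.

```python
from pathlib import Path


def resolve_language_archive(lang, data_dir, download):
    """Return the path to <lang>.tar.gz, downloading it only if it is missing.

    `download` is a hypothetical callback taking (lang, destination_path);
    it stands in for fetching the file from optional_languages on master.
    """
    archive = Path(data_dir) / f"{lang}.tar.gz"
    if archive.exists():
        # Found locally, so no network access is attempted.
        return archive
    # Fallback: try to fetch the archive into the data directory.
    download(lang, archive)
    if not archive.exists():
        raise FileNotFoundError(f"{lang}.tar.gz is neither local nor downloadable")
    return archive
```

Under this reading, a language file that exists only in an unmerged branch would hit the fallback and fail, which matches the guess above.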

pr3ssh commented 4 years ago

Before this PR, my first (local) attempt was to put es.tar.gz into the data folder in order to test

from autocorrect import Speller
spell = Speller(lang='es')
spell('hloa')

but the output was hloa instead of hola (the Spanish word for hello).


The README.md file suggests using

count_words('eswiki-latest-pages-articles.xml', 'ru')

to get the Spanish Wikipedia words. I think that's incorrect, since I'm adding the Spanish language, so I changed it to

count_words('eswiki-latest-pages-articles.xml', 'es')

That's the only change I made in the process of adding the new language.

filyp commented 4 years ago

Ah, ok. The issue is probably that the word 'hloa' exists in Wikipedia, so the Speller doesn't try to correct it. The way I fixed it for other languages was to cut out rarely used words. You can do it by calling, for example:

spell = Speller(lang='es', threshold=4)

This uses only words that appeared at least 4 times in Wikipedia. You'll have to find the right threshold value empirically. After that, you can manually delete all those rare words from the file in es.tar.gz (it's already sorted, so it should be easy). Later, I will update the section about adding new languages, because this step is important.
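To illustrate why a threshold helps, here is a minimal sketch of the filtering step. The counts are made up, the helper name `apply_threshold` is hypothetical, and the "at least `threshold` times" comparison follows the description in this comment rather than the library's source.

```python
def apply_threshold(word_counts, threshold):
    """Keep only words that appear at least `threshold` times (assumed semantics)."""
    return {word: count for word, count in word_counts.items() if count >= threshold}


# Made-up counts: a rare typo such as "hloa" can occur in a large corpus.
counts = {"hola": 5200, "hloa": 2, "hora": 3100}

known = apply_threshold(counts, threshold=4)
# "hloa" is no longer a known word, so a corrector is free to fix it.
```

Without the filter, "hloa" counts as a valid word simply because it occurs in the corpus, so it is returned unchanged.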

pr3ssh commented 4 years ago

With the new threshold...

Original number of words: 12196114
After applying threshold: 288623

I'm not really sure if that's a lot, but I tested some words and the Speller does not work properly with lower threshold values.

filyp commented 4 years ago

For other languages I set it lower, like 4, but I think Spanish has fewer variants of the same words, and the Spanish wiki is probably larger as well. So as long as it passes the unit tests, it's fine.
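One way to search for a threshold empirically is to sweep candidate values and keep the smallest one that passes a set of test cases. The sketch below uses a toy single-edit corrector (in the style of Norvig's well-known spelling corrector) as a stand-in, not the Speller itself; all names and counts are hypothetical.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one delete, transpose, replace, or insert away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return `word` if it is known, else the most frequent known single-edit neighbor."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word


def smallest_passing_threshold(counts, cases, thresholds):
    """Return the lowest threshold at which every (typo, expected) case passes."""
    for t in sorted(thresholds):
        kept = {w: c for w, c in counts.items() if c >= t}
        if all(correct(typo, kept) == expected for typo, expected in cases):
            return t
    return None


# Made-up counts: both typos occur in the corpus, one more often than the other.
counts = {"hola": 5200, "hloa": 2, "mundo": 4100, "mudno": 7}
cases = [("hloa", "hola"), ("mudno", "mundo")]
best = smallest_passing_threshold(counts, cases, [1, 4, 8, 16])
```

With these toy counts a threshold of 4 still keeps "mudno" as a known word, so a higher value is needed; a larger corpus tends to push the workable threshold up in the same way.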

filyp commented 4 years ago

I noticed es.tar.gz isn't stored in LFS, and I'd like to avoid bloating repo size. It probably happened because you forked before I set it up. You should be able to migrate it to LFS by running:

git lfs migrate import --include="*.tar.gz" --include-ref=refs/heads/master

And then force push.

filyp commented 4 years ago

It turned out LFS has a 1GB limit, after that it's paid and I've used up almost all of it. Also, there is no way to delete old, unnecessary files! :c I'll have to find some other way to store those tar.gz's. Storing them as regular files, without LFS is even worse, because there is a 500MB limit. I'll probably just put them in google drive. If you know of some better way let me know :)

pr3ssh commented 4 years ago

:thinking: Google Drive or any other server you have (HTTP or FTP). Good luck with that :crossed_fingers:

filyp commented 4 years ago

Hi, I can't download es.tar.gz anymore, so could you mail it to me at filipsondej@protonmail.com? I will add it to my google drive.

pr3ssh commented 4 years ago

I think you can download it from here: https://github.com/pr3ssh/autocorrect/tree/master/optional_languages

filyp commented 4 years ago

I can't, when I follow the link, it only gives LFS reference:

version https://git-lfs.github.com/spec/v1
oid sha256:cad1ce706de6f7f84e420ece653af8d0ade59774c9bab12cdb0350e8f3b1a32a
size 1757679
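A downloaded file can be checked for this situation before trying to unpack it. The sketch below relies only on the pointer layout shown above (a `version` line followed by `oid` and `size`); the helper names are hypothetical.

```python
LFS_SPEC_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(data: bytes) -> bool:
    """True if the bytes look like a Git LFS pointer rather than the real archive."""
    return data.startswith(LFS_SPEC_PREFIX)


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a pointer file into a dictionary entry."""
    return dict(line.split(" ", 1) for line in text.splitlines() if line.strip())


pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:cad1ce706de6f7f84e420ece653af8d0ade59774c9bab12cdb0350e8f3b1a32a\n"
    "size 1757679\n"
)
fields = parse_lfs_pointer(pointer)
# fields["size"] is the size in bytes of the real file that the pointer stands for
```

A genuine .tar.gz starts with the gzip magic bytes `1f 8b`, so the prefix check cleanly separates the two cases.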

pr3ssh commented 4 years ago

ACK

I'll send it to you ASAP. I had a problem with my laptop and yesterday I lost all my local data.

filyp commented 4 years ago

OK, no hurry. Sorry for the lost data. Did you lose this tar.gz too?

pr3ssh commented 4 years ago

Yes, the entire /home partition 😭

filyp commented 4 years ago

:< I know this pain, happened to me last month too

filyp commented 4 years ago

I merged your changes in ec15a64a12de3654125df5f260d8c6db0d502b97 instead of merging this pull request, to avoid adding es.tar.gz to the repo. I added that es.tar.gz you sent me to google drive.

Thank you for contributing :)

pr3ssh commented 4 years ago

@fsondej it was a pleasure ;)