Closed pr3ssh closed 4 years ago
I have one guess. Make sure you have es.tar.gz in BOTH optional_languages and autocorrect/data. The Speller first looks for it in autocorrect/data, and if it's not there, it tries to download it from optional_languages on master. It's not merged to master yet, so the download will fail. If that doesn't help, paste any output you have and some way to reproduce.
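A quick way to check the two locations mentioned above is sketched here. The helper name `missing_archive_locations` is hypothetical and not part of autocorrect; it only assumes the directory names from the comment (autocorrect/data and optional_languages) relative to a checkout root, and the demo runs on a throwaway directory rather than a real clone.

```python
import tempfile
from pathlib import Path


def missing_archive_locations(repo_root, lang):
    """Return the expected <lang>.tar.gz paths that do not exist under repo_root."""
    name = f"{lang}.tar.gz"
    expected = [
        Path(repo_root) / "autocorrect" / "data" / name,
        Path(repo_root) / "optional_languages" / name,
    ]
    return [path for path in expected if not path.is_file()]


# Demo on a temporary directory tree (an assumption, not a real checkout):
# only autocorrect/data gets a placeholder file, so optional_languages is reported.
with tempfile.TemporaryDirectory() as root:
    data_dir = Path(root) / "autocorrect" / "data"
    data_dir.mkdir(parents=True)
    (data_dir / "es.tar.gz").write_bytes(b"")  # empty placeholder, not a real archive
    missing = missing_archive_locations(root, "es")
    missing_names = [path.parent.name for path in missing]

print(missing_names)
```

An empty list would mean the archive is present in both places.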
Before this PR, my first (local) attempt was to put es.tar.gz into the data folder in order to test:
spell = Speller(lang='es')
spell('hloa')
but the output was hloa instead of hola (the Spanish word for hello).
The README.md file suggests using
count_words('eswiki-latest-pages-articles.xml', 'ru')
to get the Spanish Wikipedia words. I think that's incorrect, since I'm adding the Spanish language. I changed it to
count_words('eswiki-latest-pages-articles.xml', 'es')
That's the only change I made in the process of adding the new language.
Ah, ok. The issue is probably that the word 'hloa' exists in Wikipedia, so the Speller doesn't try to correct it. The way I fixed it for other languages was to cut out rarely used words. You can do it by calling, for example:
spell = Speller(lang='es', threshold=4)
This uses only words that appeared at least 4 times in Wikipedia. You'll have to find the right threshold value empirically. After that, you can manually delete all those rare words from the file in es.tar.gz (it's already sorted, so it should be easy). Later, I will update the section about adding new languages, because this step is important.
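To illustrate the idea, here is a minimal sketch of threshold filtering on a word-count dictionary. It is not the library's implementation: `apply_threshold` is a hypothetical helper, the counts are made up, and the "at least `threshold` occurrences" rule is taken from the description above.

```python
def apply_threshold(word_counts, threshold):
    """Keep only words seen at least `threshold` times (assumed rule)."""
    return {word: count for word, count in word_counts.items() if count >= threshold}


# Made-up counts for illustration only. A rare misspelling such as "hloa"
# that appears in the corpus is treated as a known word unless it is cut out.
toy_counts = {"hola": 5200, "adiós": 3100, "hloa": 2, "holaa": 1}

kept = apply_threshold(toy_counts, threshold=4)
print(sorted(kept))

# Trying a few thresholds shows how quickly the vocabulary shrinks.
for candidate in (1, 2, 4):
    print(candidate, len(apply_threshold(toy_counts, candidate)))
```

Once "hloa" is no longer in the vocabulary, a speller has a reason to look for a nearby known word such as "hola".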
With the new threshold...
Original number of words: 12196114
After applying threshold: 288623
I'm not really sure if it's a lot, but I tested some words and the Speller does not work properly with lower threshold values.
For other languages I set it lower, like 4, but I think that Spanish has fewer variants of the same word, and the Spanish wiki is probably larger too. So as long as it passes the unit tests, it's fine.
I noticed es.tar.gz isn't stored in LFS, and I'd like to avoid bloating repo size. It probably happened because you forked before I set it up. You should be able to migrate it to LFS by running:
git lfs migrate import --include="*.tar.gz" --include-ref=refs/heads/master
And then force push.
It turned out LFS has a 1GB limit, after that it's paid and I've used up almost all of it. Also, there is no way to delete old, unnecessary files! :c I'll have to find some other way to store those tar.gz's. Storing them as regular files, without LFS is even worse, because there is a 500MB limit. I'll probably just put them in google drive. If you know of some better way let me know :)
:thinking: Google Drive or any other server you have (HTTP or FTP). Good luck with that :crossed_fingers:
Hi, I can't download es.tar.gz anymore, so could you mail it to me to filipsondej@protonmail.com? I will add it to my google drive.
I think you can download it from here: https://github.com/pr3ssh/autocorrect/tree/master/optional_languages
I can't, when I follow the link, it only gives LFS reference:
version https://git-lfs.github.com/spec/v1
oid sha256:cad1ce706de6f7f84e420ece653af8d0ade59774c9bab12cdb0350e8f3b1a32a
size 1757679
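For anyone hitting the same thing, a small check like the one below can tell a Git LFS pointer apart from the real archive before trying to unpack it. This is a sketch only: `parse_lfs_pointer` is a hypothetical helper, and it assumes the three-line `version` / `oid` / `size` layout shown above.

```python
def parse_lfs_pointer(text):
    """Return the pointer fields as a dict, or None if text is not an LFS pointer."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith("version https://git-lfs.github.com/spec/"):
        return None
    # Each pointer line is "<key> <value>".
    fields = dict(line.split(" ", 1) for line in lines if " " in line)
    if "oid" not in fields or not fields.get("size", "").isdigit():
        return None
    fields["size"] = int(fields["size"])  # size of the real file in bytes
    return fields


pointer_text = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:cad1ce706de6f7f84e420ece653af8d0ade59774c9bab12cdb0350e8f3b1a32a\n"
    "size 1757679\n"
)
info = parse_lfs_pointer(pointer_text)
print(info["oid"], info["size"])
```

If the function returns a dict, the downloaded file is only the reference and the actual tar.gz still has to be fetched from LFS or another source.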
ACK
I'll send it to you ASAP. I had a problem with my laptop and yesterday I lost all my local data.
OK, no hurry. Sorry for the lost data. Did you lose this tar.gz too?
Yes, the entire /home partition 😭
:< I know this pain, happened to me last month too
I merged your changes in ec15a64a12de3654125df5f260d8c6db0d502b97 instead of merging this pull request, to avoid adding es.tar.gz to the repo. I added that es.tar.gz you sent me to google drive.
Thank you for contributing :)
@fsondej it was a pleasure ;)
I added es.tar.gz as an optional language and also added unit test strings, but for some reason the Speller does not work properly. To create es.tar.gz, I followed the steps that appear in the README file. Any idea what could be wrong?