filyp / autocorrect

Spelling corrector in Python
GNU Lesser General Public License v3.0

German Regexp + Wordcount #33

Closed NiklasHoltmeyer closed 3 years ago

NiklasHoltmeyer commented 3 years ago

I added a regexp for the German language and trained German word counts on the Wikipedia dump from 07.05.2021 (https://dumps.wikimedia.org/dewiki/latest/): https://github.com/NiklasHoltmeyer/autocorrect/releases/tag/German
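For readers unfamiliar with this step, here is a minimal sketch of how word counts could be gathered with a German-aware regexp. The pattern, the `count_words` helper, and the lowercasing are assumptions made for illustration; they are not taken from this PR or from the library's code.

```python
import re
from collections import Counter

# Assumed pattern: ASCII letters plus German umlauts and eszett.
GERMAN_WORD = re.compile(r"[A-Za-zÄÖÜäöüß]+")


def count_words(lines):
    """Count lowercased word occurrences across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(word.lower() for word in GERMAN_WORD.findall(line))
    return counts


# Tiny stand-in for a Wikipedia dump.
sample = ["Die Straße ist groß.", "Über die Straße gehen."]
counts = count_words(sample)
```

On a real dump the lines would be streamed from the extracted text rather than held in a list.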

filyp commented 3 years ago

Great, can you upload the de.tar.gz file somewhere? Also, please add some unit tests, similar to the existing ones in test_all.py.

NiklasHoltmeyer commented 3 years ago

I uploaded de.tar.gz to https://github.com/NiklasHoltmeyer/autocorrect/releases/tag/German (direct download: https://github.com/NiklasHoltmeyer/autocorrect/releases/download/German/de.tar.gz).

I will try to write some tests within the next few days.

NiklasHoltmeyer commented 3 years ago

I added tests.

filyp commented 3 years ago

I see that de.tar.gz is 150 MB, which is a lot. Did you do the "find_threshold" step from https://github.com/fsondej/autocorrect#adding-new-languages to reduce the dictionary size? Or is it just because of the long compound words in German, so the dictionary has to contain a lot of words?
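To illustrate why a threshold shrinks the dictionary, here is a rough sketch that drops words seen fewer than a given number of times. `prune_by_threshold` is a hypothetical helper, not the library's "find_threshold" step, and the `>=` cutoff is an assumption.

```python
def prune_by_threshold(word_counts, threshold):
    """Keep only words whose count is at least `threshold` (assumed cutoff rule)."""
    return {word: count for word, count in word_counts.items() if count >= threshold}


# Made-up counts: rare misspellings fall below the cutoff and are dropped.
raw_counts = {"haus": 120, "hauss": 1, "donaudampfschiff": 3}
pruned = prune_by_threshold(raw_counts, threshold=3)
```

Rare tokens such as typos make up most of the distinct entries in a Wikipedia-derived count file, so even a small threshold can reduce the file size considerably.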

NiklasHoltmeyer commented 3 years ago

Sorry, I forgot to upload the clean version.

https://github.com/NiklasHoltmeyer/autocorrect/releases/tag/DE-Threshhold

Here is the threshold version.

filyp commented 3 years ago

Hmm, in the assets I only see the source code and no de.tar.gz like before.

NiklasHoltmeyer commented 3 years ago

Hm, strange, I couldn't see the files either, but they are there if I edit the release. I just re-uploaded them and now I can see them!

https://github.com/NiklasHoltmeyer/autocorrect/releases/tag/DE-Threshhold

filyp commented 3 years ago

Great, I made a new PR #34 with this dictionary URL added and fixed the black errors, so I'll close this one.

filyp commented 3 years ago

Also note that this de.tar.gz file has a bad directory structure inside, so it fails. I changed it and uploaded it to Dropbox.
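A quick way to check an archive's internal layout before uploading is to list its members. The sketch below builds a small in-memory archive for demonstration. The member name `de/word_count.json` is only a placeholder, since the thread does not state the layout the library expects.

```python
import io
import tarfile


def make_archive(member_name, payload=b"{}"):
    """Build a small in-memory .tar.gz holding a single file (demo only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def list_members(fileobj):
    """Return the member paths stored in a gzip-compressed tar archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        return tar.getnames()


# Placeholder layout; compare the result against a known-good archive.
names = list_members(make_archive("de/word_count.json"))
```

For a file on disk, `tarfile.open("de.tar.gz", "r:gz")` can be used in place of the in-memory buffer.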

filyp commented 3 years ago

Hmm, for some reason the tests still fail with this error:

json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 337958 column 1 (char 7633222)
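Python's parser raises this message when it expects a key string inside an object and finds something else, for example a key that is not wrapped in double quotes. The thread does not say what caused it here. As a sketch for narrowing such a failure down, the hypothetical helper below uses the `lineno` and `colno` attributes of `json.JSONDecodeError` to show the lines around the reported position.

```python
import json


def describe_json_error(text, context=1):
    """Return (line, column, nearby lines) for the first JSON error, or None if valid."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        lines = text.splitlines()
        # err.lineno is 1-based; slice out the surrounding lines.
        start = max(err.lineno - 1 - context, 0)
        end = min(err.lineno + context, len(lines))
        return err.lineno, err.colno, lines[start:end]
    return None


# Made-up malformed input: the key uses single quotes.
broken = "{\n'haus': 3\n}"
report = describe_json_error(broken)
```

For a large file such as the one in the traceback, reading it with `open(path, encoding="utf-8").read()` and passing the text to the helper would show what is on line 337958.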