barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
MIT License

Spellchecker for Bahasa (id) #82

Closed Balurc closed 3 years ago

Balurc commented 3 years ago

Hi,

Thank you for such great work; it helps me a lot. Would it be possible for you to update the resources with Bahasa Indonesia (id)? I have downloaded the text file from here, https://github.com/hermitdave/FrequencyWords/blob/master/content/2018/nl/id_full.txt, and converted it to JSON.

Then I followed these steps:

from spellchecker import SpellChecker
spell = SpellChecker(language=None)
spell.word_frequency.load_dictionary('id_full.json')

Then I created a list of some misspelled and correctly spelled words in Bahasa Indonesia. Here 'makn' is a misspelling of 'makan' and 'sakt' is a misspelling of 'sakit'; these are common words in Bahasa:

misspelled = spell.unknown(['makn', 'minum', 'sakt', 'kemana', 'gila'])

Then I ran the loop below, and it printed nothing:

for word in misspelled:
    print(spell.correction(word))
    print(spell.candidates(word))

Is there something wrong with any of the steps above?

barrust commented 3 years ago

@Balurc thank you for your kind words. I am glad that this project has been helpful!

So I built the same dictionary as you did, following the same process: build the index, then dump it as a JSON object:

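import json

input = "id_full.txt"
output = "id_full.json"

word_frequency = dict()

# Each line of the source file has the form "<word> <count>".
with open(input, "r", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word_frequency[parts[0]] = int(parts[1])

# ensure_ascii=False keeps non-ASCII words readable in the output file.
with open(output, "w", encoding="utf-8") as f:
    json.dump(word_frequency, f, indent="", sort_keys=True, ensure_ascii=False)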

I then ran your code to see if there was something wrong and I too got nothing printed out.

from spellchecker import SpellChecker

spell = SpellChecker(language=None, local_dictionary=output)
misspelled = spell.unknown(['makn', 'minum', 'sakt', 'kemana', 'gila'])
for word in misspelled:
    print(spell.correction(word))
    print(spell.candidates(word))

So I looked at the JSON object that was built to see what could be wrong, and this is what I found:

{
    "gila": 30302,
    "kemana": 16134,
    "makn": 1,
    "minum": 20376,
    "sakt": 2
}

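So it looks like the words you were testing with made it into the dictionary from the source as actual words, including the misspellings 'makn' and 'sakt' with very low counts. Because spell.unknown() only returns words that are not in the dictionary, it returned an empty set and the loop body never ran.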
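
To make the effect concrete, here is a minimal sketch in plain Python that mimics the membership check using only the counts shown above. The helper `unknown_words` is a hypothetical stand-in written for this illustration, not the library's actual implementation:

```python
# Counts copied from the JSON entries shown above.
word_frequency = {
    "gila": 30302,
    "kemana": 16134,
    "makn": 1,
    "minum": 20376,
    "sakt": 2,
}


def unknown_words(words, known):
    """Hypothetical stand-in for spell.unknown(): keep words missing from `known`."""
    return {word for word in words if word.lower() not in known}


misspelled = unknown_words(['makn', 'minum', 'sakt', 'kemana', 'gila'], word_frequency)
print(misspelled)  # prints set(): every test word is already a dictionary entry

# Iterating over an empty set runs the body zero times, so nothing is printed.
for word in misspelled:
    print(word)
```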

I have built a script to automate the building of the supported dictionaries (scripts/build_dictionary.py). It would be possible to add id to the list of dictionaries, but figuring out how to clean up the source data is always the issue. Any help would be appreciated.
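
As one possible starting point for that cleanup, here is a rough sketch that drops entries whose count falls below a cutoff. The function name `drop_rare_words` and the `min_count=50` cutoff are assumptions made for this illustration; a real cutoff would need tuning against the full id_full.txt data, and this is not what the build script does today:

```python
def drop_rare_words(word_frequency, min_count=50):
    """Keep only entries seen at least `min_count` times (cutoff is an assumed value)."""
    return {word: count for word, count in word_frequency.items() if count >= min_count}


# Counts copied from the JSON entries shown earlier in this thread.
raw = {"gila": 30302, "kemana": 16134, "makn": 1, "minum": 20376, "sakt": 2}

cleaned = drop_rare_words(raw)
print(sorted(cleaned))  # ['gila', 'kemana', 'minum']
```

With the rare entries gone, 'makn' and 'sakt' would no longer count as known words. A frequency cutoff is a blunt tool, though: it can also discard rare but valid words. If I recall correctly, spell.word_frequency.remove_by_threshold() does something similar on an already loaded dictionary, but check the documentation for your installed version.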

barrust commented 3 years ago

@Balurc, did this resolve your issue? I am going to close this for now. If you are still having issues, please re-open with new details.