barrust / pyspellchecker

Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
MIT License

Function word_probability(word) returns 0.0 #55

Closed corinabioinformatic closed 5 years ago

corinabioinformatic commented 5 years ago

Hi, I am not sure how to use the word_probability(word) function. I am currently using pyspellchecker to complete a list of misspelled words, but it gives me an output of 0. What I need is the probability of each candidate in the candidate list printed just before it. Here is the code:

from spellchecker import SpellChecker

myListOfWords = ['medicin', 'increas', 'caus', 'daili', 'reduc', 'healthi', 'vaccin', 'diseas', 'intak', 'peopl', 'realli', 'diabet', 'exercis', 'possibl', 'pressur', 'bodi']

spell = SpellChecker()

for word in myListOfWords:
    print(spell.correction(word))  # the "best candidate", in theory
    print(spell.candidates(word))  # the candidates (I don't understand the order of the words)
    print(spell.word_probability(word))  # here I need the probability of each candidate, to see which are the best and second-best candidates

Why am I doing this? In the code you can see that the word 'diabet' returns 'diet' instead of 'diabetes'.

I would like to find an accurate correction related to my topic. As far as I know, my options are:

1) Passing the "distance=1" argument to the 'correction' function -> this does not fix the problem with the word 'diabet'.

2) Providing a text file dictionary with all the words of interest, as you suggested here (load_text_file). Question: what is the expected format for this txt file? Could you share an example?

3) Adding a new function that adapts the algorithm to the terminology I am using (health-related terminology), by means of a new argument (topic = "Health") that biases the spelling corrections toward the terminology related to that topic. Are you already developing anything like that in the module?

Could you please give me a clue about how to do this (questions 2 & 3)? Many thanks!

UPDATE: I am using the txt file of medical terms provided by @glutanimate & @dgreuel here. I think it partially solved the issue for my purposes.

barrust commented 5 years ago

So part of this is confusion about what word_probability is for and what it means. The word probability is the word's frequency relative to the whole corpus, that is, its count divided by the total number of words. It allows us to choose between two words that are both possible answers. Generally you will never need to use that function unless you want to inspect a word's value. You are likely getting 0.0 because those misspelled words are not in the dictionary. That could be a bug that should be resolved (likely by throwing an exception).
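To make the ratio concrete, here is a minimal stdlib-only sketch. It is not pyspellchecker's own code: it assumes the probability is simply a word's count divided by the total word count, and the toy counts and the helper rank_candidates are made up for illustration.

```python
from collections import Counter

# Toy corpus counts (made-up numbers, for illustration only).
word_counts = Counter({"diet": 50, "diabetes": 5, "the": 945})
total_words = sum(word_counts.values())


def word_probability(word):
    """Relative frequency of `word` in the toy corpus (0.0 if absent)."""
    return word_counts[word] / total_words


def rank_candidates(candidates):
    """Sort candidate corrections from most to least probable (hypothetical helper)."""
    return sorted(candidates, key=word_probability, reverse=True)


# A misspelled word is not in the corpus, so its ratio is zero.
print(word_probability("diabet"))
# Looking up each candidate instead gives a usable ranking.
print(rank_candidates({"diabetes", "diet"}))
```

With the library itself, the equivalent would presumably be to call spell.word_probability on each entry of spell.candidates(word) rather than on the misspelled word.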

As for your other questions:

  1. Yes, selecting a distance of 1 would not solve that problem as diet is likely more common in the corpus.
  2. The text file can be of any form. It could be sentences, just words, anything. It just has to be a *.txt file; the one you listed would be perfect. I would recommend, if you are looking for very specific terms, to build your own spelling database: documentation on building a new dictionary
  3. I am not adding anything to support topic-based dictionaries at the moment. Perhaps this could be its own issue, which could serve as a place to discuss the idea.
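Points 1 and 2 both come down to word counts: free-form text works as a dictionary source because only the counts matter, and adding domain text shifts which candidate is more common. The sketch below illustrates this with the standard library only. The tokenizer, the sample sentences, and the helper names are assumptions for illustration, and the library's own tokenization may differ.

```python
import re
from collections import Counter


def count_words(text):
    """Lowercase the text and count alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def preferred(counts, candidates):
    """Pick the candidate with the highest count (hypothetical helper)."""
    return max(candidates, key=lambda word: counts[word])


# Made-up sample text standing in for a general corpus and a medical word list.
general_text = "A balanced diet helps. My diet is simple. Diet matters."
medical_text = (
    "Diabetes care: type 2 diabetes, diabetes screening, "
    "gestational diabetes, diabetes diet."
)

general = count_words(general_text)
combined = general + count_words(medical_text)

candidates = ("diet", "diabetes")
print("general  ->", preferred(general, candidates))
print("combined ->", preferred(combined, candidates))
```

In this toy data 'diet' wins on the general text alone, while 'diabetes' wins once the medical text is added, which is the effect a domain-specific text file is meant to have.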

I hope this is helpful!

corinabioinformatic commented 5 years ago

Thank you very much, Barrust. I will take a deeper look at how the program works and at the linked documentation on building a new dictionary. Very interesting!