R1j1t / contextualSpellCheck

✔️ Contextual word checker for better suggestions
MIT License

outcome_spellCheck and score_spellCheck do not match #45

Closed: justinbhopper closed this issue 3 years ago

justinbhopper commented 3 years ago

Describe the bug

I am not sure if I am using this wrong, but the data in score_spellCheck does not seem to match the final outcome_spellCheck output. I assumed outcome_spellCheck is the result of applying the spelling corrections with the highest probability score?

To Reproduce

import en_core_web_lg
import contextualSpellCheck

nlp = en_core_web_lg.load()
contextualSpellCheck.add_to_pipe(nlp)

tests = [
    "John says he is feeling depresed.",
    "Mary admits to drug adiction and forming bad habbits."
]

for test in tests:
    print("Input: ", test)
    doc = nlp(test)

    print("outcome_spellCheck:", doc._.outcome_spellCheck)
    print("score_spellCheck:")

    for token, suggestions in doc._.score_spellCheck.items():
        for suggestion in suggestions:
            suggested, score = suggestion
            print("  ", token.text, "->", suggested, " (" + str(score) +")")

Output:

Input:  John says he is feeling depresed.
outcome_spellCheck: John says he is feeling depressed.
score_spellCheck:
   depresed -> better  (0.14048)
   depresed -> sick  (0.06199)
   depresed -> fine  (0.04419)
   depresed -> well  (0.04254)
   depresed -> tired  (0.03723)
   depresed -> guilty  (0.03161)
   depresed -> good  (0.03133)
   depresed -> ill  (0.02869)
   depresed -> depressed  (0.02555)
   depresed -> bad  (0.0251)

Input:  Mary admits to drug adiction and forming bad habbits.
outcome_spellCheck: Mary admits to drug addiction and forming bad habits.
score_spellCheck:
   adiction -> ##ging  (0.50065)
   adiction -> use  (0.12457)
   adiction -> addiction  (0.09937)
   adiction -> dealing  (0.08391)
   adiction -> abuse  (0.07905)
   adiction -> trafficking  (0.01413)
   adiction -> ##gies  (0.00674)
   adiction -> drinking  (0.0065)
   adiction -> driving  (0.00415)
   adiction -> ##king  (0.00372)
   habbits -> relationships  (0.35707)
   habbits -> habits  (0.20654)
   habbits -> memories  (0.12251)
   habbits -> dreams  (0.0941)
   habbits -> thoughts  (0.02157)
   habbits -> alliances  (0.01681)
   habbits -> bonds  (0.01669)
   habbits -> marriages  (0.01668)
   habbits -> feelings  (0.01495)
   habbits -> plans  (0.01093)


Additional information

Note how outcome_spellCheck produces a good final output. However, score_spellCheck contains corrections that are wildly unexpected; they look more like synonyms than actual spelling corrections (e.g. "relationships" is nowhere close to the spelling of "habbits"). Note how "depressed" got a miserable 0.02555 score, listed well below other corrections that are much farther from the original word.

R1j1t commented 3 years ago

In the current logic, the spell-correction pipeline can be divided as follows:

  1. Misspelled-word identification
  2. Candidate generation for replacement
  3. Selection from the candidates

So, for the 3rd step, i.e. selection, I am using Levenshtein distance, a metric that measures the difference between two strings (here, the misspelling and a candidate). score_spellCheck gives you the candidate words and their probabilities from the BERT prediction (step 2 above), but step 3 uses only the Levenshtein distance to pick a candidate. Hence outcome_spellCheck produces its output based not on the model's probability but on edit distance.
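For illustration, a minimal sketch of that selection step (not the package's actual code: pick_candidate is a hypothetical helper, and the candidate list is copied from the output above):

import editdistance  # the same library the package uses for the metric

def pick_candidate(misspelled, candidates):
    # Hypothetical helper: candidates is a list of (word, probability)
    # pairs as in score_spellCheck. The BERT probability is ignored here;
    # only the Levenshtein distance to the misspelling decides.
    return min(candidates, key=lambda c: editdistance.eval(misspelled, c[0]))[0]

candidates = [("better", 0.14048), ("sick", 0.06199), ("depressed", 0.02555)]
print(pick_candidate("depresed", candidates))  # depressed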

So, your doubt is valid: we could use the probability as the metric for step 3. I finalized and implemented the current logic after reading the available literature. Please feel free to suggest another implementation, or a paper for reference; this package is still under development ahead of a v1.0 release.

Feel free to open a PR if you would like to contribute!

justinbhopper commented 3 years ago

@R1j1t I think using the Levenshtein distance is the right approach. I should probably just use suggestions_spellCheck to get the closest candidate; I mistakenly thought the score represented the Levenshtein distance.
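For reference, a minimal usage sketch of that approach, reusing the pipeline setup from the reproduction above (suggestions_spellCheck should map each flagged token to the single selected candidate):

import en_core_web_lg
import contextualSpellCheck

nlp = en_core_web_lg.load()
contextualSpellCheck.add_to_pipe(nlp)

doc = nlp("Mary admits to drug adiction and forming bad habbits.")

# Each flagged token mapped to the candidate picked by edit distance.
for token, suggestion in doc._.suggestions_spellCheck.items():
    print(token.text, "->", suggestion)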

justinbhopper commented 3 years ago

@R1j1t One last question - is it possible to get the Levenshtein distance value for the suggested candidate?

R1j1t commented 3 years ago

Great! Regarding your question about getting the Levenshtein distance: it is not recorded in any spaCy extension.

But this package relies on https://github.com/roy-ht/editdistance to calculate the metric, so you can use that library in your own code: https://github.com/R1j1t/contextualSpellCheck/blob/f8cbeb8a7d5dc085f9f8cc5d27d390848d2df274/contextualSpellCheck/contextualSpellCheck.py#L397
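For example, a hedged sketch of computing the distance yourself, assuming doc was processed by the pipeline as in the earlier examples:

import editdistance

# editdistance.eval returns the Levenshtein distance between two strings,
# the same metric the package uses internally for candidate selection.
for token, suggestion in doc._.suggestions_spellCheck.items():
    print(token.text, "->", suggestion,
          "| edit distance:", editdistance.eval(token.text, suggestion))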

If you want to help by adding an extra spaCy extension (suggestion_edit_distance), please open a PR! You can check this code for reference: https://github.com/R1j1t/contextualSpellCheck/blob/f8cbeb8a7d5dc085f9f8cc5d27d390848d2df274/contextualSpellCheck/contextualSpellCheck.py#L139-L148
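A rough sketch of what registering such an extension might look like (suggestion_edit_distance is the proposed name from this thread, not an existing extension):

import editdistance
from spacy.tokens import Doc

# Register the proposed extension; the guard avoids re-registration errors.
if not Doc.has_extension("suggestion_edit_distance"):
    Doc.set_extension("suggestion_edit_distance", default=None)

def record_edit_distances(doc):
    # Populate the proposed extension from the existing suggestions.
    doc._.suggestion_edit_distance = {
        token: editdistance.eval(token.text, suggestion)
        for token, suggestion in doc._.suggestions_spellCheck.items()
    }
    return doc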