R1j1t / contextualSpellCheck

✔️Contextual word checker for better suggestions
MIT License
405 stars 56 forks source link

Adding Damerau, case-insensitivity comparison, Bayesian decision making options #88

Open fingoldo opened 1 year ago

fingoldo commented 1 year ago

Describe the problem you met

Hi, thanks for this useful lib, I think it can be improved even further! ;-)

I was disappointed by the lib output for this case:

doc = nlp(
    "Visa to dubia ",
)

print(doc._.performed_spellCheck)
print(doc._.outcome_spellCheck)
print(doc._.score_spellCheck)

True Visa to. {dubia: [('.', 0.72154), (';', 0.07971), ('?', 0.04526), ('Norway', 0.00669), ('Estonia', 0.00593), ('Taiwan', 0.00555), ('Bermuda', 0.00531), ('Iceland', 0.00479), ('Italy', 0.00466), ('Germany', 0.00437)]}

Describe the solution you'd like Obviously, top_n parameter which is internal to the candidate_ranking method should be made a part of constructor. I made that adjustment locally, and unexpectedly, Cuba became a winner instead of Dubai. 2 reasons were immediately evident: 1) candidates to misspell comparisons are case-sensitive, which is not good for accuracy 2) symbols swaps are considered of the same cost as 1 addition and one deletion, which is clearly a drawback.

Therefore, optional case insensitivity and Damerau-Levenshtein distance instead of plain Levenshtein are desired. Once implemented locally, I arrived at the desired outcome (Dubai).

However, it felt as a lucky coincidence, as the probabilities coming from underlying Bert were not accounted for when making a decision. I thought it would be very natural to be able to handle them in conjunction with the textual similarities in a Bayesian fashion, where Bert probs could be our prior, cand to misspell textual sim would be B/A conditional probs of B being the correct guess given that A was typed in. To soften the impact of possible Bert miscalibration, absolute probs conversion to ranks would be desired.

Summary of my suggestions: feature request could be implemented by adding a few parameters to the class constructor:

top_n (int, optional): suggestions from underlying ANN model to be considered. Defaults to 10. lowercased_distance (bool, optional): lowercase candidates before computing edit distance. Defaults to True. damerau_distance (bool, optional): additionally account for symbol swaps when calculating a distance. Defaults to True. bayes_selection (bool, optional): use bayes reasoning when selecting the best candidate. Bert probabilities are the prior, textual similarities of candidates to the input are treated as the probabilities B/A that the correct candidate is A, while the input was B. Defaults to True. ranked_bert_probs (bool, optional): use ranked probs as opposed to the absolute probs values coming from Bert. Defaults to True.

Calling code would become: contextualSpellCheck.add_to_pipe(nlp,config=dict(top_n=500,lowercased_distance =True,damerau_distance =True,bayes_selection =True,ranked_bert_probs =False))

After the implementation, I arrive at Dubai instead of the dot, which makes me happy )

Describe alternatives you've considered I was not able to find any alternative solution. Tried a few Spacy models, to no avail.

Additional context I used

            Doc.set_extension("bayes_probs", default=None)
            Doc.set_extension("bayes_details", default=None)

to provide additional details for the occasional exigent user:

screenshot detailing resulting candidates rating and bayesian details

If you think the community can benefit from this new functionality, I'm happy to merge my PR.

fingoldo commented 1 year ago

Created a fork with aforementioned functionality, as this repo seems to be dead.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

R1j1t commented 1 year ago

Thanks @fingoldo for the suggestion. Do you mind creating a PR and I can review it over the weekend and merge it if all looks okay!

I was not able to give much time to the repo recently but I plan to change it! Always happy to accept contributions and discuss new ideas!

fingoldo commented 1 year ago

Thanks @fingoldo for the suggestion. Do you mind creating a PR and I can review it over the weekend and merge it if all looks okay!

I was not able to give much time to the repo recently but I plan to change it! Always happy to accept contributions and discuss new ideas!

Thanks for responding! Submitted the PR. It worked for me but not 100% sure it will work on every edge case. I hope you have some tests to run.

R1j1t commented 1 year ago

Thanks for the PR @fingoldo, I can review the PR and work on this to get this published soon!

stale[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

fingoldo commented 11 months ago

Thanks for the PR @fingoldo, I can review the PR and work on this to get this published soon!

Hi, is there any progress with merging? Any way I can help?

R1j1t commented 11 months ago

I am sorry for the delay, I was occupied with couple of other stuff. I will fix the pipeline issues over the weekend and we'll see from there!

robinsonkwame commented 10 months ago

Hi, is there any progress with merging?

fingoldo commented 10 months ago

Hi, is there any progress with merging?

Thanks for asking, I have adjusted the PR according to the instructions of Rajat, let's wait for his checking and hopefully approving now )

fingoldo commented 9 months ago

I am sorry for the delay, I was occupied with couple of other stuff. I will fix the pipeline issues over the weekend and we'll see from there!

Can we make the last effort please and finish the undertaking? )