Adding Damerau, case-insensitivity comparison, Bayesian decision making options

fingoldo commented 1 year ago

Describe the problem you met

Hi, thanks for this useful lib, I think it can be improved even further! ;-)

I was disappointed by the lib output for this case:

doc = nlp(
    "Visa to dubia ",
)

print(doc._.performed_spellCheck)
print(doc._.outcome_spellCheck)
print(doc._.score_spellCheck)

True Visa to. {dubia: [('.', 0.72154), (';', 0.07971), ('?', 0.04526), ('Norway', 0.00669), ('Estonia', 0.00593), ('Taiwan', 0.00555), ('Bermuda', 0.00531), ('Iceland', 0.00479), ('Italy', 0.00466), ('Germany', 0.00437)]}

Describe the solution you'd like Obviously, top_n parameter which is internal to the candidate_ranking method should be made a part of constructor. I made that adjustment locally, and unexpectedly, Cuba became a winner instead of Dubai. 2 reasons were immediately evident: 1) candidates to misspell comparisons are case-sensitive, which is not good for accuracy 2) symbols swaps are considered of the same cost as 1 addition and one deletion, which is clearly a drawback.

Therefore, optional case insensitivity and Damerau-Levenshtein distance instead of plain Levenshtein are desired. Once implemented locally, I arrived at the desired outcome (Dubai).

However, it felt as a lucky coincidence, as the probabilities coming from underlying Bert were not accounted for when making a decision. I thought it would be very natural to be able to handle them in conjunction with the textual similarities in a Bayesian fashion, where Bert probs could be our prior, cand to misspell textual sim would be B/A conditional probs of B being the correct guess given that A was typed in. To soften the impact of possible Bert miscalibration, absolute probs conversion to ranks would be desired.

Summary of my suggestions: feature request could be implemented by adding a few parameters to the class constructor:

top_n (int, optional): suggestions from underlying ANN model to be considered. Defaults to 10. lowercased_distance (bool, optional): lowercase candidates before computing edit distance. Defaults to True. damerau_distance (bool, optional): additionally account for symbol swaps when calculating a distance. Defaults to True. bayes_selection (bool, optional): use bayes reasoning when selecting the best candidate. Bert probabilities are the prior, textual similarities of candidates to the input are treated as the probabilities B/A that the correct candidate is A, while the input was B. Defaults to True. ranked_bert_probs (bool, optional): use ranked probs as opposed to the absolute probs values coming from Bert. Defaults to True.

Calling code would become: contextualSpellCheck.add_to_pipe(nlp,config=dict(top_n=500,lowercased_distance =True,damerau_distance =True,bayes_selection =True,ranked_bert_probs =False))

After the implementation, I arrive at Dubai instead of the dot, which makes me happy )

Describe alternatives you've considered I was not able to find any alternative solution. Tried a few Spacy models, to no avail.

Additional context I used

            Doc.set_extension("bayes_probs", default=None)
            Doc.set_extension("bayes_details", default=None)

to provide additional details for the occasional exigent user:

screenshot detailing resulting candidates rating and bayesian details

If you think the community can benefit from this new functionality, I'm happy to merge my PR.

fingoldo commented 1 year ago

Created a fork with aforementioned functionality, as this repo seems to be dead.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

R1j1t commented 1 year ago

Thanks @fingoldo for the suggestion. Do you mind creating a PR and I can review it over the weekend and merge it if all looks okay!

I was not able to give much time to the repo recently but I plan to change it! Always happy to accept contributions and discuss new ideas!

fingoldo commented 1 year ago

Thanks @fingoldo for the suggestion. Do you mind creating a PR and I can review it over the weekend and merge it if all looks okay!

I was not able to give much time to the repo recently but I plan to change it! Always happy to accept contributions and discuss new ideas!

Thanks for responding! Submitted the PR. It worked for me but not 100% sure it will work on every edge case. I hope you have some tests to run.

R1j1t commented 1 year ago

Thanks for the PR @fingoldo, I can review the PR and work on this to get this published soon!

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

fingoldo commented 1 year ago

Thanks for the PR @fingoldo, I can review the PR and work on this to get this published soon!

Hi, is there any progress with merging? Any way I can help?

R1j1t commented 1 year ago

I am sorry for the delay, I was occupied with couple of other stuff. I will fix the pipeline issues over the weekend and we'll see from there!

robinsonkwame commented 1 year ago

Hi, is there any progress with merging?

fingoldo commented 1 year ago

Hi, is there any progress with merging?

Thanks for asking, I have adjusted the PR according to the instructions of Rajat, let's wait for his checking and hopefully approving now )

fingoldo commented 12 months ago

I am sorry for the delay, I was occupied with couple of other stuff. I will fix the pipeline issues over the weekend and we'll see from there!

Can we make the last effort please and finish the undertaking? )

R1j1t / contextualSpellCheck

Adding Damerau, case-insensitivity comparison, Bayesian decision making options #88