Open fingoldo opened 1 year ago
Created a fork with aforementioned functionality, as this repo seems to be dead.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Thanks @fingoldo for the suggestion. Do you mind creating a PR and I can review it over the weekend and merge it if all looks okay!
I was not able to give much time to the repo recently but I plan to change it! Always happy to accept contributions and discuss new ideas!
Thanks @fingoldo for the suggestion. Do you mind creating a PR and I can review it over the weekend and merge it if all looks okay!
I was not able to give much time to the repo recently but I plan to change it! Always happy to accept contributions and discuss new ideas!
Thanks for responding! Submitted the PR. It worked for me but not 100% sure it will work on every edge case. I hope you have some tests to run.
Thanks for the PR @fingoldo, I can review the PR and work on this to get this published soon!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Thanks for the PR @fingoldo, I can review the PR and work on this to get this published soon!
Hi, is there any progress with merging? Any way I can help?
I am sorry for the delay, I was occupied with couple of other stuff. I will fix the pipeline issues over the weekend and we'll see from there!
Hi, is there any progress with merging?
Hi, is there any progress with merging?
Thanks for asking, I have adjusted the PR according to the instructions of Rajat, let's wait for his checking and hopefully approving now )
I am sorry for the delay, I was occupied with couple of other stuff. I will fix the pipeline issues over the weekend and we'll see from there!
Can we make the last effort please and finish the undertaking? )
Describe the problem you met
Hi, thanks for this useful lib, I think it can be improved even further! ;-)
I was disappointed by the lib output for this case:
Describe the solution you'd like Obviously, top_n parameter which is internal to the candidate_ranking method should be made a part of constructor. I made that adjustment locally, and unexpectedly, Cuba became a winner instead of Dubai. 2 reasons were immediately evident: 1) candidates to misspell comparisons are case-sensitive, which is not good for accuracy 2) symbols swaps are considered of the same cost as 1 addition and one deletion, which is clearly a drawback.
Therefore, optional case insensitivity and Damerau-Levenshtein distance instead of plain Levenshtein are desired. Once implemented locally, I arrived at the desired outcome (Dubai).
However, it felt as a lucky coincidence, as the probabilities coming from underlying Bert were not accounted for when making a decision. I thought it would be very natural to be able to handle them in conjunction with the textual similarities in a Bayesian fashion, where Bert probs could be our prior, cand to misspell textual sim would be B/A conditional probs of B being the correct guess given that A was typed in. To soften the impact of possible Bert miscalibration, absolute probs conversion to ranks would be desired.
Summary of my suggestions: feature request could be implemented by adding a few parameters to the class constructor:
top_n (int, optional): suggestions from underlying ANN model to be considered. Defaults to 10. lowercased_distance (bool, optional): lowercase candidates before computing edit distance. Defaults to True. damerau_distance (bool, optional): additionally account for symbol swaps when calculating a distance. Defaults to True. bayes_selection (bool, optional): use bayes reasoning when selecting the best candidate. Bert probabilities are the prior, textual similarities of candidates to the input are treated as the probabilities B/A that the correct candidate is A, while the input was B. Defaults to True. ranked_bert_probs (bool, optional): use ranked probs as opposed to the absolute probs values coming from Bert. Defaults to True.
Calling code would become:
contextualSpellCheck.add_to_pipe(nlp,config=dict(top_n=500,lowercased_distance =True,damerau_distance =True,bayes_selection =True,ranked_bert_probs =False))
After the implementation, I arrive at Dubai instead of the dot, which makes me happy )
Describe alternatives you've considered I was not able to find any alternative solution. Tried a few Spacy models, to no avail.
Additional context I used
to provide additional details for the occasional exigent user:
screenshot detailing resulting candidates rating and bayesian details
If you think the community can benefit from this new functionality, I'm happy to merge my PR.