Words being corrected ##ts [BUG]

nicno90 commented 4 years ago

Describe the bug

Words tagged as incorrect are replaced with a token that contains hash marks (for example ##ts).

To Reproduce

#Steps to reproduce the behavior:
>>> import spacy
>>> nlp = spacy.load('en_core_web_lg', disable=['tagger'])
>>> from contextualSpellCheck import ContextualSpellCheck
2020-10-14 10:24:16.775668: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
>>> merge_ents = nlp.create_pipe("merge_entities")
>>> nlp.add_pipe(merge_ents)
>>> spell_checker = ContextualSpellCheck(max_edit_dist=3)
>>> nlp.add_pipe(spell_checker)
>>> sent = 'Everyone has to help to fix the problems of society. There has to be more training, more opportunity to bridge the gap between the haves and the have nots.'
>>> doc = nlp(sent)
>>> correct = doc._.outcome_spellCheck
>>> correct
'Everyone has to help to fix the problems of society. There has to be more training, more opportunity to bridge the gap between the have and the have ##ts.'

Expected behavior

'Everyone has to help to fix the problems of society. There has to be more training, more opportunity to bridge the gap between the have and the have nots.'

or

'Everyone has to help to fix the problems of society. There has to be more training, more opportunity to bridge the gap between the have and the have not.'

Version:

contextualSpellCheck 0.3.0
Spacy: 2.3.2
transformers 3.3.1

Additional information

I checked vocab.txt and it has entries with ## in them. I am wondering what these are needed for.
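For readers with the same question: assuming vocab.txt belongs to a BERT-style WordPiece tokenizer, entries that begin with ## are continuation pieces, meaning they can only appear after the start of a word. The sketch below is a simplified greedy longest-match tokenizer over a made-up toy vocabulary (`toy_vocab` and `wordpiece_tokenize` are illustrative names, not part of contextualSpellCheck or transformers); it is meant only to show where such pieces come from.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece-style pieces.

    Simplified illustration: pieces after the first carry a '##' prefix.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits, so the whole word is treated as unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example.
toy_vocab = {"have", "not", "no", "##ts", "##s"}

print(wordpiece_tokenize("nots", toy_vocab))   # ['not', '##s']
print(wordpiece_tokenize("haves", toy_vocab))  # ['have', '##s']
```

A masked language model predicts one vocabulary entry per position, so a continuation piece such as ##ts can be proposed as a "correction" unless such candidates are filtered or merged back into the preceding token.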

R1j1t commented 4 years ago

Thanks for reporting it, @nicno90. I will look into it and see why this is happening.

ajay-sreeram commented 4 years ago

@R1j1t, it would also be nice if top_n were configurable.

saheel1115 commented 3 years ago

I am also getting similar errors. Getting ##net and ##ER as corrections.

R1j1t commented 3 years ago

@saheel1115 can you send me the following:

Input sentence:
Output:
Expected:

I will try to fix it by this weekend.

R1j1t commented 3 years ago

@nicno90 I have fixed this issue in PR #36. So for your input:

Input: Everyone has to help to fix the problems of society. There has to be more training, more opportunity to bridge the gap between the haves and the have nots.

Output: Everyone has to help to fix the problems of society. There has to be more training, more opportunity to bridge the gap between the have and the havets.

This is the output you will get after the detokenization fix in my code. Regarding your question about such tokens (##x), please have a look here: https://github.com/google/sentencepiece
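To illustrate what detokenization means here, the following is a minimal sketch, not the code from the PR: it assumes WordPiece-style tokens where a ## prefix marks a piece that continues the previous token, and it uses a hypothetical helper name, `merge_wordpieces`.

```python
def merge_wordpieces(tokens):
    """Attach each '##'-prefixed piece to the token before it."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            # Continuation piece: drop the marker and glue it on.
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words


print(merge_wordpieces(["the", "have", "##ts"]))  # ['the', 'havets']
```

Merging this way removes the stray ## marker and is consistent with the "havets" shown in the output above. It does not make the suggestion itself a valid word; that would need the candidate pieces to be filtered or ranked differently.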

I will release the latest package to pip over the weekend. Thanks for pointing out this issue, and please feel free to contribute!

R1j1t commented 3 years ago

Latest package released on PyPI!

Release: https://github.com/R1j1t/contextualSpellCheck/releases/tag/v0.3.3
PyPI link: https://pypi.org/project/contextualSpellCheck/

R1j1t / contextualSpellCheck

Words being corrected ##ts [BUG] #30