R1j1t / contextualSpellCheck

✔️Contextual word checker for better suggestions
MIT License

[BUG] Sentence context greater than 512 character #64

Open xei opened 3 years ago

xei commented 3 years ago

I tried to correct spelling mistakes in a large text.

import spacy
import contextualSpellCheck

spacy_nlp = spacy.load(
    'en_core_web_sm',
    # disable=['ner']
    disable=['parser', 'ner']  # disable extra components for efficiency
)
contextualSpellCheck.add_to_pipe(spacy_nlp)

corpus_spacy = [spacy_nlp(doc) for doc in corpus_raw]

At first, I faced this error: ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe('sentencizer'). Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting doc[i].is_sent_start.

So, I added the sentencizer component to the pipeline.

import spacy
import contextualSpellCheck

spacy_nlp = spacy.load(
    'en_core_web_sm',
    # disable=['ner']
    disable=['parser', 'ner']  # disable extra components for efficiency
)
spacy_nlp.add_pipe('sentencizer')
contextualSpellCheck.add_to_pipe(spacy_nlp)

corpus_spacy = [spacy_nlp(doc) for doc in corpus_raw]

This time I faced this error: RuntimeError: The expanded size of the tensor (837) must match the existing size (512) at non-singleton dimension 1. Target sizes: [1, 837]. Tensor sizes: [1, 512]

I guess this is due to the limitations of BERT. However, I believe that there should be a way to catch this error and bypass the spell check.
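A minimal sketch of the "catch and bypass" idea, assuming the over-length input surfaces as a `RuntimeError` as in the traceback above. The helper name `process_with_fallback` and its two callable parameters are hypothetical and not part of contextualSpellCheck or spaCy:

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def process_with_fallback(
    primary: Callable[[str], T],
    fallback: Callable[[str], T],
    text: str,
) -> T:
    """Run primary(text); if it raises RuntimeError, run fallback(text) instead.

    Only RuntimeError is caught, because that is the error type reported
    for inputs that exceed the model's 512-position limit. Any other
    exception still propagates to the caller.
    """
    try:
        return primary(text)
    except RuntimeError:
        # Assumption: the failure came from the spell-check step, so the
        # text is re-processed by a pipeline that does not include it.
        return fallback(text)
```

With spaCy, `primary` could be the pipeline that has contextualSpellCheck added and `fallback` a second pipeline loaded without it, since a loaded pipeline is itself callable on a string. The cost is that a failing document is processed twice.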

R1j1t commented 3 years ago

Thanks @xei for reporting this issue. BERT has a limit of 512 tokens, and the model currently used for inference was trained with a maximum sequence length of 512 tokens REF.

Also, I am not sure what corpus_raw looks like. But 512 tokens should be enough for most cases, because the spell check uses only the sentence containing a word as context, not the entire corpus.

For example:

```python
>>> import spacy
>>> import contextualSpellCheck
>>> spacy_nlp = spacy.load(
...     'en_core_web_sm',
...     # disable=['ner']
...     disable=['parser', 'ner']  # disable extra components for efficiency
... )
>>> spacy_nlp.add_pipe('sentencizer')
>>> contextualSpellCheck.add_to_pipe(spacy_nlp)
>>> corpus_raw = """The train from the west that bore Bert Bryant to New York was two hours late, for all the way from Clinton, Ohio, where Bert lived, the snow had been from four inches to a foot in depth. Consequently he had missed the one o’clock train for Mt. Pleasant and had spent an hour with his face glued to a waiting-room window watching the bustle and confusion of New York. Now, at four o’clock, he was seated in a sleigh, his suit-case between his feet, winding up the long, snowy road to Mt. Pleasant Academy. In the front seat was the fur-clad driver and beside him was Bert’s small trunk. It was very cold and fast growing dark. It seemed to Bert that they had been driving for miles and miles, and he wanted to ask the driver how much farther they had to go. But the man in the old bearskin coat was cross and taciturn, and so Bert buried his hands still deeper in his pockets and wondered whether his nose and ears were getting white. And just when he had decided that they were the sleigh left the main road with a sudden lurch, that almost toppled the trunk off, and turned through a gate and up a curving drive lined with snow-laden evergreens. Then the academy came into view, a rambling, comfortable-looking building with many cheerfully lighted windows looking out in welcome. At one of the windows two faces appeared in response to the warning of the sleigh bells and peered curiously down. The sleigh pulled up in front of a broad stone step and Bert clambered out, bag in hand.
... The driver lifted the trunk, opened the big oak door without ceremony, deposited his burden just inside and growled: “Fifty cents.”"""
>>> doc = spacy_nlp(corpus_raw)
>>> doc._.suggestions_spellCheck
{Bert: 'Bert', Bryant: 'back', York: 'York', Clinton: 'Canton', Ohio: 'Ohio', Bert: 'he', bustle: 'noise', York: 'York', sleigh: 'seat', snowy: 'dusty', Bert: 'Ben', Bert: 'Bond', bearskin: 'black', taciturn: 'stern', Bert: 'he', sleigh: 'pair', lurch: 'turn', toppled: 'ripped', evergreens: 'trees', rambling: 'big', cheerfully: 'carefully', lighted: 'painted', sleigh: 'church', sleigh: 'coach', Bert: 'Ben', clambered: 'climbed'}
```

As you can see above, the entire text passed through the spaCy pipeline without any error. The sample text is taken from The Project Gutenberg eBook of The Junior Trophy, by Ralph Henry Barbour REF.

One more thing I wanted to point out: contextualSpellCheck needs both parser and ner, as mentioned here:

We require NER to identify if a token is a PERSON also require parser because we use Token.sent for context

Please let me know if you have any questions. I think your suggestion is great, and I will have to think of a solution that either splits a large sentence (> max_position_embeddings) or bypasses the spell check altogether. If you would like to contribute this feature, feel free to create a PR!
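For anyone picking this up, here is a rough sketch of the splitting option. It is not how contextualSpellCheck works today. The function name `split_into_windows` is made up for illustration, and it assumes the sentence has already been converted to a list of model tokens:

```python
from typing import List, Sequence


def split_into_windows(tokens: Sequence[str], max_len: int) -> List[List[str]]:
    """Split tokens into consecutive windows of at most max_len items.

    max_len is an assumption supplied by the caller. For a BERT model with
    max_position_embeddings = 512 it would likely be 510, which leaves room
    for the [CLS] and [SEP] special tokens.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    # Step through the sequence in strides of max_len; the last window
    # holds whatever remains and may be shorter than the others.
    return [list(tokens[i:i + max_len]) for i in range(0, len(tokens), max_len)]
```

For the 837-token input from the error above, `split_into_windows(tokens, 510)` gives two windows of 510 and 327 tokens. Hard cuts like this lose context at the window boundary, so overlapping windows or cutting at punctuation may give better suggestions.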