R1j1t opened this issue 3 years ago
Concerning the logic: Is this a viable response?
>>> doc = nlp("This is a majour mistaken.")
>>> print(doc._.outcome_spellCheck)
This is a fact mistaken.
>>> doc = nlp("This is a majour mistake.")
>>> print(doc._.outcome_spellCheck)
This is a major mistake.
>>> doc = nlp("This is a majour mistakes.")
>>> print(doc._.outcome_spellCheck)
This is a for mistakes.
>>> doc = nlp("This is a majour misstake.")
>>> print(doc._.outcome_spellCheck)
This is a minor story.
That is not the desired response, but it is what the current logic produces. If you want to improve accuracy, please try passing the vocab file https://github.com/R1j1t/contextualSpellCheck/blob/15b30ebf5834ec099e6292d874c918db3317b2a3/contextualSpellCheck/contextualSpellCheck.py#L34-L35
This will help the model prevent false positives. Feel free to open a PR with a fix!
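For illustration, here is a minimal sketch of what passing such a vocab file could look like when adding the pipe. The config key is assumed here to be vocab_path (a file with one valid word per line); check the constructor lines linked above for the exact parameter name in your installed version.

import spacy
import contextualSpellCheck

nlp = spacy.load("en_core_web_sm")

# Assumed config key "vocab_path": extra valid words that should not be flagged,
# in addition to the transformer model's own vocab.txt.
nlp.add_pipe(
    "contextual spellchecker",
    config={
        "vocab_path": "my_extra_vocab.txt",  # hypothetical file, one word per line
        "max_edit_dist": 2,
    },
)

doc = nlp("This is a majour mistake.")
print(doc._.outcome_spellCheck)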
One side effect of using the current transformers tokenizer logic is that it supports multilingual models by default. Otherwise, I suspect different languages would require different spell checkers to handle language-specific nuances.
As mentioned in the README
This package currently focuses on Out of Vocabulary (OOV) word or non-word error (NWE) correction using BERT model.
So let's say you want to perform spell correction on a Japanese sentence: code for Japanese has been contributed to the repo in the contextualSpellCheck examples folder.
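As a rough sketch of what that setup might look like (not the exact code from the examples folder), assuming the spaCy model ja_core_news_sm and the Hugging Face model cl-tohoku/bert-base-japanese-whole-word-masking, plus their tokenizer dependencies, are installed:

import spacy
import contextualSpellCheck

# Japanese spaCy pipeline plus a Japanese BERT checkpoint for the spell checker.
nlp = spacy.load("ja_core_news_sm")
nlp.add_pipe(
    "contextual spellchecker",
    config={"model_name": "cl-tohoku/bert-base-japanese-whole-word-masking"},
)

doc = nlp("これは大きな間違いです。")  # "This is a big mistake."
print(doc._.performed_spellCheck, doc._.outcome_spellCheck)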
I hope this answers your question @kshitij12345. Please feel free to provide ideas or references if you think I have missed something here!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
The current logic of misspell identification relies on vocab.txt from the transformer model. For less common words, the tokenizer breaks the word into subwords, and hence the original whole word might not be present in vocab.txt.
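A small, self-contained illustration of that point with the Hugging Face tokenizer (bert-base-cased is just an example checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
vocab = tokenizer.get_vocab()  # token -> id mapping built from vocab.txt

for word in ["mistake", "pseudogene"]:
    print(word, "| in vocab.txt:", word in vocab, "| pieces:", tokenizer.tokenize(word))

# On typical BERT checkpoints a common word like "mistake" is a single vocabulary
# entry, while a rarer word like "pseudogene" is split into subword pieces, so the
# whole word is absent from vocab.txt even though it is spelled correctly.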
Hi @R1j1t,
First of all, congratulations on your Contextual Spell Checker (CSC) based on spaCy and BERT (transformer model).
As I'm searching for this kind of tool, I tested your CSC and I can give the following feedback:
# Installation
!pip install -U pip setuptools wheel
!pip install -U spacy
!pip install contextualSpellCheck
# spaCy model in Portuguese
spacy_model = "pt_core_news_md" # 48MB, or "pt_core_news_sm" (20MB), or "pt_core_news_lg" (577MB)
!python -m spacy download {spacy_model}
# BERT model in Portuguese
model_name = "neuralmind/bert-base-portuguese-cased" # or "neuralmind/bert-large-portuguese-cased"
# Import packages and load the spaCy model
import spacy
import contextualSpellCheck
nlp = spacy.load(spacy_model)
# Download BERT model and add contextual spellchecker to the spaCy model
nlp.add_pipe(
    "contextual spellchecker",
    config={
        "model_name": model_name,
        "max_edit_dist": 2,
    },
)
# Sentence with errors ("milões" instead of "milhões")
sentence = "A receita foi de $ 9,4 milões em comparação com o ano anterior de $ 2,7 milões."
# Get sentence with corrections (if errors found by CSC)
doc = nlp(sentence)
print(f'({doc._.performed_spellCheck}) {doc._.outcome_spellCheck}')
# (True) A receita foi de $ 9,4 milhões em comparação com o ano anterior de $ 2,7 milhões.
sentence = "a horta abdominal" # the correct sentence in Portuguese is "aorta abdominal"
doc = nlp(sentence)
print(f'({doc._.performed_spellCheck}) {doc._.outcome_spellCheck}')
# (False)
# the CSC did not find corrected words with an edit distance < max_edit_dist
That is the true issue, I think (i.e., using a BERT model). In fact, by using BERT models, I do not see how your CSC will be able to correct words rather than replace them. It is true you can pass a huge vocab file that will allow detecting most misspelled words, but as already said, your CSC will only be able to replace them with a single token from the BERT tokenizer vocab (a token is not necessarily a word in BERT's WordPiece tokenizer, which uses subwords as tokens). This means a "solution" would be to use fine-tuned BERT models with a gigantic vocabulary (in order to have whole words instead of subwords). Unfortunately, that kind of fine-tuning would require a huge corpus of text, and even then, your CSC spell checker would remain a unigram one.

Could you consider exploring another type of transformer model, such as T5 (or ByT5), which has a seq2seq architecture (a BERT-like encoder and a GPT-like decoder) allowing input and output sentences of different lengths?
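To make that suggestion concrete, here is a hedged sketch of the seq2seq idea, independent of contextualSpellCheck. The checkpoint name is a placeholder for any T5/ByT5-style model fine-tuned on (noisy sentence -> clean sentence) pairs, which would have to be trained or found separately:

from transformers import pipeline

# A text-to-text model rewrites the whole sentence, so corrections are not limited
# to a single vocabulary token per [MASK] position.
corrector = pipeline(
    "text2text-generation",
    model="your-org/t5-spelling-correction",  # hypothetical fine-tuned checkpoint
)

result = corrector("A receita foi de $ 9,4 milões em comparação com o ano anterior.")
print(result[0]["generated_text"])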
Hey @piegu, first of all I want to thank you for your feedback. It feels terrific to have contributors, and even more so ones who help shape the logic! When I started this project, I wanted the library to be generalized for multiple languages, hence the spaCy and BERT approach. I created tasks for myself (#44, #40), and I would like to read more on these topics. But lately, I have been occupied with my day job and have limited my contributions to contextualSpellCheck.
Regarding your second point, I agree it is something I did not know. As pointed out in the comment by sgugger:
For this task, you need to either use a different model (coded yourself as it's not present in the library) or have your training set contain one [MASK] per token you want to mask. For instance if you want to mask all the tokens corresponding to one word (a technique called whole-word masking) what is typically done in training scripts is to replace all parts of one word by [MASK]. For pseudogener tokenized as pseudo, ##gene, that would mean having [MASK] [MASK].
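(As an editorial aside, here is a rough sketch of that whole-word masking idea with a masked language model; this is not code from the package, and bert-base-cased is just an example checkpoint.)

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

text = "This is a majour mistake."
pieces = tokenizer.tokenize(text)
word_pieces = tokenizer.tokenize("majour")  # e.g. ['ma', '##jou', '##r']

# Replace every subword piece of the misspelled word with [MASK] (whole-word masking,
# located here by a simplistic position match).
start = pieces.index(word_pieces[0])
masked = pieces[:start] + [tokenizer.mask_token] * len(word_pieces) + pieces[start + len(word_pieces):]

input_ids = torch.tensor(
    [tokenizer.build_inputs_with_special_tokens(tokenizer.convert_tokens_to_ids(masked))]
)
with torch.no_grad():
    logits = model(input_ids).logits

# Top candidates for each masked position (still subword pieces, not whole words).
for pos in (input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]:
    top_ids = logits[0, pos].topk(3).indices.tolist()
    print(tokenizer.convert_ids_to_tokens(top_ids))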
I would still want to depend on transformer models, as that adds the functionality of multilingual support. I will try to experiment with your suggestions and think of a solution myself.
Hope you like the project. Feel free to contribute!
I noticed that part of the logic of misspell_identify is:
misspell = []
for token in docCopy:
    if (
        (token.text.lower() not in self.vocab)
        # ... (further conditions omitted)
    ):
        misspell.append(token)
Would changing token.text.lower() to token.lemma_.lower() improve accuracy? According to https://spacy.io/api/lemmatizer, "as of v3.0, the Lemmatizer is a standalone pipeline component that can be added to your pipeline, and not a hidden part of the vocab that runs behind the scenes. This makes it easier to customize how lemmas should be assigned in your pipeline." So the __contains__ check on self.vocab will not convert a token to its base form; we have to get the base form from token.lemma_.
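A quick way to check this idea (assuming en_core_web_sm is downloaded) is to compare token.text with token.lemma_:

import spacy

nlp = spacy.load("en_core_web_sm")
for token in nlp("This is a majour mistakes."):
    print(f"{token.text:10} -> {token.lemma_}")

# "mistakes" lemmatizes to "mistake": the base form is more likely to appear in
# vocab.txt than every inflected variant, so checking the lemma could reduce false
# positives for valid inflections.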
Is your feature request related to a problem? Please describe.
The current logic of misspelling identification relies on vocab.txt from the transformer model. BERT tokenizers break less common words into subwords and store only those subwords in vocab.txt. Hence the original word might not be present in vocab.txt and gets identified as misspelt.

Describe the solution you'd like
Still not clear; I need to look into some papers on this.
Describe alternatives you've considered
The alternatives I can think of right now are two-fold:
Additional context
#30, https://github.com/explosion/spaCy/issues/3994