R1j1t opened this issue 3 years ago
Concerning the logic: Is this a viable response?
>>> doc = nlp("This is a majour mistaken.")
>>> print(doc._.outcome_spellCheck)
This is a fact mistaken.
>>> doc = nlp("This is a majour mistake.")
>>> print(doc._.outcome_spellCheck)
This is a major mistake.
>>> doc = nlp("This is a majour mistakes.")
>>> print(doc._.outcome_spellCheck)
This is a for mistakes.
>>> doc = nlp("This is a majour misstake.")
>>> print(doc._.outcome_spellCheck)
This is a minor story.
That is not the desired response, but it is what the current logic produces. If you want to improve accuracy, please try passing the vocab file https://github.com/R1j1t/contextualSpellCheck/blob/15b30ebf5834ec099e6292d874c918db3317b2a3/contextualSpellCheck/contextualSpellCheck.py#L34-L35
This will help the model prevent false positives. Feel free to open a PR with a fix!
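For illustration, here is a minimal sketch of what passing such a vocab file could look like when adding the pipe. The config key is assumed here to be vocab_path (a file with one valid word per line); check the constructor lines linked above for the exact parameter name in your installed version.

import spacy
import contextualSpellCheck

nlp = spacy.load("en_core_web_sm")

# Assumed config key "vocab_path": extra valid words that should not be flagged,
# in addition to the transformer model's own vocab.txt.
nlp.add_pipe(
    "contextual spellchecker",
    config={
        "vocab_path": "my_extra_vocab.txt",  # hypothetical file, one word per line
        "max_edit_dist": 2,
    },
)

doc = nlp("This is a majour mistake.")
print(doc._.outcome_spellCheck)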
One side effect of using the current transformers tokenizer logic is that it supports multilingual models by default. Otherwise, I suspect different languages would require different spell checkers to handle language-specific nuances.
As mentioned in the README
This package currently focuses on Out of Vocabulary (OOV) word or non-word error (NWE) correction using BERT model.
So let's say you want to perform spell correction on a Japanese sentence: code for Japanese has been contributed to the repo in the contextualSpellCheck examples folder.
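As a rough sketch of what that setup might look like (not the exact code from the examples folder), assuming the spaCy model ja_core_news_sm and the Hugging Face model cl-tohoku/bert-base-japanese-whole-word-masking, plus their tokenizer dependencies, are installed:

import spacy
import contextualSpellCheck

# Japanese spaCy pipeline plus a Japanese BERT checkpoint for the spell checker.
nlp = spacy.load("ja_core_news_sm")
nlp.add_pipe(
    "contextual spellchecker",
    config={"model_name": "cl-tohoku/bert-base-japanese-whole-word-masking"},
)

doc = nlp("これは大きな間違いです。")  # "This is a big mistake."
print(doc._.performed_spellCheck, doc._.outcome_spellCheck)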
I hope this answers your question @kshitij12345. Please feel free to provide ideas or references if you think I have missed something here!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
The current logic of misspell identification relies on vocab.txt from the transformer model. For less common words, the tokenizer breaks the word into subwords, and hence the original whole word might not be present in vocab.txt.
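A small, self-contained illustration of that point with the Hugging Face tokenizer (bert-base-cased is just an example checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
vocab = tokenizer.get_vocab()  # token -> id mapping built from vocab.txt

for word in ["mistake", "pseudogene"]:
    print(word, "| in vocab.txt:", word in vocab, "| pieces:", tokenizer.tokenize(word))

# On typical BERT checkpoints a common word like "mistake" is a single vocabulary
# entry, while a rarer word like "pseudogene" is split into subword pieces, so the
# whole word is absent from vocab.txt even though it is spelled correctly.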
Hi @R1j1t,
First of all, congratulations on your Contextual Spell Checker (CSC) based on spaCy and BERT (transformer model).
As I'm searching for this kind of tool, I tested your CSC and I can give the following feedback:
# Installation
!pip install -U pip setuptools wheel
!pip install -U spacy
!pip install contextualSpellCheck
# spaCy model in Portuguese
spacy_model = "pt_core_news_md" # 48MB, or "pt_core_news_sm" (20MB), or "pt_core_news_lg" (577MB)
!python -m spacy download {spacy_model}
# BERT model in Portuguese
model_name = "neuralmind/bert-base-portuguese-cased" # or "neuralmind/bert-large-portuguese-cased"
# Import packages and load the spaCy model
import spacy
import contextualSpellCheck
nlp = spacy.load(spacy_model)
# Download BERT model and add contextual spellchecker to the spaCy model
nlp.add_pipe(
    "contextual spellchecker",
    config={
        "model_name": model_name,
        "max_edit_dist": 2,
    },
)
# Sentence with errors ("milões" instead of "milhões")
sentence = "A receita foi de $ 9,4 milões em comparação com o ano anterior de $ 2,7 milões."
# Get sentence with corrections (if errors found by CSC)
doc = nlp(sentence)
print(f'({doc._.performed_spellCheck}) {doc._.outcome_spellCheck}')
# (True) A receita foi de $ 9,4 milhões em comparação com o ano anterior de $ 2,7 milhões.
sentence = "a horta abdominal" # the correct sentence in Portuguese is "aorta abdominal"
doc = nlp(sentence)
print(f'({doc._.performed_spellCheck}) {doc._.outcome_spellCheck}')
# (False)
# the CSC did not find corrected words with an edit distance < max_edit_dist
That is the true issue, I think (i.e., using a BERT model). In fact, by using BERT models, I do not see how your CSC will be able to correct words rather than replace them. It is true you can pass a huge vocab file that will allow detecting most misspelled words, but as already said, your CSC will only be able to replace them with a single token from the BERT tokenizer vocab (a token is not necessarily a word in BERT's WordPiece tokenizer, which uses subwords as tokens). This means a "solution" would be to use fine-tuned BERT models with a gigantic vocabulary (in order to have whole words instead of subwords). Unfortunately, that kind of fine-tuning would require a huge corpus of text, and even then, your CSC spell checker would remain a unigram one.

Could you consider exploring another type of transformer model, such as T5 (or ByT5), which has a seq2seq architecture (a BERT-like encoder and a GPT-like decoder) allowing input and output sentences of different lengths?
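To make that suggestion concrete, here is a hedged sketch of the seq2seq idea, independent of contextualSpellCheck. The checkpoint name is a placeholder for any T5/ByT5-style model fine-tuned on (noisy sentence -> clean sentence) pairs, which would have to be trained or found separately:

from transformers import pipeline

# A text-to-text model rewrites the whole sentence, so corrections are not limited
# to a single vocabulary token per [MASK] position.
corrector = pipeline(
    "text2text-generation",
    model="your-org/t5-spelling-correction",  # hypothetical fine-tuned checkpoint
)

result = corrector("A receita foi de $ 9,4 milões em comparação com o ano anterior.")
print(result[0]["generated_text"])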
Hey @piegu, first of all I want to thank you for your feedback. It feels terrific to have contributors, and even more so ones who help shape the logic! When I started this project, I wanted the library to be generalized for multiple languages, hence the spaCy and BERT approach. I created tasks for myself (#44, #40), and I would like to read more on these topics. But lately, I have been occupied with my day job and have limited my contributions to contextualSpellCheck.
Regarding your second point, I agree it is something I did not know. As pointed out in the comment by sgugger:
For this task, you need to either use a different model (coded yourself as it's not present in the library) or have your training set contain one [MASK] per token you want to mask. For instance if you want to mask all the tokens corresponding to one word (a technique called whole-word masking) what is typically done in training scripts is to replace all parts of one word by [MASK]. For pseudogener tokenized as pseudo, ##gene, that would mean having [MASK] [MASK].
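(As an editorial aside, here is a rough sketch of that whole-word masking idea with a masked language model; this is not code from the package, and bert-base-cased is just an example checkpoint.)

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

text = "This is a majour mistake."
pieces = tokenizer.tokenize(text)
word_pieces = tokenizer.tokenize("majour")  # e.g. ['ma', '##jou', '##r']

# Replace every subword piece of the misspelled word with [MASK] (whole-word masking,
# located here by a simplistic position match).
start = pieces.index(word_pieces[0])
masked = pieces[:start] + [tokenizer.mask_token] * len(word_pieces) + pieces[start + len(word_pieces):]

input_ids = torch.tensor(
    [tokenizer.build_inputs_with_special_tokens(tokenizer.convert_tokens_to_ids(masked))]
)
with torch.no_grad():
    logits = model(input_ids).logits

# Top candidates for each masked position (still subword pieces, not whole words).
for pos in (input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]:
    top_ids = logits[0, pos].topk(3).indices.tolist()
    print(tokenizer.convert_ids_to_tokens(top_ids))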
I would still want to depend on transformer models, as that adds the functionality of multilingual support. I will try to experiment with your suggestions and think of a solution myself.
Hope you like the project. Feel free to contribute!
I noticed that part of the logic of misspell_identify is:
misspell = []
for token in docCopy:
    if (
        (token.text.lower() not in self.vocab)
        # ... (further conditions omitted)
    ):
        misspell.append(token)
Would changing token.text.lower() to token.lemma_.lower() improve accuracy? According to https://spacy.io/api/lemmatizer, "as of v3.0, the Lemmatizer is a standalone pipeline component that can be added to your pipeline, and not a hidden part of the vocab that runs behind the scenes. This makes it easier to customize how lemmas should be assigned in your pipeline." So the __contains__ check on self.vocab will not convert a token to its base form; we have to get the base form from token.lemma_.
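A quick way to check this idea (assuming en_core_web_sm is downloaded) is to compare token.text with token.lemma_:

import spacy

nlp = spacy.load("en_core_web_sm")
for token in nlp("This is a majour mistakes."):
    print(f"{token.text:10} -> {token.lemma_}")

# "mistakes" lemmatizes to "mistake": the base form is more likely to appear in
# vocab.txt than every inflected variant, so checking the lemma could reduce false
# positives for valid inflections.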
Is your feature request related to a problem? Please describe.
The current logic of misspelling identification relies on vocab.txt from the transformer model. BERT tokenizers break less common words into subwords and store only those subwords in vocab.txt. Hence the original word might not be present in vocab.txt and gets identified as misspelt.

Describe the solution you'd like
Still not clear; I need to look into some papers on this.
Describe alternatives you've considered
The alternatives I can think of right now are two-fold:
Additional context
#30, https://github.com/explosion/spaCy/issues/3994