R1j1t / contextualSpellCheck

✔️Contextual word checker for better suggestions
MIT License

Bad performance for other language #76

Closed. JuanFF closed this issue 1 year ago

JuanFF commented 2 years ago

Hello, I'm trying to use the contextual spell checker for Spanish. I ran the script at https://github.com/R1j1t/contextualSpellCheck/blob/88bbbb46252c534679b185955fd88c239ed548a7/examples/ja_example.py with the following custom configuration:

```
import spacy
import contextualSpellCheck

nlp = spacy.load("es_dep_news_trf")

nlp.add_pipe(
    "contextual spellchecker",
    config={
        "model_name": "bert-base-multilingual-cased",
        "max_edit_dist": 2,
    },
)

doc = nlp("La economia a crecido un dos por ciento.")
print(doc._.performed_spellCheck)
print(doc._.outcome_spellCheck)
```

but I don't get the desired result.

"La economia a crecido un dos por ciento" should be corrected to "La economía ha crecido un dos por ciento". Instead, I get "La economia a crecido un dos por cento".

If I use another pre-trained model (e.g. "model_name": "PlanTL-GOB-ES/roberta-large-bne"), the result is still wrong: "Laeconomiaacrecidoundosporciento.??" I wonder whether I'm using the right script to run the spellchecker in another language.

R1j1t commented 2 years ago

Hi @JuanFF, I have the following 2 observations:

  1. contextualSpellCheck would be unable to change "a" to "ha". Details here
  2. The problem with "ciento" comes from the BERT model bert-base-multilingual-cased. If the user passes no vocabulary (vocab) file, contextualSpellCheck falls back to the vocab of the BERT model, and "ciento" is not in it:

    ```
    >>> from transformers import AutoTokenizer
    >>> tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
    >>> 'ciento' in tokenizer.get_vocab()
    False
    >>> doc._.suggestions_spellCheck
    {ciento: 'cento'}
    >>> # 'cento' means "hundred" in Portuguese (Brazil)
    ```
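
To make the two observations above concrete, here is a minimal sketch of the behaviour they describe. It is not the library's actual code: the toy vocabulary and the names `levenshtein`, `is_flagged`, and `within_edit_dist` are hypothetical, and it assumes that only tokens missing from the vocabulary are treated as misspellings and that replacement candidates must lie within `max_edit_dist` of the original token.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Toy vocabulary standing in for the model's vocab (assumption for illustration).
TOY_VOCAB = {"la", "a", "ha", "crecido", "un", "dos", "por", "cento"}
MAX_EDIT_DIST = 2


def is_flagged(token: str, vocab: set) -> bool:
    """Assumption: a token is treated as a misspelling only if it is out of vocabulary."""
    return token.lower() not in vocab


def within_edit_dist(token: str, candidate: str, max_dist: int = MAX_EDIT_DIST) -> bool:
    """Assumption: a candidate is kept only if it is close enough to the token."""
    return levenshtein(token, candidate) <= max_dist


print(is_flagged("a", TOY_VOCAB))            # "a" is a valid word here, so it is not checked
print(is_flagged("ciento", TOY_VOCAB))       # missing from the toy vocab, so it gets "corrected"
print(within_edit_dist("ciento", "cento"))   # the suggested replacement passes the distance filter
```

Under these assumptions, "a" can never become "ha" because it is never flagged in the first place, while "ciento" is flagged and the nearby in-vocab word "cento" is an acceptable replacement.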

If you don't want to change the BERT model, I would suggest passing the vocab file (example) separately, like this:


```
>>> vocab_path = "es_vocab.txt"
>>>
>>> nlp.add_pipe(
...     "contextual spellchecker",
...     config={
...             "model_name": "bert-base-multilingual-cased",
...             "max_edit_dist": 2,
...             "vocab_path": vocab_path
...     },
... )
testVocab.txt
inside vocab path
file opened!
Inside [unused....]
<contextualSpellCheck.contextualSpellCheck.ContextualSpellCheck object at 0x7fa607daee80>
>>> doc = nlp("La economia a crecido un dos por ciento.")
>>> print(doc._.performed_spellCheck)
True
>>> print(doc._.outcome_spellCheck)
La economia a crecido un dos por ciento.
```

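
In case it helps, a vocab file can be put together from any Spanish text. The sketch below is only an illustration: it assumes the file is plain text with one token per line (the layout of BERT's own vocab.txt), and `build_vocab_file`, the sample sentences, and the `min_count` threshold are hypothetical names and values, not part of contextualSpellCheck.

```python
import re
from collections import Counter
from pathlib import Path


def build_vocab_file(texts, out_path, min_count=1):
    """Write every word seen at least `min_count` times, one per line."""
    counts = Counter()
    for text in texts:
        # \w+ is Unicode-aware in Python 3, so accented letters are kept.
        # Lowercasing is a simplification; drop it if case must be preserved.
        counts.update(re.findall(r"\w+", text.lower()))
    words = sorted(w for w, c in counts.items() if c >= min_count)
    Path(out_path).write_text("\n".join(words) + "\n", encoding="utf-8")
    return words


# Hypothetical sample corpus; in practice use a large, clean Spanish corpus.
corpus = [
    "La economía ha crecido un dos por ciento.",
    "El ciento por ciento de los votos.",
]
words = build_vocab_file(corpus, "es_vocab.txt")
print("ciento" in words)
```

The resulting path can then be passed as "vocab_path" in the config, as in the session above.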
R1j1t commented 2 years ago

I have a pending issue https://github.com/R1j1t/contextualSpellCheck/issues/44 on a similar topic, but lately, I have been pretty occupied. If you think you can contribute, please open a PR! The project would be glad to have your contribution!

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.