dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

[Error Message] Improve error message in SentencepieceTokenizer when arguments are not expected. #1449

Open preeyank5 opened 3 years ago

preeyank5 commented 3 years ago

Description

When calling tokenizers.create with the model and vocab files for a custom corpus, the code raises an error and the BERT vocab file cannot be generated.

Error Message

ValueError: Mismatch vocabulary! All special tokens specified must be control tokens in the sentencepiece vocabulary.

To Reproduce

from gluonnlp.data import tokenizers
tokenizers.create('spm', model_path='lsw1/spm.model', vocab_path='lsw1/spm.vocab')

spm.zip

sxjscience commented 3 years ago

Actually I can load the model:

import gluonnlp
from gluonnlp.data.tokenizers import SentencepieceTokenizer
tokenizer = SentencepieceTokenizer(model_path='spm.model', vocab='spm.vocab')
print(tokenizer)

Output:

SentencepieceTokenizer(
   model_path = /home/ubuntu/spm.model
   lowercase = False, nbest = 0, alpha = 0.0
   vocab = Vocab(size=3500, unk_token="<unk>", bos_token="<s>", eos_token="</s>", pad_token="<pad>")
)

@preeyank5 Would you try again?

sxjscience commented 3 years ago

The root cause is that we need better error handling of **kwargs here. The argument should be vocab rather than vocab_path; because vocab_path is not a named parameter, it has been collected into **kwargs.

The way to fix the issue is to revise https://github.com/dmlc/gluon-nlp/blob/08dc6ed8f38f6c2576836e2352ea5ee4168eb413/src/gluonnlp/data/tokenizers/sentencepiece.py#L99-L101
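To illustrate why the misspelled argument does not fail immediately, here is a minimal sketch of how Python collects unknown keywords into **kwargs. The function load_tokenizer_sketch is a hypothetical stand-in, not the gluonnlp constructor; only the parameter names model_path and vocab are taken from this thread.

```python
def load_tokenizer_sketch(model_path=None, vocab=None, **kwargs):
    # Hypothetical stand-in for a constructor that accepts **kwargs.
    # Any keyword that is not a named parameter is collected into kwargs
    # instead of triggering an "unexpected keyword argument" error.
    return {"model_path": model_path, "vocab": vocab, "extra": kwargs}


result = load_tokenizer_sketch(model_path="spm.model", vocab_path="spm.vocab")
print(result["vocab"])  # None, because vocab was never set
print(result["extra"])  # {'vocab_path': 'spm.vocab'}
```

The vocab parameter keeps its default while vocab_path travels on as an extra keyword, so whatever later consumes kwargs sees a value it does not expect. That would be consistent with the unrelated-looking "Mismatch vocabulary" message above.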

sxjscience commented 3 years ago

Marked it as a "good first issue" because it's a good issue for early contributors. We can just ensure that the correct error is raised when kwargs contains unexpected values.

preeyank5 commented 3 years ago

Thanks, Xingjian. I am now able to load the model.

sxjscience commented 3 years ago

Let's keep this issue open to track the error message. We should raise an error if the user has specified unexpected kwargs.
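For anyone picking this up, one possible shape for such a check is sketched below. It is not the actual gluonnlp code: create_tokenizer and check_special_token_kwargs are hypothetical names, and the rule that legitimate extra keywords end in "_token" is an assumption made for illustration.

```python
from typing import Any, Dict

# Named parameters mentioned in this thread; the real constructor may accept more.
KNOWN_ARGS = ("model_path", "vocab")


def check_special_token_kwargs(kwargs: Dict[str, Any]) -> None:
    """Reject keywords that cannot be special tokens.

    Assumption: special tokens are passed as keywords ending in "_token".
    """
    unexpected = sorted(k for k in kwargs if not k.endswith("_token"))
    if unexpected:
        raise TypeError(
            "Unexpected keyword argument(s): {}. Named arguments include {}; "
            "special tokens must use names ending in '_token'.".format(
                ", ".join(unexpected), ", ".join(KNOWN_ARGS)))


def create_tokenizer(model_path=None, vocab=None, **kwargs):
    # Hypothetical stand-in for the tokenizer constructor: validate first,
    # then treat the remaining keywords as special tokens.
    check_special_token_kwargs(kwargs)
    return {"model_path": model_path, "vocab": vocab, "special_tokens": kwargs}


try:
    create_tokenizer(model_path="spm.model", vocab_path="spm.vocab")
except TypeError as err:
    print(err)  # the message names vocab_path explicitly
```

Validating before the vocabulary is touched means the user sees the offending argument name rather than a vocabulary mismatch. TypeError is used here because that is what Python itself raises for an unexpected keyword argument; ValueError would also be defensible.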

ConaGo commented 3 years ago

Hi, I am new to this project and would like to tackle this issue.

Abdullium commented 1 year ago

Hi, I am new to this project and would like to tackle this issue.

Have you solved it yet?