ckiplab / han-transformers


pyo3_runtime.PanicException: AddedVocabulary bad split #1

Open kalvinchang opened 1 year ago

kalvinchang commented 1 year ago

The following code triggered pyo3_runtime.PanicException: AddedVocabulary bad split

from transformers import pipeline

classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws-xiandai")

def word_segment(sentence):
    # run the word-segmentation pipeline and collect the predicted piece for each output entry
    segmented = classifier(sentence)
    sentence = []
    for word in segmented:
        sentence.append(word['word'])
    return sentence

print(word_segment("我想去吃飯"))

(reproduced with both transformers 4.22.1 and 4.30.0)

thread '<unnamed>' panicked at 'AddedVocabulary bad split', tokenizers-lib/src/tokenizer/added_vocabulary.rs:360:22
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/kalvin/Research/speech/asru-23/utils/phrase_translate.py", line 104, in <module>
  File "/Users/kalvin/Research/speech/asru-23/utils/phrase_translate.py", line 98, in word_segment
    for word in segmented:
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/token_classification.py", line 192, in __call__
    return super().__call__(inputs, **kwargs)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1074, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1080, in run_single
    model_inputs = self.preprocess(inputs, **preprocess_params)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/token_classification.py", line 196, in preprocess
    model_inputs = self.tokenizer(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2484, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2590, in _call_one
    return self.encode_plus(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2663, in encode_plus
    return self._encode_plus(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 500, in _encode_plus
    batched_output = self._batch_encode_plus(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 427, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
pyo3_runtime.PanicException: AddedVocabulary bad split
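Based on the traceback, the panic is raised inside the fast tokenizer's encode_batch rather than in the pipeline logic, so it should be possible to narrow it down by calling the tokenizer directly. A minimal check, sketched but not run here:

from transformers import AutoTokenizer

# The traceback ends in tokenization_utils_fast.py -> self._tokenizer.encode_batch,
# so the failure lives in the fast (Rust) tokenizer itself. If this direct call also
# panics, the pipeline code can be ruled out.
tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese-ws-xiandai")
print(tokenizer("我想去吃飯"))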

weihanchen commented 1 year ago

Is there any solution?

kalvinchang commented 1 year ago

Not that I know of :/
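One thing that might be worth trying (just a sketch, I haven't verified that it works for this model): since the panic comes from the Rust fast-tokenizer library, loading the slow Python tokenizer with use_fast=False and handing it to the pipeline avoids that code path, assuming the checkpoint also loads with the slow tokenizer class.

from transformers import AutoTokenizer, pipeline

model_name = "ckiplab/bert-base-han-chinese-ws-xiandai"
# use_fast=False loads the pure-Python tokenizer, bypassing the Rust AddedVocabulary code
slow_tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
classifier = pipeline("token-classification", model=model_name, tokenizer=slow_tokenizer)
print(classifier("我想去吃飯"))

The slow tokenizer is noticeably slower on large inputs, so treat this as a stopgap rather than a fix.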

Jiahao004 commented 8 months ago

Hi, I came across the same issue after adding new vocabulary to the "bert-base-multilingual-cased" tokenizer. May I know your solution?
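For reference, the standard way to register new tokens looks roughly like the sketch below (the token strings are placeholders); I don't know yet whether the panic depends on how the tokens are added, so this is only context, not a fix.

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
# placeholder strings; the real added vocabulary goes here
num_added = tokenizer.add_tokens(["new_token_a", "new_token_b"])

model = AutoModelForTokenClassification.from_pretrained("bert-base-multilingual-cased")
model.resize_token_embeddings(len(tokenizer))  # keep the embedding matrix in sync with the enlarged vocab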