Open · kalvinchang opened this issue 1 year ago
Is there any solution?
Not that I know of :/
Hi, I came across the same issue after adding new vocabulary to the "bert-base-multilingual-cased" tokenizer. May I know your solution?
I had the same issue as well, but my base model is "bart-base-chinese".
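For reference, here is a minimal sketch of the setup both comments above describe, assuming the vocabulary was added with add_tokens; the token strings are placeholders, not the actual vocabulary anyone in this thread used.

```python
# Minimal sketch (placeholder tokens, not anyone's real vocabulary): load a
# fast tokenizer, add new vocabulary, then encode text with it.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
tokenizer.add_tokens(["新詞", "自訂詞"])  # hypothetical added vocabulary

# Encoding is the call where the Rust-side "AddedVocabulary bad split" panic
# surfaces on the affected versions; whether it actually triggers depends on
# the specific added tokens and the tokenizer's normalization.
encoding = tokenizer("這句話包含新詞和自訂詞")
print(encoding.tokens())
```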
The following code triggered pyo3_runtime.PanicException: AddedVocabulary bad split (on both transformers 4.22.1 and 4.30.0).
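The exact script isn't shown in the thread, but the traceback points at a token-classification pipeline used for word segmentation, backed by a fast tokenizer that has had vocabulary added. The sketch below only reconstructs that call pattern; the model name, added tokens, and input text are placeholders, not the contents of phrase_translate.py.

```python
# Hedged reconstruction of the call pattern implied by the traceback below;
# model name, added tokens, and input text are placeholders.
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_name = "bert-base-multilingual-cased"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.add_tokens(["新詞"])  # hypothetical added vocabulary

model = AutoModelForTokenClassification.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))  # account for the new tokens

segmenter = pipeline("token-classification", model=model, tokenizer=tokenizer)

# The pipeline call ends up in self._tokenizer.encode_batch(...), the frame
# where the Rust panic ("AddedVocabulary bad split") is raised below.
segmented = segmenter("一段需要分詞的文本")
for word in segmented:
    print(word)
```

Full traceback: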
```
thread '<unnamed>' panicked at 'AddedVocabulary bad split', tokenizers-lib/src/tokenizer/added_vocabulary.rs:360:22
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/kalvin/Research/speech/asru-23/utils/phrase_translate.py", line 104, in <module>
  File "/Users/kalvin/Research/speech/asru-23/utils/phrase_translate.py", line 98, in word_segment
    for word in segmented:
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/token_classification.py", line 192, in __call__
    return super().__call__(inputs, **kwargs)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1074, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1080, in run_single
    model_inputs = self.preprocess(inputs, **preprocess_params)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/pipelines/token_classification.py", line 196, in preprocess
    model_inputs = self.tokenizer(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2484, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2590, in _call_one
    return self.encode_plus(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2663, in encode_plus
    return self._encode_plus(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 500, in _encode_plus
    batched_output = self._batch_encode_plus(
  File "/Users/kalvin/opt/miniconda3/envs/phonology/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 427, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
pyo3_runtime.PanicException: AddedVocabulary bad split
```