Open jbauer2718 opened 7 months ago
If someone adds me as a contributor, I am happy to fix this issue and write a test for it.
@jbauer2718 many thanks for reporting this issue and offering to fix it. You can create a PR based on changes from your fork and we can look at it.
Hey @sakoush, just added the above-linked PR for the team's review.
Because Japanese mixes phonetic scripts and Chinese characters, tokenizers for Japanese models need special algorithms and dictionaries. A popular example is the BERT Japanese model:
https://huggingface.co/transformers/v4.11.3/_modules/transformers/models/bert_japanese/tokenization_bert_japanese.html
Without these dependencies, mlserver_huggingface/common.py errors when trying to load the tokenizer in the pipeline.
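A quick way to check whether an environment has these dependencies is sketched below. The module names `fugashi` and `ipadic` are my reading of what the linked BERT Japanese tokenizer imports by default; other Japanese models may need different packages, so treat the list as an assumption.

```python
from importlib.util import find_spec

# Assumed module names, based on the linked BERT Japanese tokenizer source.
# Other Japanese tokenizers may import different packages.
JAPANESE_TOKENIZER_MODULES = ("fugashi", "ipadic")


def missing_modules(names=JAPANESE_TOKENIZER_MODULES):
    """Return the top-level modules in `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", ", ".join(missing) if missing else "none")
```

If this reports anything as missing in the environment where MLServer runs, I would expect the tokenizer load to fail there.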
To reproduce, use any Japanese model. Here is an example.
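A minimal sketch of the tokenizer load, assuming the `transformers` package and the `cl-tohoku/bert-base-japanese` checkpoint. The model name is chosen for illustration only, and this loads the tokenizer directly with `transformers` rather than through MLServer.

```python
# Illustrative model name (an assumption); any model that uses the
# BERT Japanese tokenizer should behave similarly.
MODEL_NAME = "cl-tohoku/bert-base-japanese"


def load_tokenizer(model_name=MODEL_NAME):
    """Load the tokenizer for `model_name` from the Hugging Face Hub."""
    # Imported lazily so the module can be imported without transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_name)


if __name__ == "__main__":
    # Expected to raise if the Japanese tokenizer dependencies are missing.
    tokenizer = load_tokenizer()
    print(type(tokenizer).__name__)
```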