Install `fugashi`, `unidic`, `unidic-lite`, and `ipadic` as dependencies to MLServer HuggingFace to support hosting Japanese language models

jbauer2718 commented 7 months ago

Because Japanese mixes phonetic scripts and Chinese characters, tokenizers for these models need special algorithms and dictionaries. A popular example is the BERT Japanese model:

https://huggingface.co/transformers/v4.11.3/_modules/transformers/models/bert_japanese/tokenization_bert_japanese.html

Without these dependencies, `mlserver_huggingface/common.py` raises an error when it tries to load the tokenizer for the pipeline.
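
As a quick check of whether an environment is affected, the sketch below lists which of the four packages cannot be found. It is a minimal illustration, not part of MLServer: the helper name `missing_modules` is hypothetical, and it assumes the packages are imported under the module names `fugashi`, `unidic`, `unidic_lite`, and `ipadic`.

```python
import importlib.util

# Import names of the packages listed in the issue title.
# Assumption: the PyPI package `unidic-lite` is imported as `unidic_lite`.
JAPANESE_TOKENIZER_MODULES = ("fugashi", "unidic", "unidic_lite", "ipadic")


def missing_modules(names=JAPANESE_TOKENIZER_MODULES):
    """Return the module names from `names` that cannot be found.

    Hypothetical helper for illustration; it only looks the modules up
    and does not import them or load any dictionary data.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing tokenizer dependencies:", ", ".join(missing))
    else:
        print("All Japanese tokenizer dependencies can be found.")
```

An empty result only means the modules are installed; it does not confirm that a given model's tokenizer will load.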

To reproduce, use any Japanese model. Here is an example.

jbauer2718 commented 7 months ago

If someone adds me as a contributor, I am happy to fix this issue and write a test for it.

sakoush commented 7 months ago

@jbauer2718 many thanks for reporting this issue and offering to fix it. You can create a PR based on changes from your fork and we can look at it.

jbauer2718 commented 7 months ago

Hey @sakoush, I just added the above-linked PR for the team's review.

SeldonIO / MLServer #1506