SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

Check for `tokenizer_config.json` before downloading tokenizer from HF Hub #834

Closed by AmgadHasan 4 months ago

AmgadHasan commented 4 months ago

Check for a tokenizer_config.json file in the model directory. If present, infer the Whisper model type from the tokenizer config and download the correct tokenizer for that model type. This change adds support for inferring whisper-large-v3 from the tokenizer config.

Previously, if there was no tokenizer.json file but there was a tokenizer_config.json file, the loader ignored this file and automatically downloaded the tokenizer for whisper-tiny.en or whisper-tiny. This caused issues for models derived from whisper-large-v3, which has a different tokenizer from those two models. For example, it could cause the model to perform translation even when the user specified the task as "transcribe", since the task tokens have different token IDs in large-v3.
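
For readers skimming the thread, the loading logic described above amounts roughly to the sketch below. The function name, the `name_or_path` key, and the exact fallback identifiers are illustrative assumptions, not the actual faster-whisper code:

```python
import json
import os

from tokenizers import Tokenizer


def load_whisper_tokenizer(model_dir: str, is_multilingual: bool) -> Tokenizer:
    """Load the tokenizer shipped with the model, or fall back to a Hub download."""
    tokenizer_file = os.path.join(model_dir, "tokenizer.json")
    if os.path.isfile(tokenizer_file):
        # Preferred path: the converted model ships its own tokenizer.json.
        return Tokenizer.from_file(tokenizer_file)

    config_file = os.path.join(model_dir, "tokenizer_config.json")
    if os.path.isfile(config_file):
        # Behaviour proposed in this PR: infer which OpenAI checkpoint the model
        # was derived from (e.g. "openai/whisper-large-v3") and fetch that
        # checkpoint's tokenizer instead of defaulting to tiny.
        with open(config_file, encoding="utf-8") as f:
            config = json.load(f)
        repo_id = config.get("name_or_path")  # illustrative key; the PR may inspect a different field
        if repo_id:
            return Tokenizer.from_pretrained(repo_id)

    # Previous behaviour: always fall back to the tiny tokenizer, whose task-token
    # IDs differ from large-v3 and caused "transcribe" to be decoded as translation.
    fallback = "openai/whisper-tiny" if is_multilingual else "openai/whisper-tiny.en"
    return Tokenizer.from_pretrained(fallback)
```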

trungkienbkhn commented 4 months ago

@AmgadHasan, hello. Currently, most FW conversion models have a tokenizer.json file in the model path and do not have a tokenizer_config.json file, so I believe your case is not very common. If your large-v3 model is missing the tokenizer.json file, I think adding this file to the path is simpler than changing the FW code to adapt to it (which would also require adding the tokenizer_config.json file). For the conversion command, you can add the option --copy_files tokenizer.json to include the tokenizer file in the model path, for example as shown below.
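
For reference, a conversion command along these lines copies the tokenizer into the converted model directory; the model name and output path here are only examples:

```sh
ct2-transformers-converter --model openai/whisper-large-v3 \
    --output_dir whisper-large-v3-ct2 \
    --copy_files tokenizer.json
```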