SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

Check for `tokenizer_config.json` before downloading tokenizer from HF Hub #834

Closed by AmgadHasan 4 months ago

AmgadHasan commented 4 months ago

Check for a tokenizer_config.json file in the model directory. If present, infer the Whisper model type from the tokenizer config and download the correct tokenizer for that model type. This change adds support for inferring whisper-large-v3 from the tokenizer config.

Previously, if there was no tokenizer.json file but there was a tokenizer_config.json file, the loader ignored this file and automatically downloaded the tokenizer for whisper-tiny.en or whisper-tiny. This caused issues for models derived from whisper-large-v3, which has a different tokenizer from those two models. For example, it could cause the model to perform translation even when the user specified the task as "transcribe", since the task tokens have different token IDs in large-v3.
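
For readers skimming the thread, the loading logic described above amounts roughly to the sketch below. The function name, the `name_or_path` key, and the exact fallback identifiers are illustrative assumptions, not the actual faster-whisper code:

```python
import json
import os

from tokenizers import Tokenizer


def load_whisper_tokenizer(model_dir: str, is_multilingual: bool) -> Tokenizer:
    """Load the tokenizer shipped with the model, or fall back to a Hub download."""
    tokenizer_file = os.path.join(model_dir, "tokenizer.json")
    if os.path.isfile(tokenizer_file):
        # Preferred path: the converted model ships its own tokenizer.json.
        return Tokenizer.from_file(tokenizer_file)

    config_file = os.path.join(model_dir, "tokenizer_config.json")
    if os.path.isfile(config_file):
        # Behaviour proposed in this PR: infer which OpenAI checkpoint the model
        # was derived from (e.g. "openai/whisper-large-v3") and fetch that
        # checkpoint's tokenizer instead of defaulting to tiny.
        with open(config_file, encoding="utf-8") as f:
            config = json.load(f)
        repo_id = config.get("name_or_path")  # illustrative key; the PR may inspect a different field
        if repo_id:
            return Tokenizer.from_pretrained(repo_id)

    # Previous behaviour: always fall back to the tiny tokenizer, whose task-token
    # IDs differ from large-v3 and caused "transcribe" to be decoded as translation.
    fallback = "openai/whisper-tiny" if is_multilingual else "openai/whisper-tiny.en"
    return Tokenizer.from_pretrained(fallback)
```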

trungkienbkhn commented 4 months ago

@AmgadHasan, hello. Currently, most FW conversion models have a tokenizer.json file in the model path and do not have a tokenizer_config.json file, so I believe your case is not very common. If your large-v3 model is missing the tokenizer.json file, I think adding this file to the path is simpler than changing the FW code to adapt to it (which would also require adding the tokenizer_config.json file). For the conversion command, you can add the option --copy_files tokenizer.json to include the tokenizer file in the model path, for example as shown below.
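
For reference, a conversion command along these lines copies the tokenizer into the converted model directory; the model name and output path here are only examples:

```sh
ct2-transformers-converter --model openai/whisper-large-v3 \
    --output_dir whisper-large-v3-ct2 \
    --copy_files tokenizer.json
```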