SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

Issues with automatic language detection #918

Closed: jacobtang closed this issue 2 months ago

jacobtang commented 2 months ago

Does faster-whisper support restricting transcription to 2-3 selected languages? For example, can it transcribe both English and Chinese in the same audio? When using the automatic language detection mode, the transcription results may include English, Chinese, Korean, and Italian text.

ngcheeyuan commented 2 months ago

Hi Jacob. You should take a look at Whisper's architecture.

Language prediction is done only on the first 30 seconds of the audio, and subsequent chunks use the same predicted language.

You might want to try diarization, then run Whisper on the separated chunks. This is a limitation of the model: it can't handle more than one language well.

Edit: I mean it can handle code switching a little if the text often appears together.
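
One way to approximate the "2-3 selected languages" request is to restrict the detected language to an allow-list before transcribing. Below is a minimal sketch, assuming the `(language, probability)` pairs come from `info.all_language_probs` as returned by `WhisperModel.transcribe`; `pick_allowed_language` and `transcribe_restricted` are hypothetical helpers written for illustration, not part of faster-whisper, and the model size and the `"en"` fallback are arbitrary choices.

```python
from typing import Iterable, List, Optional, Tuple


def pick_allowed_language(
    language_probs: Optional[List[Tuple[str, float]]],
    allowed: Iterable[str],
    default: str = "en",
) -> str:
    """Return the most probable language that is also in `allowed`.

    Falls back to `default` when no allowed language was scored.
    """
    allowed_set = set(allowed)
    candidates = [
        (lang, prob) for lang, prob in (language_probs or []) if lang in allowed_set
    ]
    if not candidates:
        return default
    return max(candidates, key=lambda pair: pair[1])[0]


def transcribe_restricted(audio_path: str, allowed=("en", "zh"), model_size="small"):
    """Hypothetical wrapper: detect, clamp to the allow-list, then transcribe."""
    # Imported lazily so the helper above can be used without the library installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    # First call: only the returned info is used (assumption: detection has
    # already run by the time transcribe() returns; segments are decoded lazily).
    _, info = model.transcribe(audio_path)
    language = pick_allowed_language(info.all_language_probs, allowed)
    # Second call: force the chosen language for the whole file.
    segments, _ = model.transcribe(audio_path, language=language)
    return language, [segment.text for segment in segments]
```

Note that this still picks a single language for the whole file, so it only stops the detector from drifting to Korean or Italian; it does not by itself handle audio that switches between English and Chinese.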

Jiltseb commented 2 months ago

With the new commit, if you set multilingual=True, it checks the language every 30 seconds and routes each chunk to transcribe or translate as needed (based on the output_language parameter). While a bit hacky, it can transcribe code-switched content with some margin of error.
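
A rough sketch of how the `multilingual=True` option might be called. The parameter name comes from this thread; whether the installed faster-whisper version accepts it is an assumption, so the sketch checks the function signature first. `supports_parameter` and `transcribe_code_switched` are hypothetical helpers, and the model size is arbitrary.

```python
import inspect


def supports_parameter(func, name: str) -> bool:
    """Return True if `func` declares a parameter called `name`."""
    return name in inspect.signature(func).parameters


def transcribe_code_switched(audio_path: str, model_size: str = "small"):
    """Hypothetical wrapper: enable per-chunk language detection when available."""
    # Imported lazily so supports_parameter() can be used on its own.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    kwargs = {}
    # Only pass the flag if this version of transcribe() accepts it.
    if supports_parameter(model.transcribe, "multilingual"):
        kwargs["multilingual"] = True
    segments, info = model.transcribe(audio_path, **kwargs)
    return [(segment.start, segment.end, segment.text) for segment in segments]
```

If the parameter is missing, the call silently falls back to the default behavior of detecting the language once, which matches the limitation described earlier in the thread.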

ngcheeyuan commented 2 months ago

@trungkienbkhn Ah, that's interesting. Does it still take the previous text as context? I wonder how that performs. And how do these 30-second chunks work with VAD?

Jiltseb commented 2 months ago

Code-switching is an optional feature available only in sequential execution, not in the batched version (you can check the function arguments in both cases), so it takes the previous context unless specified otherwise.