Closed jacobtang closed 2 months ago
Hi Jacob. You should take a look at Whisper's architecture.
Language prediction is done only on the first 30 seconds of the audio, and subsequent chunks reuse the same predicted language.
You might want to try speaker diarization and then run Whisper on the separated chunks. This is a limitation of the model: it can't handle more than one language well.
Edit: I mean it can handle code-switching a little when the languages often appear together in text.
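To make the limitation concrete, here is a minimal sketch of the default behaviour: the language is detected once, on the first 30-second window, and reused for every later chunk. The function names (`detect_language`, `transcribe_chunk`) are hypothetical stand-ins, not the real faster-whisper API:

```python
def detect_language(chunk):
    # Stand-in for the model's language ID on a 30-second window.
    return chunk["lang"]

def transcribe_chunk(chunk, language):
    # Stand-in for decoding one chunk with a fixed language token.
    return f"[{language}] {chunk['text']}"

def transcribe(chunks):
    # Only the first 30-second chunk is inspected for language;
    # every later chunk is decoded with that same language.
    language = detect_language(chunks[0])
    return [transcribe_chunk(c, language) for c in chunks]

# English audio that switches to Chinese after the first chunk:
audio = [
    {"lang": "en", "text": "hello"},
    {"lang": "zh", "text": "ni hao"},
]
print(transcribe(audio))
# Both chunks are decoded as English, so the Chinese chunk is mishandled.
```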
With the new commit, it checks the language every 30 seconds and reroutes to transcribe or translate as needed (based on the `output_language` parameter) if you set `multilingual=True`. While a bit hacky, it can transcribe code-switched content with a margin of error.
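The rerouting described above amounts to something like this sketch, with stub functions in place of the real model (the routing rule and all names here are illustrative assumptions, not the actual implementation):

```python
def detect_language(chunk):
    # Stand-in for re-running language ID on a 30-second window.
    return chunk["lang"]

def run_task(chunk, task):
    # Stand-in for decoding one chunk with the chosen Whisper task.
    return f"{task}:{chunk['text']}"

def transcribe(chunks, multilingual=False, output_language="en"):
    results = []
    language = detect_language(chunks[0])
    for chunk in chunks:
        if multilingual:
            # Re-check the language on every 30-second chunk.
            language = detect_language(chunk)
        # Assumed routing rule: translate when the detected language
        # differs from the requested output language, else transcribe.
        task = "transcribe" if language == output_language else "translate"
        results.append(run_task(chunk, task))
    return results

audio = [
    {"lang": "en", "text": "hello"},
    {"lang": "zh", "text": "ni hao"},
]
print(transcribe(audio, multilingual=True))
# The Chinese chunk is rerouted to the translate task.
```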
@trungkienbkhn hah, that's interesting. Does it still take the previous text as context? I wonder how that performs. And how do these 30-second chunks interact with VAD?
Code-switching is an optional feature only in the sequential execution path, not in the batched version (you can check the function arguments in both cases), so it takes the previous text as context unless specified otherwise.
Does faster-whisper support restricting transcription to 2-3 selected languages? For example, can it transcribe both English and Chinese simultaneously? When using automatic language detection, the transcription results may include English, Chinese, Korean, and Italian text.