SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

Number sequence is not transcribed when a chunk starts with it #1174

Open vkras opened 4 days ago

vkras commented 4 days ago

I'm attaching an audio file (the problem is also reproducible with longer files split into chunks). Disabling VAD helps, but that does not explain the issue, because VAD correctly identifies where speech starts (around 2.5 seconds). Both the batched and non-batched methods are affected.

With VAD:
chunks_metadata [{'start_time': 2.416, 'end_time': 12.72}]
duration_after_vad 10.304
Sentence: [0 7.83s -> 12.13s] It's important that that first piece can't be misinterpreted as a decimal.

Without VAD:
chunks_metadata [{'start_time': 0.0, 'end_time': 13.11925}]
duration_after_vad 13.11925
Sentence: [0 3.42s -> 12.14s] 8892. It's important that that first piece can't be misinterpreted as a decimal.
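The numbers in the two logs above can be cross-checked with a few lines of Python. This is only a sanity-check sketch: it assumes that `duration_after_vad` is the summed length of the VAD chunks and that the segment timestamps are expressed in original-audio time. The helper names `kept_duration` and `covered` are made up for illustration and are not part of faster-whisper.

```python
import math

# Values copied from the "With VAD" log above.
vad_chunks = [{"start_time": 2.416, "end_time": 12.72}]
duration_after_vad = 10.304

# Span that carries "8892." in the run without VAD (3.42s) but is missing
# from the run with VAD, whose first segment starts at 7.83s.
missing_start, missing_end = 3.42, 7.83


def kept_duration(chunks):
    """Total audio length retained by VAD (hypothetical helper)."""
    return sum(c["end_time"] - c["start_time"] for c in chunks)


def covered(chunks, start, end):
    """True if [start, end] lies entirely inside a single VAD chunk."""
    return any(c["start_time"] <= start and end <= c["end_time"] for c in chunks)


duration_matches = math.isclose(kept_duration(vad_chunks), duration_after_vad, abs_tol=1e-6)
digits_inside_chunk = covered(vad_chunks, missing_start, missing_end)
print(duration_matches, digits_inside_chunk)
```

Under those assumptions, the span where the digits are spoken falls inside the chunk that VAD kept, which is consistent with the observation that VAD itself did not cut the digits out.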

digit-speech.zip

Purfview commented 4 days ago

1) Whisper's model can just miss something in a transcription for no apparent reason.
2) A one-byte change in the audio can trigger a different result.
3) A one-token change in the prompt can trigger a different result.

Btw, for me it's the opposite: with VAD "8892" appears, without VAD it disappears. 😄

Maybe it's unusual for the model to start with digits; try initial_prompt="OK".
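A minimal sketch of how that suggestion could be tried, comparing runs with and without VAD and with and without the prompt. It uses faster-whisper's `WhisperModel.transcribe` with its `vad_filter` and `initial_prompt` arguments; the model size, device, compute type, and the helper names `build_transcribe_kwargs`, `transcribe`, and `compare` are assumptions chosen for illustration, not something taken from the thread.

```python
def build_transcribe_kwargs(use_vad, prompt=None):
    """Collect the options that vary between runs (hypothetical helper)."""
    kwargs = {"vad_filter": use_vad}
    if prompt is not None:
        kwargs["initial_prompt"] = prompt
    return kwargs


def transcribe(audio_path, use_vad, prompt=None, model_size="small"):
    """Run one transcription and return (start, end, text) tuples."""
    # Imported lazily so the helper above can be used without the package.
    from faster_whisper import WhisperModel

    # Model size, device and compute type are assumptions, not from the report.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(
        audio_path, **build_transcribe_kwargs(use_vad, prompt)
    )
    # `segments` is a generator; iterating it performs the actual decoding.
    return [(seg.start, seg.end, seg.text) for seg in segments]


def compare(audio_path):
    """Print results for every VAD / prompt combination."""
    for use_vad in (True, False):
        for prompt in (None, "OK"):
            print(f"vad_filter={use_vad} initial_prompt={prompt!r}")
            for start, end, text in transcribe(audio_path, use_vad, prompt):
                print(f"  [{start:.2f}s -> {end:.2f}s]{text}")
```

Calling `compare(...)` with the path to the audio extracted from the attached archive would print the four variants side by side. Since results can differ between machines, as noted above, the output is not guaranteed to match either report.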