Open vkras opened 4 days ago
1) Whisper's model can simply miss something in the transcription for no apparent reason.
2) A one-byte change in the audio can trigger a different result.
3) A one-token change in the prompt can trigger a different result.
Btw, for me it's the opposite: with VAD, "8892" appears; without VAD, it disappears. 😄
Maybe it's unusual for the model to start with digits; try initial_prompt="OK".
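To check the suggestions above together, here is a rough sketch that runs every combination of VAD on/off and prompt/no prompt. It assumes the faster-whisper Python package (the thread does not name the library); the model size "small" and the audio path you pass in are placeholders, not values from this issue.

```python
from itertools import product


def build_variants(prompts=(None, "OK"), vad_options=(True, False)):
    """Return one kwargs dict per (vad_filter, initial_prompt) combination."""
    return [
        {"vad_filter": vad, "initial_prompt": prompt}
        for vad, prompt in product(vad_options, prompts)
    ]


def run_variants(audio_path, model_size="small"):
    """Transcribe the same file once per variant and collect the text.

    Assumes faster-whisper is installed; model_size is a placeholder.
    """
    # Imported lazily so build_variants() works without the library.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    results = {}
    for kwargs in build_variants():
        segments, _info = model.transcribe(audio_path, **kwargs)
        key = (kwargs["vad_filter"], kwargs["initial_prompt"])
        results[key] = " ".join(seg.text.strip() for seg in segments)
    return results
```

Calling run_variants() on the extracted audio file and comparing the four strings should show which combinations keep the leading digits. Given point 2 and 3 above, a single run per variant is only weak evidence.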
I'm attaching an audio file (the issue is also reproducible with longer files split into chunks). Disabling VAD helps, but that does not explain the issue, because VAD correctly identifies where speech starts (around 2.5 seconds). It affects both the batched and non-batched methods.
With VAD:
chunks_metadata [{'start_time': 2.416, 'end_time': 12.72}]
duration_after_vad 10.304
Sentence: [0 7.83s -> 12.13s] It's important that that first piece can't be misinterpreted as a decimal.

Without VAD:
chunks_metadata [{'start_time': 0.0, 'end_time': 13.11925}]
duration_after_vad 13.11925
Sentence: [0 3.42s -> 12.14s] 8892. It's important that that first piece can't be misinterpreted as a decimal.
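The two logs can be compared mechanically. The sketch below flags any stretch that VAD marked as speech but that no transcribed segment covers. The helper name uncovered_speech and the 1.5 s tolerance are illustrative assumptions, not part of any library; the numbers are copied from the output above.

```python
def uncovered_speech(vad_chunks, segments, tolerance=1.5):
    """Return (start, end) spans inside VAD chunks that no segment covers.

    vad_chunks: list of {'start_time': float, 'end_time': float}
    segments:   list of (start, end) tuples in original-audio seconds
    tolerance:  gaps shorter than this (seconds) are ignored (assumed value)
    """
    gaps = []
    for chunk in vad_chunks:
        cursor = chunk["start_time"]
        for seg_start, seg_end in sorted(segments):
            # Skip segments that lie entirely outside the remaining chunk.
            if seg_end <= cursor or seg_start >= chunk["end_time"]:
                continue
            if seg_start - cursor > tolerance:
                gaps.append((round(cursor, 3), round(seg_start, 3)))
            cursor = max(cursor, seg_end)
        if chunk["end_time"] - cursor > tolerance:
            gaps.append((round(cursor, 3), round(chunk["end_time"], 3)))
    return gaps


# Speech region reported by VAD in the first log.
speech = [{"start_time": 2.416, "end_time": 12.72}]

with_vad = uncovered_speech(speech, [(7.83, 12.13)])
without_vad = uncovered_speech(speech, [(3.42, 12.14)])

print(with_vad)     # [(2.416, 7.83)]
print(without_vad)  # []
```

With VAD enabled, roughly 5.4 seconds of detected speech (2.416 s to 7.83 s) produce no text, which is where "8892" is spoken. Without VAD, the segment begins about one second after the detected speech start, within the tolerance.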
digit-speech.zip