SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

ValueError: Input audio chunk is too short when transcribing numpy array #878

Closed dancinkid6 closed 2 weeks ago

dancinkid6 commented 2 weeks ago

I am trying to transcribe real-time audio with sounddevice and faster-whisper.
Instead of saving it to a temp file, I want to pass the recorded NumPy array directly to the model, but I just cannot get it to work.

model = WhisperModel(
    "distil-small.en",
    device="cuda",
    compute_type="float16",
    download_root="whisper_models",
    local_files_only=True,
)

def process_buffer(audio_data):
    audio_data = np.concatenate(audio_data, axis=0)
    audio_data = audio_data.astype(np.float16)
    print(f"Audio data dtype after conversion: {audio_data.dtype}")

    segments, _ = model.transcribe(audio_data, vad_filter=True)

This just raises ValueError: Input audio chunk is too short.
The audio chunks themselves are fine: if I first write them to a file with soundfile and then transcribe the file with faster-whisper, it works well. But when I pass the array directly to the model, it breaks. I've confirmed the audio data is fp16. I read that this might be a VAD-filter problem, and I played around with its parameters, but nothing worked.

It has been bothering me for the past two days. Can anyone help? Thanks
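For context, a likely source of this kind of problem (an assumption, not confirmed by the post): sounddevice delivers recorded blocks shaped (frames, channels), so even mono audio arrives as a 2-D array, and concatenating the blocks keeps it 2-D. A minimal sketch with dummy chunks:

```python
import numpy as np

# sounddevice's InputStream callback hands each block as a 2-D array
# shaped (frames, channels), e.g. (1024, 1) for mono audio.
chunks = [np.zeros((1024, 1), dtype=np.float32) for _ in range(5)]

buffer = np.concatenate(chunks, axis=0)
print(buffer.shape)  # (5120, 1) -- still 2-D, not the 1-D array the VAD expects
```

If the VAD chunking logic iterates over such an array, each "chunk" it sees is a single (1,)-shaped frame, which would plausibly trip a "too short" check.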

trungkienbkhn commented 2 weeks ago

@dancinkid6, hello. You should compare the audio values here, when writing the data to a file, with the audio_data in your example above. BTW, could you show the full code to reproduce this problem?

dancinkid6 commented 2 weeks ago

I fixed it by going in and reading the VAD code. It turned out I just needed to flatten audio_data.
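A sketch of the fix as described (assuming the 2-D (frames, 1) buffer layout; the float32 cast is an added assumption, since faster-whisper expects float32 audio input; compute_type="float16" only controls the model weights, not the audio dtype):

```python
import numpy as np

def process_buffer(audio_data):
    # Join the recorded blocks, then flatten (frames, 1) -> (frames,)
    # so the VAD receives the 1-D array it expects.
    audio = np.concatenate(audio_data, axis=0).flatten()
    # Cast the samples to float32 for the model; float16 is only needed
    # for the weights (compute_type), not the audio.
    return audio.astype(np.float32)
```

The returned array can then be passed to model.transcribe(audio, vad_filter=True) as in the original snippet.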