alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

Timestamps aren't working for me and I can't tell why #1603

Closed. LinkBenjamin closed this issue 3 days ago.

LinkBenjamin commented 3 days ago

Hey, I've been experimenting with vosk and found it to do an incredible job of transcribing! I do a lot of social media work and I was hoping to use it to find certain phrases in a video and clip out short-form videos (tiktok/reels) based on the timestamps.

I extract the audio from my video just fine in Python, transcribe it, and feed the transcript to an LLM to find the clips - but when I add timestamp capture to my transcriber, I end up with blank results.

Here's an example of my Python code attempting to transcribe - I recorded a quick m4a audio clip with QuickTime on my Mac and used it for my example.

When I run the code below, I get a successful transcript printed by the line print(f"Final Result JSON: {final_result_json}"), but the last if statement falls through to print("No word-level results in the final result.").

I've spent a couple hours trying to figure out what I'm doing wrong and I'm stuck. Can you tell me what I'm missing?

(Note - this isn't my final code, just something I threw together to debug my vosk api calls with, hence the extra print statements and stuff)

import os
import wave
import json
from pydub import AudioSegment
from vosk import Model, KaldiRecognizer

# Path to the Vosk model and audio file
model_path = "vosk-model-en-us-0.42-gigaspeech"
audio_file = "Untitled.m4a"
audio_file_2 = 'a.wav'
audio = AudioSegment.from_file(audio_file)
audio = audio.set_frame_rate(16000).set_channels(1)
audio.export(audio_file_2, format='wav')

# Load the Vosk model
if not os.path.exists(model_path):
    print("Please download the model from https://alphacephei.com/vosk/models and unpack as 'model' in the current folder.")
    exit(1)

model = Model(model_path)

# Open the audio file
wf = wave.open(audio_file_2, "rb")
if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
    print("Audio file must be WAV format mono PCM.")
    exit(1)

# Print audio file properties for debugging
print(f"Sample Rate: {wf.getframerate()}, Channels: {wf.getnchannels()}, Sample Width: {wf.getsampwidth()}, Compression Type: {wf.getcomptype()}")

# Create a recognizer object
rec = KaldiRecognizer(model, wf.getframerate())

# Read and process the audio file
results = []
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        result = rec.Result()
        results.append(json.loads(result))
    else:
        partial_result = rec.PartialResult()
        print(f"Partial Result: {partial_result}")  # Debugging info for partial results

# Get the final result and check its content
final_result = rec.FinalResult()
print(f"Raw Final Result: {final_result}")  # Print raw final result for debugging
final_result_json = json.loads(final_result)
print(f"Final Result JSON: {final_result_json}")

# Add the final result to the results list if it's not empty
if final_result_json:
    results.append(final_result_json)

# Print word-level timestamps if available in the final result
if 'result' in final_result_json:
    for word in final_result_json['result']:
        print(f"Word: {word['word']}, Start time: {word['start']}, End time: {word['end']}")
else:
    print("No word-level results in the final result.")

# Close the audio file
wf.close()
nshmyrev commented 3 days ago

Hi. You probably missed recognizer.SetWords(True)

https://github.com/alphacep/vosk-api/blob/master/python/example/test_simple.py#L23
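Without SetWords(True), Result() and FinalResult() only return the plain "text" field, which is why your 'result' check falls through. A minimal sketch of the fix against your loop - the only change is enabling word output on the recognizer before feeding audio:

rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)  # include per-word timestamps in Result()/FinalResult()

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        segment = json.loads(rec.Result())
        # with SetWords(True), finished segments carry a "result" list of
        # {"word": ..., "start": ..., "end": ...} entries
        for w in segment.get("result", []):
            print(f"Word: {w['word']}, Start time: {w['start']}, End time: {w['end']}")

final_result_json = json.loads(rec.FinalResult())
for w in final_result_json.get("result", []):
    print(f"Word: {w['word']}, Start time: {w['start']}, End time: {w['end']}")

Recent versions also have SetPartialWords(True) if you want word timing in partial results too.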

LinkBenjamin commented 3 days ago

Thanks! I don't know how I've overlooked that so many times 😳