KoljaB / RealtimeSTT

A robust, efficient, low-latency speech-to-text library with advanced voice activity detection, wake word activation and instant transcription.

How to pass an audio file and transcribe it #52

Open GayaaniD opened 2 months ago

GayaaniD commented 2 months ago

Hi team, I am working on implementing a voice chatbot and using the RealtimeSTT library for the speech-to-text part. I am attempting to provide an audio file as input and transcribe it. You mentioned that if we don't want to use a microphone, we should set 'use_microphone' to False and feed the audio as 16-bit PCM chunks to obtain the transcribed text as output. I have implemented the code as below.

import soundfile as sf
import numpy as np
import json
from scipy.signal import resample
from RealtimeSTT import AudioToTextRecorder

def format_audio(filepath):
    # Read the audio file
    data, samplerate = sf.read(filepath)
    print("data------------->",data)
    pcm_16 = np.maximum(-32768, np.minimum(32767, data*32768)).astype(np.int16)
    print("pcm data----------------->",pcm_16)
    # Create metadata
    metadata = {
        "sampleRate": samplerate,
    }
    metadata_json = json.dumps(metadata)
    metadata_bytes = metadata_json.encode('utf-8')

    # Create buffer for metadata length (4 bytes for 32-bit integer)
    metadata_length = len(metadata_bytes).to_bytes(4, byteorder='little')

    # Combine metadata length, metadata, and audio data into a single message
    combined_data = metadata_length + metadata_bytes + pcm_16.tobytes()
    # print("Combined data: ", combined_data)
    return combined_data

def decode_and_resample(
        audio_data,
        original_sample_rate,
        target_sample_rate):

    # Decode 16-bit PCM data to numpy array
    audio_np = np.frombuffer(audio_data, dtype=np.int16)

    # Calculate the number of samples after resampling
    num_original_samples = len(audio_np)
    num_target_samples = int(num_original_samples * target_sample_rate /
                             original_sample_rate)

    # Resample the audio
    resampled_audio = resample(audio_np, num_target_samples)

    return resampled_audio.astype(np.int16).tobytes()

if __name__ == '__main__':
    combined_data = format_audio('chat1.wav')
    recorder_config = {
        'spinner': False,
        'use_microphone': False,
        'model': 'tiny.en',
        'language': 'en',
        'silero_sensitivity': 0.4,
        'webrtc_sensitivity': 2,
        'post_speech_silence_duration': 0.7,
        'min_length_of_recording': 0,
        'min_gap_between_recordings': 0,
        'enable_realtime_transcription': False,
        'realtime_processing_pause': 0,
        'realtime_model_type': 'tiny.en'
    }

    recorder = AudioToTextRecorder(**recorder_config)
    metadata_length = int.from_bytes(combined_data[:4], byteorder='little')
    metadata_json = combined_data[4:4+metadata_length].decode('utf-8')
    metadata = json.loads(metadata_json)
    sample_rate = metadata['sampleRate']
    chunk = combined_data[4+metadata_length:]
    resampled_chunk = decode_and_resample(chunk, sample_rate, 16000)
    recorder.feed_audio(resampled_chunk)

    # Get the transcribed text
    text = recorder.text()
    print(f"Transcribed text: {text}")

but I got the following output:

[2024-04-28 11:36:10.671] [ctranslate2] [thread 21416] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
data-------------> [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 9.15527344e-05 1.22070312e-04 1.52587891e-04]
pcm data-----------------> [0 0 0 ... 3 4 5]
[2024-04-28 11:36:57.949] [ctranslate2] [thread 17900] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
RealTimeSTT: root - WARNING - Audio queue size exceeds latency limit. Current size: 104. Discarding old audio chunks.

After this I expect to get the transcribed text, but instead the script runs forever and never produces any output after that warning message. I have no clue what I'm doing wrong. From what I can tell, real-time transcription from the microphone works fine (while I speak, it continuously transcribes), but how do we transcribe an audio file? It would be helpful if you could provide a solution for passing an audio file and getting the text from it.

Vageeshan commented 2 months ago

I'm facing a similar issue. I'm trying to figure out how to pass an audio file and receive the corresponding text. On the client side, several processes are being executed to prepare the Blob file, which is then sent to the socket. On the Python socket side, the remaining speech-to-text processing occurs. I'm wondering how to achieve the same result by passing the audio file as a parameter?

KoljaB commented 2 months ago

Please use the really great faster-whisper library to transcribe an audio file (this lib is just not the right tool for that).

Explanation: RealtimeSTT depends on timing. If you wanted to transcribe an audio file with it, you would have to feed a chunk, then wait for the time it would take to play that chunk back before feeding the next one. Processing would take much, much longer than with faster-whisper, which also delivers a full transcript in one pass, so I suggest using that library, which was designed exactly for this purpose.
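For reference, a minimal faster-whisper sketch for transcribing a file in one pass might look like this (the model size, device and compute type are just example choices):

from faster_whisper import WhisperModel

# Load a small English model on CPU; pick a larger model or a GPU device as needed.
model = WhisperModel("tiny.en", device="cpu", compute_type="int8")

# transcribe() returns a generator of segments plus info about the audio.
segments, info = model.transcribe("chat1.wav")
print(" ".join(segment.text.strip() for segment in segments))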

Also, text = recorder.text() would only yield the first detected full sentence; you would have to call it repeatedly to retrieve the full transcript. By the way, use recorder.shutdown() or create the recorder with a "with" statement (context manager) to prevent it from running forever.
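If you still want to push a file through RealtimeSTT for experimentation, a rough sketch of the feed-and-wait pattern described above could look like the following. The chunk size, pacing and stop condition are illustrative assumptions, not documented API behaviour, and the final recorder.text() call can still block if the end of speech is never detected:

import threading
import time

from RealtimeSTT import AudioToTextRecorder

def transcribe_pcm(pcm_bytes, sample_rate=16000, chunk_samples=1024):
    # pcm_bytes: 16-bit mono PCM at 16 kHz, e.g. the resampled_chunk produced above.
    # The "with" block shuts the recorder down on exit, so the script does not hang.
    with AudioToTextRecorder(use_microphone=False, spinner=False,
                             model='tiny.en', language='en') as recorder:

        def feed():
            # Feed small chunks and sleep for roughly their playback duration,
            # so the voice activity detection sees a realistic timeline.
            step = chunk_samples * 2  # two bytes per int16 sample
            for start in range(0, len(pcm_bytes), step):
                recorder.feed_audio(pcm_bytes[start:start + step])
                time.sleep(chunk_samples / sample_rate)

        feeder = threading.Thread(target=feed, daemon=True)
        feeder.start()

        # text() returns one detected sentence per call, so keep calling it
        # while audio is still being fed; if the file ends without enough
        # trailing silence the last call may block, which is part of why
        # faster-whisper is the better fit here.
        sentences = []
        while feeder.is_alive():
            sentences.append(recorder.text())
        return " ".join(sentences)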

GayaaniD commented 2 months ago

Thank you, I will check further.