GayaaniD opened this issue 6 months ago
I'm facing a similar issue. I'm trying to figure out how to pass an audio file and receive the corresponding text. On the client side, several processing steps prepare the Blob file, which is then sent to the socket; on the Python socket side, the remaining speech-to-text processing happens. I'm wondering how to achieve the same result by passing the audio file directly as a parameter.
Please use the really great faster-whisper library to transcribe an audio file (RealtimeSTT is just not the right tool for that).
Explanation: RealtimeSTT depends on real-time timing. If you wanted to transcribe an audio file with it, you would have to feed a chunk and then wait for the time that chunk would take to play out before feeding the next one. Processing a file this way takes much, much longer than with faster-whisper, which also delivers a full transcript in one pass, so I suggest using that library, which was designed exactly for this purpose.
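For example, a minimal sketch with faster-whisper (model size, device, compute type and file path are just placeholder values; adjust them to your hardware):

```python
from faster_whisper import WhisperModel

# "small" is only an example model size; on a GPU you might use
# device="cuda", compute_type="float16" instead.
model = WhisperModel("small", device="cpu", compute_type="int8")

# transcribe() returns a generator of segments plus info about the audio.
segments, info = model.transcribe("audio.wav")

# Join the segment texts to get the full transcript.
transcript = " ".join(segment.text.strip() for segment in segments)
print(transcript)
```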
Also, text = recorder.text() only yields the first detected full sentence; you'd have to call it repeatedly to retrieve the full transcript. Btw, use recorder.shutdown() or create the recorder with a "with" statement (context manager) to prevent it from running forever.
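To illustrate (a rough sketch assuming default settings, not a complete program):

```python
from RealtimeSTT import AudioToTextRecorder

# The context manager makes sure the recorder is shut down cleanly;
# alternatively, call recorder.shutdown() explicitly when you are done.
with AudioToTextRecorder() as recorder:
    while True:
        # Each text() call blocks until the next full sentence is detected,
        # so keep calling it to collect the whole transcript.
        print(recorder.text())
```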
Thank you, I will check it further.
Hi team, I am working on implementing a voice chatbot, and for the speech-to-text part I am using the RealtimeSTT library. I am attempting to provide an audio file as input and transcribe it. You mentioned that if we don't want to use a microphone, we should set 'set_microphone' to False and provide the audio as 16-bit PCM chunks to obtain the transcribed text as output. I have implemented the code as below.
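Roughly, the approach I am attempting looks like this (a simplified sketch rather than my exact code; in particular, I am assuming feed_audio() is the intended way to pass chunks when the microphone is disabled):

```python
import wave
from RealtimeSTT import AudioToTextRecorder

# Microphone disabled so the recorder only consumes the audio we feed it.
recorder = AudioToTextRecorder(use_microphone=False)

# Read the WAV file and push it to the recorder as 16-bit PCM chunks.
# (Assumes the file is already 16 kHz mono 16-bit PCM; resampling omitted.)
with wave.open("input.wav", "rb") as wf:
    chunk = wf.readframes(1024)
    while chunk:
        recorder.feed_audio(chunk)
        chunk = wf.readframes(1024)

# Expecting the first detected sentence here.
print("transcribed text ------>", recorder.text())
recorder.shutdown()
```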
But I got the following response:
```
[2024-04-28 11:36:10.671] [ctranslate2] [thread 21416] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
data-------------> [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 9.15527344e-05 1.22070312e-04 1.52587891e-04]
pcm data-----------------> [0 0 0 ... 3 4 5]
[2024-04-28 11:36:57.949] [ctranslate2] [thread 17900] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
RealTimeSTT: root - WARNING - Audio queue size exceeds latency limit. Current size: 104. Discarding old audio chunks.
```
After this I expect to get the transcribed text as output, but it runs forever and never produces any output after that warning message. I have no clue what I'm doing wrong. From my testing, real-time transcription works fine (when we speak, it continuously transcribes), but how do we transcribe a given audio file? It would be helpful if you could provide a solution for passing in an audio file and getting the text back.