infinityp913 opened 1 year ago
Maybe you could use a different approach that requires no modifications to the code and just routes the audio?
I did it using https://github.com/gavv/webrtc-cli

First I created a PulseAudio virtual sink:

pactl load-module module-null-sink sink_name=vspeaker sink_properties=device.description=virtual_speaker

Then I started whisper's ./stream and webrtc-cli:

./webrtc-cli --answer

and followed the instructions from the demo web page (copying the CLI output and pasting the answer into the page): https://gavv.net/webrtc-cli/ After that I connected everything together using pipewire/qpwgraph. Edit: maybe the intermediary PulseAudio device is not needed?
UPDATE: I found https://github.com/shirayu/whispering and I'm testing sending a PCMU stream to it, to see if it works or needs some conversion.
@mab122 interesting approach. Is it going to limit usage to just one individual user, or can many virtual PulseAudio devices be created to feed separate stream.cpp instances?
I don't have as much expertise in C/C++ as you guys, but I was able to modify the Golang bindings example to listen for an incoming WSS stream.
My problem is more about what kind of RTP packets I can send. I've started sending PCMU-encoded packets, but they need conversion before they can be pushed to the Process() function. Perhaps I should just stick with the C++ stream example and add a WSS interface to it. Any pointers or suggestions are welcome.
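For what it's worth, PCMU is G.711 mu-law, so each payload byte expands to one signed 16-bit linear sample. A minimal sketch of that expansion in Go (function names are mine, not from any of the repos above; note whisper.cpp also expects 16 kHz input, so 8 kHz PCMU audio would additionally need resampling, which is not shown here):

```go
package main

import "fmt"

// muLawDecode expands one G.711 mu-law byte into a 16-bit linear PCM sample.
func muLawDecode(u byte) int16 {
	u = ^u // mu-law bytes are stored bit-inverted
	t := int((u&0x0F)<<3) + 0x84
	t <<= (u & 0x70) >> 4 // exponent segment
	if u&0x80 != 0 {
		return int16(0x84 - t) // sign bit set: negative sample
	}
	return int16(t - 0x84)
}

// pcmuToFloat32 converts a PCMU RTP payload to the normalized
// float32 PCM that whisper.cpp's Process() consumes.
func pcmuToFloat32(payload []byte) []float32 {
	out := make([]float32, len(payload))
	for i, b := range payload {
		out[i] = float32(muLawDecode(b)) / 32768.0
	}
	return out
}

func main() {
	// 0xFF is mu-law "zero"; 0x00 is the largest negative value.
	fmt.Println(muLawDecode(0xFF), muLawDecode(0x00)) // 0 -32124
}
```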
I am also trying to do something similar, but I'm stuck at the audio byte conversion part.
I'm currently building something similar to this using the Golang bindings. What I have so far decodes incoming Opus RTP packets to PCM f32le. I am going to start working on buffering the PCM and sending it to Whisper. I'll update here as I make progress. FWIW, I plan to give a talk about this in June and will open-source the entire project as well.
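For the f32le step, the conversion itself is just rescaling. A minimal sketch, assuming the Opus decoder emits little-endian signed 16-bit PCM (the function name is illustrative, not taken from any repo linked in this thread):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// s16leToFloat32 converts little-endian signed 16-bit PCM bytes
// into normalized float32 samples in [-1, 1), the format whisper.cpp expects.
func s16leToFloat32(pcm []byte) []float32 {
	out := make([]float32, len(pcm)/2)
	for i := range out {
		s := int16(binary.LittleEndian.Uint16(pcm[2*i:]))
		out[i] = float32(s) / 32768.0
	}
	return out
}

func main() {
	// 0x8000 is the most negative sample, 0x7FFF the most positive.
	f := s16leToFloat32([]byte{0x00, 0x80, 0xFF, 0x7F})
	fmt.Println(f[0], f[1]) // -1 and a value just under 1
}
```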
Okay, so I have managed to get this working. Currently I have a WebRTC client that connects to an SFU, decodes Opus packets to PCM, and then buffers and samples that audio in a similar way to the stream.cpp example. This is in a very rough state right now, but I wanted to open-source it so others can use it as a reference. Note that this is meant to run on localhost for now: I have no idea how it will behave under packet loss, and I have not yet added a jitter buffer. I will be working on this pretty extensively over the next couple of weeks. Feel free to ask any questions you may have. I will link the relevant parts of the code below.
Receiving the RTP Packets: https://github.com/GRVYDEV/S.A.T.U.R.D.A.Y/blob/f6380bbd9e2c9ab17c68d7cdb97778bd44a01201/client/peer_connection.go#L56-L74
Decoding the packets to f32le PCM: https://github.com/GRVYDEV/S.A.T.U.R.D.A.Y/blob/f6380bbd9e2c9ab17c68d7cdb97778bd44a01201/client/audio_engine.go#L80-L93 and https://github.com/GRVYDEV/S.A.T.U.R.D.A.Y/blob/f6380bbd9e2c9ab17c68d7cdb97778bd44a01201/client/audio_engine.go#L98-L108
Audio sample / buffering logic: https://github.com/GRVYDEV/S.A.T.U.R.D.A.Y/blob/f6380bbd9e2c9ab17c68d7cdb97778bd44a01201/client/whisper_engine.go#L56-L61 and https://github.com/GRVYDEV/S.A.T.U.R.D.A.Y/blob/f6380bbd9e2c9ab17c68d7cdb97778bd44a01201/client/whisper_engine.go#L72-L91
Whisper inference: https://github.com/GRVYDEV/S.A.T.U.R.D.A.Y/blob/f6380bbd9e2c9ab17c68d7cdb97778bd44a01201/client/whisper.go#L37-L48
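For anyone who wants the gist of the buffering step without reading the repo, the window-with-overlap idea from the stream.cpp example can be sketched in Go roughly like this (sizes, names, and the exact retention policy here are illustrative, not the actual code in the links above):

```go
package main

import "fmt"

// SampleBuffer accumulates float32 PCM and emits fixed-size windows,
// keeping a trailing overlap so words at window boundaries are not cut off.
// windowSize and keepSize are in samples, e.g. 16000*3 and 16000/2 at 16 kHz.
type SampleBuffer struct {
	windowSize int
	keepSize   int
	samples    []float32
}

// Push appends new samples and returns a full window once one is ready,
// or nil if more audio is still needed.
func (b *SampleBuffer) Push(pcm []float32) []float32 {
	b.samples = append(b.samples, pcm...)
	if len(b.samples) < b.windowSize {
		return nil
	}
	window := make([]float32, b.windowSize)
	copy(window, b.samples[:b.windowSize])
	// Retain the last keepSize samples of the window (plus any leftover
	// audio beyond it) as context for the next inference window.
	b.samples = append(b.samples[:0:0], b.samples[b.windowSize-b.keepSize:]...)
	return window
}

func main() {
	buf := &SampleBuffer{windowSize: 4, keepSize: 2}
	fmt.Println(buf.Push([]float32{1, 2, 3})) // [] — not enough yet
	fmt.Println(buf.Push([]float32{4, 5}))    // [1 2 3 4]
}
```

Each returned window would then be handed to the Whisper inference call; the overlap means consecutive windows re-transcribe a short stretch of audio, which is also how the stream example avoids dropping words at the seams.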
Hi! Were you able to use a WebRTC stream as input to stream.cpp? I was trying to hack a WebRTC input stream into the stream.cpp code, but it's not clear to me how I should buffer it before passing it on to Whisper. The buffering logic you used in common-sdl.cpp seems to be very intertwined with SDL. Any help would be appreciated!
The whisper.cpp file is briefly described at https://github.com/ggerganov/whisper.cpp/issues/10