ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License
35.07k stars, 3.58k forks

WebRTC as input to stream.cpp #714

Open infinityp913 opened 1 year ago

infinityp913 commented 1 year ago

Hi! Were you able to use a WebRTC stream as an input to stream.cpp? I was trying to hack a WebRTC input stream into the stream.cpp code, but it's not clear to me how I should buffer it before passing it on to Whisper. The buffering logic you used in common-sdl.cpp seems to be tightly coupled to SDL. Any help would be appreciated!
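One way to decouple buffering from SDL is a plain ring buffer that any producer (a WebRTC callback, a socket reader) writes into and that the transcription loop reads the most recent samples from. The following is a minimal sketch in Go, not code from whisper.cpp; `RingBuffer`, `Write`, and `Last` are hypothetical names, and it assumes the incoming audio has already been converted to mono float32 at 16 kHz, which is the format whisper.cpp expects.

```go
package main

import (
	"fmt"
	"sync"
)

// RingBuffer keeps the most recent audio samples.
// Hypothetical helper for illustration; not part of whisper.cpp.
type RingBuffer struct {
	mu   sync.Mutex
	data []float32
	pos  int // index where the next sample will be written
	n    int // number of valid samples currently stored
}

// NewRingBuffer allocates a buffer; capacity is assumed to be > 0.
func NewRingBuffer(capacity int) *RingBuffer {
	return &RingBuffer{data: make([]float32, capacity)}
}

// Write appends samples, overwriting the oldest ones when full.
func (r *RingBuffer) Write(samples []float32) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, s := range samples {
		r.data[r.pos] = s
		r.pos = (r.pos + 1) % len(r.data)
		if r.n < len(r.data) {
			r.n++
		}
	}
}

// Last returns a copy of up to n of the most recent samples, oldest first.
func (r *RingBuffer) Last(n int) []float32 {
	r.mu.Lock()
	defer r.mu.Unlock()
	if n < 0 {
		n = 0
	}
	if n > r.n {
		n = r.n
	}
	out := make([]float32, n)
	start := (r.pos - n + len(r.data)) % len(r.data)
	for i := range out {
		out[i] = r.data[(start+i)%len(r.data)]
	}
	return out
}

func main() {
	rb := NewRingBuffer(5)
	rb.Write([]float32{1, 2, 3, 4, 5, 6, 7})
	fmt.Println(rb.Last(3)) // prints [5 6 7]
}
```

The mutex matters because the network callback and the inference loop usually run on different goroutines or threads.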

The whisper.cpp file is briefly described at https://github.com/ggerganov/whisper.cpp/issues/10

mab122 commented 1 year ago

Maybe you could use a different approach, without any modifications to the code, and just route the audio?

I did it using https://github.com/gavv/webrtc-cli. First, I created a pulseaudio virtual interface:

pactl load-module module-null-sink sink_name=vspeaker sink_properties=device.description=virtual_speaker

Then I started whisper's ./stream and webrtc-cli:

./webrtc-cli --answer

and followed the instructions from the demo webpage (copying the output of the CLI and pasting the answer into the webpage): https://gavv.net/webrtc-cli/

After that I connected everything together using pipewire/qpwgraph.

[Screenshot: webrtc-cli, whisper, and qpwgraph running]

Edit: maybe the intermediary pulseaudio device is not needed?

goharahmed commented 1 year ago

UPDATE: I found https://github.com/shirayu/whispering and I'm testing it by sending a PCMU stream to it. I'll see whether it works or needs some conversion.

@mab122 interesting approach. Does it limit usage to a single user, or can many virtual pulseaudio devices be created to trick stream.cpp?

I don't have as much C/C++ expertise as you guys, but I was able to modify the provided Golang bindings example so that it listens for an incoming WSS stream.

https://github.com/goharahmed/whisper.cpp/blob/WSSliveTranscriptions/bindings/go/examples/go-whisper/process.go

My problem is more about what kind of RTP packets I can send. I've started sending PCMU-encoded packets, but they need conversion before they can be pushed to the Process() function. Perhaps I should just stick with the C++ stream example and add a WSS interface to it. Any pointers or suggestions are welcome.
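For the PCMU case, the conversion amounts to G.711 µ-law decoding followed by resampling, since PCMU is normally 8 kHz while whisper.cpp expects 16 kHz mono float32. Below is a rough sketch, assuming the RTP payload bytes have already been extracted; `ulawToLinear`, `DecodePCMU`, and `Upsample2x` are hypothetical helper names, and the linear interpolation is a crude stand-in for a proper resampler with a low-pass filter.

```go
package main

import "fmt"

// ulawToLinear expands one G.711 µ-law byte to a 16-bit linear sample.
func ulawToLinear(u byte) int16 {
	u = ^u // µ-law bytes are stored bit-inverted
	exponent := (u >> 4) & 0x07
	mantissa := u & 0x0F
	sample := ((int(mantissa) << 3) + 0x84) << exponent
	sample -= 0x84 // remove the encoder bias
	if u&0x80 != 0 {
		return int16(-sample)
	}
	return int16(sample)
}

// DecodePCMU converts a PCMU payload to float32 samples in [-1, 1).
func DecodePCMU(payload []byte) []float32 {
	out := make([]float32, len(payload))
	for i, b := range payload {
		out[i] = float32(ulawToLinear(b)) / 32768.0
	}
	return out
}

// Upsample2x doubles the sample rate (e.g. 8 kHz -> 16 kHz) by inserting
// the midpoint between neighbouring samples. Crude, but enough for a sketch.
func Upsample2x(in []float32) []float32 {
	out := make([]float32, 0, 2*len(in))
	for i, s := range in {
		next := s
		if i+1 < len(in) {
			next = in[i+1]
		}
		out = append(out, s, (s+next)/2)
	}
	return out
}

func main() {
	// 160 bytes of µ-law silence = one 20 ms PCMU frame at 8 kHz.
	payload := make([]byte, 160)
	for i := range payload {
		payload[i] = 0xFF
	}
	pcm16k := Upsample2x(DecodePCMU(payload))
	fmt.Println(len(pcm16k), pcm16k[0])
}
```

The resulting slice could then be appended to whatever buffer feeds the transcription call.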

Akshay-akkay commented 1 year ago

I am also trying to do something similar, but I'm stuck at the audio byte conversion part.
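For reference, the byte conversion usually comes down to reinterpreting little-endian PCM bytes as float32 samples. The sketch below assumes the raw bytes are either signed 16-bit little-endian (s16le) or 32-bit float little-endian (f32le); the function names are made up for illustration, and other wire formats would need a different decoder.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// S16LEToFloat32 converts signed 16-bit little-endian PCM bytes to
// float32 samples in [-1, 1). A trailing odd byte is ignored.
func S16LEToFloat32(b []byte) []float32 {
	n := len(b) / 2
	out := make([]float32, n)
	for i := 0; i < n; i++ {
		v := int16(binary.LittleEndian.Uint16(b[2*i:]))
		out[i] = float32(v) / 32768.0
	}
	return out
}

// F32LEToFloat32 reinterprets 32-bit float little-endian PCM bytes as
// float32 samples. Trailing bytes that do not fill a sample are ignored.
func F32LEToFloat32(b []byte) []float32 {
	n := len(b) / 4
	out := make([]float32, n)
	for i := 0; i < n; i++ {
		out[i] = math.Float32frombits(binary.LittleEndian.Uint32(b[4*i:]))
	}
	return out
}

func main() {
	s16 := []byte{0x00, 0x80, 0x00, 0x00, 0x00, 0x40} // -32768, 0, 16384
	fmt.Println(S16LEToFloat32(s16))

	f32 := []byte{0x00, 0x00, 0x80, 0x3F} // 1.0 as IEEE 754 little-endian
	fmt.Println(F32LEToFloat32(f32))
}
```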

GRVYDEV commented 1 year ago

I'm currently building something similar to this using the Golang bindings. What I have so far decodes incoming Opus RTP packets to PCM f32le. I am going to start working on buffering the PCM and sending it to Whisper. I'll update here as I make progress. FWIW, I plan to give a talk about this in June and will open source the entire project as well.

GRVYDEV commented 1 year ago

Okay, so I have managed to get this working. Currently what I have is a WebRTC client that connects to an SFU, decodes Opus packets to PCM, and then buffers and samples that audio in a similar way to the stream.cpp example. This is in a very rough state right now, but I wanted to open source it so others can use it as a reference. As a note, this is meant to run on localhost for now. I have no idea how this will behave under packet loss, and I have not yet added a jitter buffer. I will be working on this pretty extensively over the next couple of weeks. Feel free to ask any questions you may have. I will link relevant parts of the code below.

Receiving the RTP Packets: https://github.com/GRVYDEV/S.A.T.U.R.D.A.Y/blob/f6380bbd9e2c9ab17c68d7cdb97778bd44a01201/client/peer_connection.go#L56-L74

Decoding the packets to f32le PCM: https://github.com/GRVYDEV/S.A.T.U.R.D.A.Y/blob/f6380bbd9e2c9ab17c68d7cdb97778bd44a01201/client/audio_engine.go#L80-L93 and https://github.com/GRVYDEV/S.A.T.U.R.D.A.Y/blob/f6380bbd9e2c9ab17c68d7cdb97778bd44a01201/client/audio_engine.go#L98-L108

Audio sample / buffering logic: https://github.com/GRVYDEV/S.A.T.U.R.D.A.Y/blob/f6380bbd9e2c9ab17c68d7cdb97778bd44a01201/client/whisper_engine.go#L56-L61 and https://github.com/GRVYDEV/S.A.T.U.R.D.A.Y/blob/f6380bbd9e2c9ab17c68d7cdb97778bd44a01201/client/whisper_engine.go#L72-L91

Whisper inference: https://github.com/GRVYDEV/S.A.T.U.R.D.A.Y/blob/f6380bbd9e2c9ab17c68d7cdb97778bd44a01201/client/whisper.go#L37-L48
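As a rough illustration of the "buffer and sample" step described above, here is a generic sliding-window sketch. It is not the code from the linked repository: `Windower` and its fields are hypothetical, and the step and window sizes in `main` are example values. It assumes `0 < step <= length` and mono float32 input at 16 kHz. Each time `step` new samples have arrived, it emits a window made of the tail of the previous window plus the new samples, so consecutive inference calls overlap.

```go
package main

import "fmt"

// Windower turns a continuous sample stream into overlapping windows.
// Hypothetical helper for illustration only.
type Windower struct {
	step    int       // new samples required before a window is emitted
	length  int       // maximum window size (context + new samples)
	pending []float32 // samples received but not yet emitted
	history []float32 // the last emitted window, reused as context
}

// NewWindower assumes 0 < step <= length.
func NewWindower(step, length int) *Windower {
	return &Windower{step: step, length: length}
}

// Push adds samples and returns every window that became ready.
func (w *Windower) Push(samples []float32) [][]float32 {
	w.pending = append(w.pending, samples...)
	var windows [][]float32
	for len(w.pending) >= w.step {
		keep := w.length - w.step
		if keep > len(w.history) {
			keep = len(w.history)
		}
		window := make([]float32, 0, keep+w.step)
		window = append(window, w.history[len(w.history)-keep:]...)
		window = append(window, w.pending[:w.step]...)
		windows = append(windows, window)
		w.history = window
		// Shift the remaining samples to the front of the pending buffer.
		w.pending = append(w.pending[:0], w.pending[w.step:]...)
	}
	return windows
}

func main() {
	const sampleRate = 16000
	// Example values: emit every 3 s, with up to 10 s of audio per window.
	w := NewWindower(3*sampleRate, 10*sampleRate)
	for _, win := range w.Push(make([]float32, 7*sampleRate)) {
		fmt.Println(len(win)) // each window would be handed to the inference call
	}
}
```

A jitter buffer, as mentioned in the comment, would sit in front of this stage and reorder or conceal lost packets before samples are pushed.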