ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Server example? #1369

Azeirah opened this issue 1 year ago

Azeirah commented 1 year ago

I'm working on a voice-controlled application and I want to run small .wav files through whisper fairly often.

What I noticed is that loading the model takes almost 50% of the total time, every single time I run ./main -m ... "my-short-spoken-command.wav".

I think it'd be nice if, like llama.cpp, this project included a server example, so the model only has to be loaded once and stays in memory afterwards.

Azeirah commented 1 year ago

For what it's worth, I already have a very rudimentary server example working. It's a bit of a Frankenstein copy-paste job of whisper/examples/main and llama.cpp/examples/server/server.cpp, but it works. I'm not great at C++ whatsoever, so I was happy to be able to copy and paste almost everything from those two examples.

It supports configuring the server in exactly the same way as the llama server, and it supports these (untested) params:

    int32_t n_threads     = std::min(12, (int32_t) std::thread::hardware_concurrency()); // threads to use during computation
    int32_t n_processors  = 1;  // number of processors to use
    int32_t offset_t_ms   = 0;  // time offset in milliseconds
    int32_t offset_n      = 0;  // segment index offset
    int32_t duration_ms   = 0;  // duration of audio to process in milliseconds (0 = all)
    int32_t progress_step = 5;  // progress output step in percent
    int32_t max_context   = -1; // maximum number of text context tokens to store (-1 = all)
    int32_t max_len       = 0;  // maximum segment length in characters (0 = no limit)
    int32_t best_of       = 2;  // number of best candidates to keep
    int32_t beam_size     = -1; // beam size for beam search (-1 = disabled)
    std::string model     = "models/ggml-base.en.bin"; // path to the ggml model file

It doesn't support diarization, language selection, or any of the output options; my goal was just to get a working server for my own application.

Anyone interested in a PR?

FSSRepo commented 1 year ago

Maybe, when I finish working on optimizing stable-diffusion.cpp and adding a server to it, I could create a server example for whisper.cpp.

Azeirah commented 1 year ago

I posted the code as a PR: https://github.com/ggerganov/whisper.cpp/pull/1375

ggerganov commented 1 year ago

Hey all, I notice several server examples being proposed. This is super cool!

I'm planning to do a major update to whisper.cpp in the next few days, bringing some new features and performance improvements. This will be the highest priority, so to avoid distractions, the server examples will have to wait until we finish the new release. Sorry for the delay.

ggerganov commented 1 year ago

Hi again! I think we should restart the server efforts now that v1.5.0 is released.

I like both #1375 and #1380, so I'm not sure how to decide which one to integrate. @Azeirah and @felrock (and others): do you have any opinion on this?

Also, I think we should aim to support the OpenAI Audio API for speech-to-text: https://platform.openai.com/docs/api-reference/audio

The approach in #1418 is also interesting, so it can be merged as an alternative solution to the REST-based server example.
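
For reference, OpenAI's transcription endpoint takes a multipart form with a file and a model field, so a compatible server would need to accept requests shaped roughly like this (a sketch only; the file name is illustrative):

    curl https://api.openai.com/v1/audio/transcriptions \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -F model="whisper-1" \
      -F file="@my-short-spoken-command.wav"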

felrock commented 1 year ago

Hello! I'm keen on fixing and merging my changes for the server. I've seen that the server in llama.cpp has enabled projects such as ollama and others. So I think it's an important application to have, one that users can easily build interfaces against.

I have also started to create a similar server solution for bark.cpp, because in my use case I would like to have some sort of voice (a bit more granular than espeak). That would complete the full LLM robot: a brain (llama.cpp), ears (whisper.cpp), and a voice (bark.cpp).

ggerganov commented 1 year ago

Yup, I agree that a server can find many interesting applications.

That would complete the full LLM robot: a brain (llama.cpp), ears (whisper.cpp), and a voice (bark.cpp).

Yes! Great idea - we are getting close :)

colinator commented 1 year ago

Also agree. To hawk my proposal #1418 (that fork is a bit messy, but something like it): I think it'd be really great to have the ability to create many types of servers. For instance, I might want a gRPC server, or a REST server, or a ROS pub-sub node. Likewise, many types of encodings for the result: maybe JSON, maybe BSON, maybe protobuf, etc. I think it'd require very little refactoring - basically just the core stream server as a class with an infer(audio data*) method. Happy to help!
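
To make the idea concrete, here is a minimal sketch of that core class using the public whisper.h API (the class and method names are hypothetical, not taken from the fork):

    // Hypothetical core engine: load the model once, then let any transport
    // (REST, gRPC, ROS pub-sub, ...) wrap the infer() call and pick its own
    // result encoding (JSON, BSON, protobuf, ...).
    #include "whisper.h"

    #include <string>
    #include <vector>

    class WhisperEngine {
    public:
        explicit WhisperEngine(const std::string & model_path)
            : ctx(whisper_init_from_file(model_path.c_str())) {}

        ~WhisperEngine() { whisper_free(ctx); }

        // 16 kHz mono float PCM in, transcript out
        std::string infer(const std::vector<float> & pcmf32) {
            whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
            if (whisper_full(ctx, wparams, pcmf32.data(), (int) pcmf32.size()) != 0) {
                return "";
            }
            std::string text;
            for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
                text += whisper_full_get_segment_text(ctx, i);
            }
            return text;
        }

    private:
        whisper_context * ctx = nullptr;
    };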

nortekax commented 1 year ago

That would complete the full LLM robot: a brain (llama.cpp), ears (whisper.cpp), and a voice (bark.cpp).

The best I found for the voice is https://github.com/rhasspy/piper - it has a nice-sounding voice and it is faster than bark.

ggerganov commented 1 year ago

A first pass of a server example has been merged (#1380).

Looks like streaming and diarization are the 2 most requested features for the server. Not sure if we can do something meaningful for diarization, but we should be able to provide a streaming API relatively easily.
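
If it follows the llama.cpp server conventions, trying it out should look roughly like this (endpoint and flags may differ slightly - check the example's README; the file name is illustrative):

    # start the server once, keeping the model in memory
    ./server -m models/ggml-base.en.bin --host 127.0.0.1 --port 8080

    # then transcribe a wav by POSTing it as multipart form data
    curl 127.0.0.1:8080/inference \
      -H "Content-Type: multipart/form-data" \
      -F file="@my-short-spoken-command.wav"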

felrock commented 1 year ago

I left the diarization parameters in there, so it might be working; I didn't know how it works or how to test it.

nortekax commented 1 year ago

The server works well, but when speech is short, like "lights on", "lights off", etc., it doesn't produce any text.

I suspect whisper.cpp needs a long context, because the command example asks for a long sentence first before it can work properly. The command example tells the user:

process_general_transcription: Say the following phrase: 'Ok Whisper, start listening for commands.'

I think a way to provide a context for the server (like command does) would be useful for agents that need short commands, like "lights on", "lights off", etc.

felrock commented 1 year ago

OK, I've tried sending single-word .wav files to the server and had it respond with the correct word. Did you try using the prompt flag? It should do something similar to what you describe.

ggerganov commented 1 year ago

It should work with short audio too. The prompt can help in some situations to make the transcript more robust, but is not required in general.
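
For example, with main the initial prompt can be passed on the command line to bias the decoder toward expected phrases (the prompt text here is just an illustration):

    ./main -m models/ggml-base.en.bin -f lights-on.wav --prompt "Home automation commands: lights on, lights off."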

Azeirah commented 1 year ago

The server works well, but when speech is short, like "lights on", "lights off", etc., it doesn't produce any text.

I suspect whisper.cpp needs a long context, because the command example asks for a long sentence first before it can work properly. The command example tells the user:

process_general_transcription: Say the following phrase: 'Ok Whisper, start listening for commands.'

I think a way to provide a context for the server (like command does) would be useful for agents that need short commands, like "lights on", "lights off", etc.

I haven't tested this specific server implementation, but the server implementation I was using previously definitely did work with short commands, I specifically made it for that purpose.

So either this server implementation behaves differently, or something is off with the audio itself.

Have you tried it with longer audio?

nortekax commented 1 year ago

Have you tried it with longer audio?

Longer audio always works well; the same problem that happens with server also happens with main. Only command handles the short voice commands well, and only after you say the long sentence that command first asks you to say.

Edit: I am using sox to make the wav file. @Azeirah, what do you use to make the wav file?

nortekax commented 1 year ago

From @felrock

OK, I've tried sending single-word .wav files to the server and had it respond with the correct word. Did you try using the prompt flag? It should do something similar to what you describe.

From @ggerganov

It should work with short audio too. The prompt can help in some situations to make the transcript more robust, but is not required in general.

I am using the following (based on https://stackoverflow.com/questions/30006609/using-sox-for-voice-detection-and-streaming) to generate the wav:

sox -q -c 1 -r 16000 -d  -b 16 -r 16000 "$outwav"  silence 1 0.3 1% 1 0.3 1%

I play back the wav and it sounds okay. How do you generate your wavs?

felrock commented 1 year ago

I'm using a Python script which uses PyAudio. I record about three seconds per .wav file.

nortekax commented 1 year ago

Thanks, how do you detect voice?

nortekax commented 1 year ago

Okay, I solved the problem. Posting it here for those interested: I simply padded the wav file with 500 ms of silence at the beginning and 1 second of silence at the end, and everything works fine now.
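
For reference, the padding can be done with sox's pad effect (the output file name is illustrative):

    # prepend 0.5 s of silence and append 1 s of silence
    sox "$outwav" "padded.wav" pad 0.5 1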

nortekax commented 1 year ago

It would be really useful to add the grammar/commands.txt functionality from the command example to the server.

ggerganov commented 1 year ago

Ah, sorry about that - I forgot there is logic that ignores sub-one-second audio:

https://github.com/ggerganov/whisper.cpp/blob/a5881d619c8440c0e0f226be15fc5ab0eec5f9bb/whisper.cpp#L5193-L5198
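
In essence, the check skips processing whenever the input is shorter than one second (paraphrased here; the permalink above has the exact code):

    // frames are 10 ms each, so fewer than 100 frames means less than 1.0 s
    // of audio, and whisper_full returns without producing any segments
    if (seek_end < seek_start + 100) {
        return 0;
    }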

bobqianic commented 11 months ago

Ah, sorry about that - I forgot there is logic that ignores sub-one-second audio:

https://github.com/ggerganov/whisper.cpp/blob/a5881d619c8440c0e0f226be15fc5ab0eec5f9bb/whisper.cpp#L5193-L5198

BTW, why do we need this logic here?

See #1603