ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Android mic detects low-decibel sound and triggers VAD repeatedly in "./command -m models/ggml-tiny.bin -t 8 -ac 768" #1149

Open trappedinspacetime opened 1 year ago

trappedinspacetime commented 1 year ago

First of all, thank you Georgi Gerganov and everyone who contributed to this project. I have a progressive neuromuscular disease and can barely use my hands. I bought a new Android phone to make my life easier. It has 4GB+4GBVRAM. I tried using the "Hey Google" voice assistant together with "Google Voice Access". I am not a native English speaker, and "Hey Google" is missing some features in my language. It cannot hang up when I accidentally call someone. It has many weak points and bugs, and it only works online with internet access.

I tested "whisper.cpp" with "./command -m models/ggml-tiny.bin -t 8 -ac 768" on Ubuntu 22.04, where it works well. I also managed to build it on my Android phone. There it launches without errors, but it repeatedly prints:

    process_general_transcription: Say the following phrase: 'Ok Whisper, start listening for commands.'

It does this without giving me a chance to say the phrase "Ok Whisper, start listening for commands." The full output is:

    ./command -m models/ggml-tiny.bin -t 8 -ac 768
    whisper_init_from_file_no_state: loading model from 'models/ggml-tiny.bin'
    whisper_model_load: loading model
    whisper_model_load: n_vocab       = 51865
    whisper_model_load: n_audio_ctx   = 1500
    whisper_model_load: n_audio_state = 384
    whisper_model_load: n_audio_head  = 6
    whisper_model_load: n_audio_layer = 4
    whisper_model_load: n_text_ctx    = 448
    whisper_model_load: n_text_state  = 384
    whisper_model_load: n_text_head   = 6
    whisper_model_load: n_text_layer  = 4
    whisper_model_load: n_mels        = 80
    whisper_model_load: ftype         = 1
    whisper_model_load: qntvr         = 0
    whisper_model_load: type          = 1
    whisper_model_load: mem required  =  201.00 MB (+    3.00 MB per decoder)
    whisper_model_load: adding 1608 extra tokens
    whisper_model_load: model ctx     =   73.62 MB
    whisper_model_load: model size    =   73.54 MB
    whisper_init_state: kv self size  =    2.62 MB
    whisper_init_state: kv cross size =    8.79 MB

    main: processing, 8 threads, lang = en, task = transcribe, timestamps = 0 ...

    init: found 0 capture devices:
    init: attempt to open default capture device ...
    init: obtained spec for input device (SDL Id = 2):
    init:     - sample rate:       16000
    init:     - format:            33056 (required: 33056)
    init:     - channels:          1 (required: 1)
    init:     - samples per frame: 1024

    process_general_transcription: general-purpose mode

    process_general_transcription: Say the following phrase: 'Ok Whisper, start listening for commands.'

    process_general_transcription: Speech detected! Processing ...
    process_general_transcription: Heard '.', (t = 2548 ms)
    process_general_transcription: WARNING: prompt not recognized, try again

    process_general_transcription: Say the following phrase: 'Ok Whisper, start listening for commands.'

    process_general_transcription: Speech detected! Processing ...
    process_general_transcription: Heard '[ Silence ]', (t = 2307 ms)
    process_general_transcription: WARNING: prompt not recognized, try again

    process_general_transcription: Say the following phrase: 'Ok Whisper, start listening for commands.'

    process_general_transcription: Speech detected! Processing ...
    process_general_transcription: Heard '[MUSIC PLAYING]', (t = 2255 ms)
    process_general_transcription: WARNING: prompt not recognized, try again

    process_general_transcription: Say the following phrase: 'Ok Whisper, start listening for commands.'

    process_general_transcription: Speech detected! Processing ...
    ^Xprocess_general_transcription: Heard 'You', (t = 2229 ms)
    process_general_transcription: WARNING: prompt not recognized, try again

    process_general_transcription: Say the following phrase: 'Ok Whisper, start listening for commands.'

    process_general_transcription: Speech detected! Processing ...
    ^Cprocess_general_transcription: Heard '[no audio]', (t = 2306 ms)
    process_general_transcription: WARNING: prompt not recognized, try again

    process_general_transcription: Say the following phrase: 'Ok Whisper, start listening for commands.'

    whisper_print_timings:     load time =   657.88 ms
    whisper_print_timings:     fallbacks =   0 p /   0 h
    whisper_print_timings:      mel time =  1193.62 ms
    whisper_print_timings:   sample time =   365.44 ms /   163 runs (    2.24 ms per run)
    whisper_print_timings:   encode time =  6374.13 ms /     5 runs ( 1274.83 ms per run)
    whisper_print_timings:   decode time =  3695.82 ms /   153 runs (   24.16 ms per run)
    whisper_print_timings:    total time = 19341.23 ms

I plan to use it to end phone calls and for other tasks. Could you please guide me?

bobqianic commented 1 year ago

It sounds like there may be some bugs in command.cpp. I will check it out.

https://github.com/ggerganov/whisper.cpp/blob/a792c4079ce61358134da4c9bc589c15a03b04ad/examples/command/command.cpp#L490-L596

trappedinspacetime commented 1 year ago

@bobqianic Thank you for responding. Could it be related to the VAD module?

trappedinspacetime commented 1 year ago

Any progress?

ggerganov commented 1 year ago

@trappedinspacetime

You can try adjusting the VAD-related parameters:

    -vth N,     --vad-thold N    [0.60   ] voice activity detection threshold
    -fth N,     --freq-thold N   [100.00 ] high-pass frequency cutoff

The default values are probably not suitable for your setup.
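To make the two parameters concrete, here is a minimal sketch of a relative-energy voice activity check of the kind these options suggest. It is an illustration under assumptions, not code taken from command.cpp: the function names `high_pass` and `speech_ended` are hypothetical, the 1000 ms tail window used in the usage note is an assumed value, and the real example may differ in detail. The idea is that `--freq-thold` sets the cutoff of a high-pass filter that removes low-frequency rumble, and `--vad-thold` is the ratio that the average level of the last part of the buffer must fall below, relative to the whole buffer, for the buffer to be handed to the recognizer.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: first-order (RC) high-pass filter, applied in place.
// cutoff_hz plays the role of --freq-thold.
void high_pass(std::vector<float> & pcm, float cutoff_hz, float sample_rate) {
    if (pcm.empty()) return;
    const float pi    = 3.14159265358979f;
    const float rc    = 1.0f / (2.0f * pi * cutoff_hz);
    const float dt    = 1.0f / sample_rate;
    const float alpha = rc / (rc + dt);

    float prev_in  = pcm[0];
    float prev_out = pcm[0];
    for (size_t i = 1; i < pcm.size(); ++i) {
        const float in = pcm[i];
        prev_out = alpha * (prev_out + in - prev_in);
        prev_in  = in;
        pcm[i]   = prev_out;
    }
}

// Hypothetical helper: returns true when the last `last_ms` milliseconds are
// quiet relative to the whole buffer, i.e. the speaker appears to have stopped.
// vad_thold plays the role of --vad-thold. The buffer is taken by value so the
// caller's samples are not modified by the filter.
bool speech_ended(std::vector<float> pcm, int sample_rate, int last_ms,
                  float vad_thold, float freq_thold) {
    assert(sample_rate > 0 && last_ms > 0);

    const size_t n      = pcm.size();
    const size_t n_last = (size_t) sample_rate * (size_t) last_ms / 1000;
    if (n_last == 0 || n_last >= n) return false; // not enough audio to compare

    if (freq_thold > 0.0f) high_pass(pcm, freq_thold, (float) sample_rate);

    // Mean absolute amplitude of the whole buffer and of its tail.
    double energy_all  = 0.0;
    double energy_last = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double a = std::fabs(pcm[i]);
        energy_all += a;
        if (i >= n - n_last) energy_last += a;
    }
    energy_all  /= (double) n;
    energy_last /= (double) n_last;

    return energy_last <= vad_thold * energy_all;
}
```

With 16 kHz audio, a call such as `speech_ended(buffer, 16000, 1000, 0.6f, 100.0f)` mirrors the default values listed above. Note that the test in this sketch is relative: on a steady signal the tail and the whole buffer have about the same level, so a threshold below 1.0 reports true only when the tail is noticeably quieter, while a threshold above 1.0 reports true for every buffer. No absolute loudness is involved, which means very quiet background noise can satisfy the test as easily as real speech.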

trappedinspacetime commented 1 year ago

@ggerganov Thank you for responding. I tried several -vth values (0.4, 0.8, 0.9, 1.4, 1.7), but nothing changed.

VJJJJJJ1 commented 7 months ago

@trappedinspacetime I have the same problem:

    process_general_transcription: Speech detected! Processing ...
    process_general_transcription: Heard 'you', (t = 2132 ms, p = 72.20%)
    process_general_transcription: WARNING: prompt not recognized, try again

Have you solved this problem?

trappedinspacetime commented 7 months ago

Unfortunately, no. I hope somebody finds a solution.

Iskander14yo commented 5 months ago

This happened to me when I used my Mac with an external monitor. When the Mac is closed, the microphone does not pick up any sound (I guess), and because of that (I guess) the program hears a spurious "you". Try using a separate microphone (in my case, AirPods worked well).
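One way to test the guess that the microphone is delivering near-silence is to measure the level of a captured buffer before it reaches the recognizer. The sketch below is only an illustration: `rms_dbfs` and `is_effectively_silent` are hypothetical helper names, the -60 dBFS floor is an assumed value that would need tuning per device, and none of this is part of whisper.cpp. It assumes float PCM samples in the range [-1, 1].

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: RMS level of float PCM samples in dBFS, where 0 dBFS
// corresponds to a constant full-scale amplitude of 1.0.
// Returns -infinity for an empty or all-zero buffer.
double rms_dbfs(const std::vector<float> & pcm) {
    double sum_sq = 0.0;
    for (float s : pcm) sum_sq += (double) s * (double) s;
    if (pcm.empty() || sum_sq == 0.0) {
        return -std::numeric_limits<double>::infinity();
    }
    // 10*log10(mean square) equals 20*log10(RMS).
    return 10.0 * std::log10(sum_sq / (double) pcm.size());
}

// Hypothetical gate: treat the buffer as silence when its level is below an
// assumed floor. The default of -60 dBFS is a guess, not a measured value.
bool is_effectively_silent(const std::vector<float> & pcm, double floor_dbfs = -60.0) {
    assert(floor_dbfs < 0.0); // the floor must sit below full scale
    return rms_dbfs(pcm) < floor_dbfs;
}
```

Logging `rms_dbfs` for each captured buffer would show whether the built-in microphone and a separate microphone differ in level, and a gate like `is_effectively_silent` could be used to skip buffers that contain no usable signal.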