ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Transcribing audio files to text goes into an infinite loop for audio with multiple languages #2356

Closed: RusticRoman closed this 2 months ago

RusticRoman commented 2 months ago

Hello:

I ran into an issue running whisper.cpp on audio with multiple languages (English and Hebrew). For about 26 minutes it works fine, but then it goes into an infinite loop and outputs the same text over and over. This is easily reproducible.

I am running on a GCP A100 instance, where I compiled and ran whisper.cpp. This is how I compiled:

sudo apt install ccache
GGML_CUDA=1 make -j4
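
For reference, a fuller sketch of the same CUDA build, assuming a fresh clone (the GGML_CUDA=1 make path matches the whisper.cpp README of that era):

# optional compiler cache, then build ./main with CUDA enabled
sudo apt install ccache
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
GGML_CUDA=1 make -j4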

Here's the audio file: https://www.softfinity.com/output.wav
Here's the gguf I am using: whisper-large-v3-f16.gguf
Here's the script that I ran:

#!/bin/bash

start_time=$(date +%s)
echo "Starting whisper-large-v3-q8_0.gguf on output.wav at $(date +%Y-%m-%d\ %H:%M:%S)"
./main -m whisper-large-v3-q8_0.gguf -f output.wav
end_time=$(date +%s)
echo "Finished whisper-large-v3-q8_0.gguf on output.wav at $(date +%Y-%m-%d\ %H:%M:%S)"
echo "Time elapsed: $((end_time - start_time)) seconds"
echo ""

start_time=$(date +%s)
echo "Starting whisper-large-v3-f16.gguf on output.wav at $(date +%Y-%m-%d\ %H:%M:%S)"
./main -m whisper-large-v3-f16.gguf -f output.wav
end_time=$(date +%s)
echo "Finished whisper-large-v3-f16.gguf on output.wav at $(date +%Y-%m-%d\ %H:%M:%S)"
echo "Time elapsed: $((end_time - start_time)) seconds"
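
The same measurement can also be written as a loop over the two models; a sketch that preserves the script's behavior:

#!/bin/bash
# time ./main on output.wav for each model, same output as the script above
for model in whisper-large-v3-q8_0.gguf whisper-large-v3-f16.gguf; do
    start_time=$(date +%s)
    echo "Starting $model on output.wav at $(date +%Y-%m-%d\ %H:%M:%S)"
    ./main -m "$model" -f output.wav
    end_time=$(date +%s)
    echo "Finished $model on output.wav at $(date +%Y-%m-%d\ %H:%M:%S)"
    echo "Time elapsed: $((end_time - start_time)) seconds"
    echo ""
done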

----- this is what I see -----

Starting whisper-large-v3-q8_0.gguf on output.wav at 2024-08-15 18:16:58
whisper_init_from_file_with_params_no_state: loading model from 'whisper-large-v3-q8_0.gguf'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0, VMM: yes
whisper_model_load: CUDA0 total size = 3094.36 MB
whisper_model_load: model size       = 3094.36 MB
whisper_backend_init_gpu: using CUDA backend
whisper_init_state: kv self size  = 251.66 MB
whisper_init_state: kv cross size = 251.66 MB
whisper_init_state: kv pad  size  =   7.86 MB
whisper_init_state: compute buffer (conv)   =  36.13 MB
whisper_init_state: compute buffer (encode) = 926.53 MB
whisper_init_state: compute buffer (cross)  =   9.25 MB
whisper_init_state: compute buffer (decode) = 215.82 MB

system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | CANN = 0

main: processing 'output.wav' (76719473 samples, 4795.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:10.440] Okay, so I just want to reiterate quickly what the message that I left you.
[00:00:10.440 --> 00:00:13.980] Baruch Hashem, everybody did very, very well.
[00:00:13.980 --> 00:00:21.280] You showed a real, you were comfortable with the subject, which is really the, you know,
[00:00:21.280 --> 00:00:26.300] when you're talking about being introduced to so many, to basically a whole different language,
....
[00:23:43.780 --> 00:23:44.780] The meat is part of the usher.
[00:23:44.780 --> 00:23:45.780] When I ask you why you're not allowed to eat this meat, because the usher of eating the
[00:23:45.780 --> 00:23:46.780] meat is baser b'chalef.
[00:23:46.780 --> 00:23:47.780] The usher isn't chalef.
[00:23:47.780 --> 00:23:48.780] The usher is the baser b'chalef.
[00:23:48.780 --> 00:23:49.780] So the meat is part of the usher.
[00:23:49.780 --> 00:23:50.780] So that's why you have to obviously have 60 against the usher in this case.
[00:23:50.780 --> 00:23:51.780] But the meat is part of the usher.
[00:23:51.780 --> 00:23:52.780] So the meat is part of the usher.
[00:23:52.780 --> 00:23:53.780] So the meat is part of the usher.
[00:23:53.780 --> 00:23:54.780] So the meat is part of the usher.
[00:23:54.780 --> 00:23:55.780] So the meat is part of the usher.
[00:23:55.780 --> 00:23:56.780] So the meat is part of the usher.
[00:23:56.780 --> 00:23:57.780] So the meat is part of the usher.
[00:23:57.780 --> 00:23:58.780] So the meat is part of the usher.
[00:23:58.780 --> 00:23:59.780] So the meat is part of the usher.
[00:23:59.780 --> 00:24:00.780] So the meat is part of the usher.
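
For context (not from the original report): the run above forced lang = en on mixed English/Hebrew audio, and whisper.cpp's main exposes decoding flags that often mitigate repetition loops like this. A sketch; the flags exist in ./main --help, and the entropy value is illustrative:

# -l auto   auto-detect the language instead of forcing English
# -mc 0     --max-context 0: don't condition on previously decoded text,
#           a common workaround for repetition loops
# -et 2.8   --entropy-thold: segments whose token entropy falls below this
#           are retried at a higher temperature (default is 2.40)
./main -m whisper-large-v3-q8_0.gguf -f output.wav -l auto -mc 0 -et 2.8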

RusticRoman commented 2 months ago

This is the tech stack used:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:00:04.0 Off |                    0 |
| N/A   30C    P0             42W /  400W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

RusticRoman commented 2 months ago

On a V100 machine whisper just stalls; here are the card's parameters:

(base) romankagan@instance-20240816-222426:~/whisper.cpp$ nvidia-smi
Sat Aug 17 00:02:47 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                      Off  |   00000000:00:04.0 Off |                    0 |
| N/A   77C    P0             71W /   70W |    3647MiB /  15360MiB |     70%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     36946      C   ./main                                       3644MiB |
+-----------------------------------------------------------------------------------------+
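
Not in the report itself, but a quick way to check whether such a stall is GPU-side is a CPU-only rerun with main's -ng / --no-gpu flag (a sketch; expect a much slower run):

# force CPU-only inference to rule out a GPU-side hang
./main -m whisper-large-v3-f16.gguf -f output.wav -ng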

RusticRoman commented 2 months ago

Not an issue. I used a gguf file instead of a bin file for the model; when I use the bin file, everything works correctly.
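
For reference, the ggml-format .bin models that the examples expect can be fetched with the repo's download script; a minimal sketch, assuming large-v3 is the desired model:

# download models/ggml-large-v3.bin and run it instead of the gguf
bash ./models/download-ggml-model.sh large-v3
./main -m models/ggml-large-v3.bin -f output.wav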