ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++

Zero-filled WAV gives hallucination and wrong duration #1881

Open ukolovda opened 7 months ago

ukolovda commented 7 months ago

I tried to process a WAV file whose data section is all zeroes. The file duration is 1.2 seconds (attached below).

Whisper.cpp produces a hallucination (and a wrong segment duration).

zeroes.zip
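
For anyone who cannot grab the attachment, an equivalent test file can be regenerated from scratch. A minimal sketch (assuming a little-endian host; 16 kHz, 16-bit mono PCM, 19200 zero samples, matching the parameters printed in the logs below):

// make_zeroes.cpp: writes a 1.2 s, 16 kHz, 16-bit mono WAV whose data
// section is all zeroes (19200 samples), approximating the attached file.
#include <cstdint>
#include <fstream>
#include <vector>

int main() {
    const uint32_t sample_rate = 16000;
    const uint32_t n_samples   = 19200;   // 1.2 seconds at 16 kHz
    const uint16_t channels    = 1;
    const uint16_t bits        = 16;
    const uint32_t data_size   = n_samples * channels * (bits / 8);

    std::ofstream f("zeroes.wav", std::ios::binary);
    auto w16 = [&](uint16_t v) { f.write(reinterpret_cast<const char *>(&v), 2); };
    auto w32 = [&](uint32_t v) { f.write(reinterpret_cast<const char *>(&v), 4); };

    // Standard 44-byte RIFF/WAVE header for uncompressed PCM.
    f.write("RIFF", 4); w32(36 + data_size); f.write("WAVE", 4);
    f.write("fmt ", 4); w32(16); w16(1 /* PCM */); w16(channels);
    w32(sample_rate); w32(sample_rate * channels * (bits / 8));
    w16(channels * (bits / 8)); w16(bits);
    f.write("data", 4); w32(data_size);

    // The all-zero data section that triggers the hallucination.
    std::vector<char> zeroes(data_size, 0);
    f.write(zeroes.data(), zeroes.size());
    return 0;
}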

$ ./main -m ./models/ggml-large-v3.bin -l ru --threads 8 -mc 0 samples/zeroes.wav

whisper_init_from_file_with_params_no_state: loading model from './models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:    CUDA0 total size =  3094.36 MB
whisper_model_load: model size    = 3094.36 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   36.26 MB
whisper_init_state: compute buffer (encode) =  926.66 MB
whisper_init_state: compute buffer (cross)  =    9.38 MB
whisper_init_state: compute buffer (decode) =  209.26 MB

system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/zeroes.wav' (19200 samples, 1.2 sec), 8 threads, 1 processors, 5 beams + best of 5, lang = ru, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:29.980]   Продолжение следует... ("To be continued...")

whisper_print_timings:     load time =   685.11 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     4.86 ms
whisper_print_timings:   sample time =    24.48 ms /    79 runs (    0.31 ms per run)
whisper_print_timings:   encode time =   120.78 ms /     1 runs (  120.78 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   323.14 ms /    77 runs (    4.20 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1164.00 ms
$ ./main -m ./models/ggml-large-v2.bin -l ru --threads 8 -mc 0 samples/zeroes.wav
whisper_init_from_file_with_params_no_state: loading model from './models/ggml-large-v2.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:    CUDA0 total size =  3093.99 MB
whisper_model_load: model size    = 3093.99 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   34.82 MB
whisper_init_state: compute buffer (encode) =  926.66 MB
whisper_init_state: compute buffer (cross)  =    9.38 MB
whisper_init_state: compute buffer (decode) =  209.26 MB

system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/zeroes.wav' (19200 samples, 1.2 sec), 8 threads, 1 processors, 5 beams + best of 5, lang = ru, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:04.000]   Редактор субтитров А.Семкин Корректор А.Егорова ("Subtitle editor A. Semkin, corrector A. Egorova")

whisper_print_timings:     load time =  2376.23 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =     5.14 ms
whisper_print_timings:   sample time =    50.08 ms /   152 runs (    0.33 ms per run)
whisper_print_timings:   encode time =   238.64 ms /     1 runs (  238.64 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   821.07 ms /   148 runs (    5.55 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  3498.43 ms
$ ./main -m ./models/ggml-large-v3.bin -l ru --threads 8 -mc 0 samples/zeroes.wav -ng
whisper_init_from_file_with_params_no_state: loading model from './models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
whisper_model_load:      CPU total size =  3094.36 MB
whisper_model_load: model size    = 3094.36 MB
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   36.26 MB
whisper_init_state: compute buffer (encode) =  926.66 MB
whisper_init_state: compute buffer (cross)  =    9.38 MB
whisper_init_state: compute buffer (decode) =  209.26 MB

system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/zeroes.wav' (19200 samples, 1.2 sec), 8 threads, 1 processors, 5 beams + best of 5, lang = ru, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:29.980]   Субтитры создавал DimaTorzok ("Subtitles created by DimaTorzok")

whisper_print_timings:     load time =   957.60 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     6.50 ms
whisper_print_timings:   sample time =    24.92 ms /    75 runs (    0.33 ms per run)
whisper_print_timings:   encode time =  4063.61 ms /     1 runs ( 4063.61 ms per run)
whisper_print_timings:   decode time =   565.81 ms /    10 runs (   56.58 ms per run)
whisper_print_timings:   batchd time =  1186.10 ms /    63 runs (   18.83 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  6809.96 ms

I checked this on the latest master branch:

$   git describe --tags
v1.5.4-183-gb602819

I think this is a bug.

misutoneko commented 7 months ago

This seems to be dependent on the language; I see a similar effect with -l fi and several others. My understanding is that the problem originates in the training data, so in that sense it can only be worked around, not really fixed. The model doesn't give you a "Russian silence" token because there was no such thing in the training data to begin with. It can perhaps give you an English or Italian one, but it's a different set of tokens for each language. Still, I suppose entropy or the compression ratio should give a hint that this is a non-speech portion, even without involving the model?
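
A minimal sketch of that idea on the caller side (a hypothetical helper, not part of whisper.cpp's API): compute the RMS energy of each chunk before handing it to whisper_full, and skip chunks that are effectively silent. An all-zero buffer has zero energy, so the file from this issue would never reach the decoder:

#include <cmath>
#include <cstddef>

// Hypothetical caller-side pre-filter, not part of whisper.cpp.
// pcm holds float samples in [-1, 1], as whisper.cpp consumes them.
bool chunk_has_signal(const float *pcm, size_t n, float rms_threshold = 1e-4f) {
    if (n == 0) return false;
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc += (double) pcm[i] * pcm[i];
    }
    const double rms = std::sqrt(acc / (double) n);
    return rms > rms_threshold;   // an all-zero buffer has rms == 0
}

The threshold value here is a guess and would need tuning for near-silence over background noise; a proper VAD is more robust for that case.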

Multilingual is a bit tricky anyway, because once you set the language you can't change it (as discussed in #1800). So you can't really detect an "English silence" and then switch languages, unless you cut the sample into smaller pieces with VAD/demucs/whatever. By the way, I've actually tried giving the model multiple language tokens to see what happens, but it didn't work very well.

superchargez commented 7 months ago

> This seems to be dependent on the language, I see a similar effect with -l fi and several others. […]

I reached the same conclusion about Urdu: the model is limited and not very good for low-resource languages, and it can't handle silence in Urdu. I couldn't find any VAD model that did well with Urdu non-speech either, so I'm stuck with a high WER.

DenisBalan commented 1 day ago

I'm also seeing weird sentences coming out of nowhere in Russian: "Редактор субтитров А.Семкин Корректор А.Егорова" ("Subtitle editor A. Semkin, corrector A. Egorova").

I also found this list of known hallucinations:

https://gist.github.com/waveletdeboshir/8bf52f04bf78018194f25b2390c08309
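
One blunt way to use such a list is a caller-side post-filter that drops segments whose trimmed text matches a known hallucination. A sketch (the entries below are just the strings seen in this thread, not the full gist):

#include <string>
#include <unordered_set>

// Hypothetical post-filter; the set would be populated from the gist above.
static const std::unordered_set<std::string> k_hallucinations = {
    "Продолжение следует...",                           // "To be continued..."
    "Редактор субтитров А.Семкин Корректор А.Егорова",  // subtitle credits
    "Субтитры создавал DimaTorzok",                     // subtitle credits
};

// Expects the segment text already trimmed of leading/trailing whitespace,
// since the segments printed by main carry leading spaces.
bool is_hallucinated_segment(const std::string &text) {
    return k_hallucinations.count(text) > 0;
}

Exact matching only catches verbatim repeats; the gist lists many near-duplicates, so substring or fuzzy matching may be needed in practice.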