ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

CUDA error #2258

Closed MathiasSchindler closed 4 months ago

MathiasSchindler commented 4 months ago

When using whisper.cpp with CUDA compilation, the model starts as usual but crashes after a brief moment. Using the -ng flag to disable the GPU, the model works with the expected CPU speed.

mathias@mathias-b650:~/whisper.cpp$ ./main -m models/ggml-large-v3.bin samples/bundestag-svea.wav
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
whisper_model_load:    CUDA0 total size =  3094.36 MB
whisper_model_load: model size    = 3094.36 MB
whisper_backend_init_gpu: using CUDA backend
whisper_mel_init: n_len = 3001, n_len_org = 1, n_mel = 128
whisper_mel_init: n_len = 6000, n_len_org = 6000, n_mel = 128
whisper_init_state: kv self size  =  251.66 MB
whisper_init_state: kv cross size =  251.66 MB
whisper_init_state: kv pad  size  =    7.86 MB
whisper_init_state: compute buffer (conv)   =   36.26 MB
whisper_init_state: compute buffer (encode) =  926.66 MB
whisper_init_state: compute buffer (cross)  =    9.38 MB
whisper_init_state: compute buffer (decode) =  215.95 MB

system_info: n_threads = 4 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0

main: processing 'samples/bundestag-svea.wav' (108898656 samples, 6806.2 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

whisper_mel_init: n_len = 683616, n_len_org = 680616, n_mel = 128

[00:00:00.240 --> 00:00:10.160]   So, I welcome everyone to our 57th session of the Digital Committee and to the public hearing.
[00:00:10.160 --> 00:00:14.920]   Today we have set up a single agenda item.
[00:00:14.920 --> 00:00:19.480]   I'll start with, this is now just a formality for those who are sitting.
[00:00:19.480 --> 00:00:25.360]   So it's about the federal government's bill, namely the draft of a bill for the implementation
CUDA error: invalid argument
  current device: 0, in function ggml_backend_cuda_graph_compute at ggml-cuda.cu:2689
  cudaGraphKernelNodeSetParams(cuda_ctx->cuda_graph->nodes[i], &cuda_ctx->cuda_graph->params[i])
GGML_ASSERT: ggml-cuda.cu:100: !"CUDA error"
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)

bobqianic commented 4 months ago

CC: @slaren

bobqianic commented 4 months ago

@MathiasSchindler Can you list the versions of Linux and CUDA you are using?

ggerganov commented 4 months ago

Make sure to use the latest whisper.cpp version and, if possible, provide a sample audio file that reproduces the issue.

deepakjois commented 4 months ago

Ran into what I believe is the same issue. (Ubuntu 22.04 running with GPU instance on fly.io, built using a Dockerfile identical to the one in this repo).

Sample Audio: https://www.listennotes.com/e/p/aa96e274a3a845a086dffc6c02d60288/

(Converted to wav on the same machine with ffmpeg -i test.mp3 -ar 16000 -ac 1 -c:a pcm_s16le test.wav)

root@4d891513ce1648:/app# ./main -m models/ggml-large-v3.bin /home/deepak/test.wav
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
whisper_model_load:    CUDA0 total size =  3094.36 MB
whisper_model_load: model size    = 3094.36 MB
whisper_backend_init_gpu: using CUDA backend
whisper_mel_init: n_len = 3001, n_len_org = 1, n_mel = 128
whisper_mel_init: n_len = 6000, n_len_org = 6000, n_mel = 128
whisper_init_state: kv self size  =  251.66 MB
whisper_init_state: kv cross size =  251.66 MB
whisper_init_state: kv pad  size  =    7.86 MB
whisper_init_state: compute buffer (conv)   =   36.26 MB
whisper_init_state: compute buffer (encode) =  926.66 MB
whisper_init_state: compute buffer (cross)  =    9.38 MB
whisper_init_state: compute buffer (decode) =  215.95 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0

main: processing '/home/deepak/test.wav' (114111634 samples, 7132.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

whisper_mel_init: n_len = 716197, n_len_org = 713197, n_mel = 128

[00:00:00.160 --> 00:00:06.100]   The science and practice of enhancing human performance for sport, play, and life.
[00:00:06.100 --> 00:00:08.580]   Welcome to Perform.
[00:00:08.580 --> 00:00:14.020]   I'm Andy Galpin, a professor of kinesiology in the Center for Sport Performance at Cal State Fullerton.
[00:00:14.020 --> 00:00:16.580]   In today's episode, we're going to be talking about the heart.
[00:00:16.580 --> 00:00:19.540]   And I'd like to start with a very simple question.
[00:00:19.540 --> 00:00:21.920]   And that is, why do you breathe?
[00:00:21.920 --> 00:00:24.560]   Now, that may have caught you off guard.
[00:00:24.560 --> 00:00:26.880]   And so I'll let you think about it for a quick second.
[00:00:26.880 --> 00:00:29.360]   Why is it that you breathe?
CUDA error: invalid argument
  current device: 0, in function ggml_backend_cuda_graph_compute at ggml-cuda.cu:2689
  cudaGraphKernelNodeSetParams(cuda_ctx->cuda_graph->nodes[i], &cuda_ctx->cuda_graph->params[i])
GGML_ASSERT: ggml-cuda.cu:100: !"CUDA error"
Aborted
root@4d891513ce1648:/app#
slaren commented 4 months ago

CUDA graphs should not be used outside of llama.cpp, the code is very finicky and it will not work correctly in different situations. To disable it, GGML_CUDA_USE_GRAPHS should not be defined in the build scripts.

deepakjois commented 4 months ago

fwiw, setting the env variable GGML_CUDA_DISABLE_GRAPHS to 1 makes the error go away. So I am thinking it's something to do with 24f0aa46 that went in last month.

MathiasSchindler commented 4 months ago

@MathiasSchindler Can you list the versions of Linux and CUDA you are using?

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

$ uname -a
Linux mathias-b650 6.8.0-35-generic #35-Ubuntu SMP PREEMPT_DYNAMIC Mon May 20 15:51:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Ubuntu Linux 24.04

MathiasSchindler commented 4 months ago

I am compiling whisper.cpp from the current master and I ran into the following compile error:

cc -Iggml/include -Iggml/src -Iinclude -Isrc -Iexamples -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include -DGGML_CUDA_USE_GRAPHS  -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -fopenmp -Wdouble-promotion  -c tests/test-c.c -o tests/test-c.o
In file included from examples/common.cpp:8:
examples/dr_wav.h:3643:24: warning: no previous declaration for ‘drwav_bool32 drwav_seek_to_first_pcm_frame(drwav*)’ [-Wmissing-declarations]
 3643 | DRWAV_API drwav_bool32 drwav_seek_to_first_pcm_frame(drwav* pWav)
      |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/whisper-mel-cuda.cu(176): error: identifier "ggml_backend_cuda_context" is undefined

src/whisper-mel-cuda.cu(176): error: identifier "cuda_ctx" is undefined

src/whisper-mel-cuda.cu(176): error: expected an expression

src/whisper-mel-cuda.cu(176): error: expected a ";"

src/whisper-mel-cuda.cu(179): error: identifier "ggml_cuda_info" is undefined

src/whisper-mel-cuda.cu(186): error: identifier "ggml_cuda_set_device" is undefined

src/whisper-mel-cuda.cu(193): error: identifier "CUDA_CHECK" is undefined

src/whisper-mel-cuda.cu(194): error: identifier "CUBLAS_CHECK" is undefined

src/whisper-mel-cuda.cu(217): error: identifier "ggml_cuda_set_device" is undefined

src/whisper-mel-cuda.cu(218): error: identifier "CUDA_CHECK" is undefined

src/whisper-mel-cuda.cu(237): error: identifier "CUDA_CHECK" is undefined

src/whisper-mel-cuda.cu(241): error: identifier "CUDA_CHECK_GEN" is undefined

src/whisper-mel-cuda.cu(242): error: identifier "CUDA_CHECK" is undefined

src/whisper-mel-cuda.cu(248): error: identifier "CUDA_CHECK" is undefined

src/whisper-mel-cuda.cu(260): error: identifier "CUDA_CHECK" is undefined

src/whisper-mel-cuda.cu(267): error: identifier "ggml_cuda_set_device" is undefined

src/whisper-mel-cuda.cu(286): error: identifier "CUDA_CHECK" is undefined

src/whisper-mel-cuda.cu(298): error: identifier "CUDA_CHECK_GEN" is undefined

src/whisper-mel-cuda.cu(318): error: identifier "CUBLAS_CHECK" is undefined

19 errors detected in the compilation of "src/whisper-mel-cuda.cu".
make: *** [Makefile:624: src/whisper-mel-cuda.o] Error 2
make: *** Waiting for unfinished jobs....
ggerganov commented 4 months ago

dc8cc2dd6fcba4629af7ec751ca42ab13f7d6e4e should fix this

MathiasSchindler commented 4 months ago

dc8cc2d should fix this

I have downloaded 0a55a70 and the compile error remains. I double-checked that I am on the current master branch.

A while ago, my system updated some nvidia drivers from version 535 to 550 but I am uncertain if that is a factor here since I had successful compilations after that update.

ggerganov commented 4 months ago

Try to remove the build folder and build from scratch

MathiasSchindler commented 4 months ago

Try to remove the build folder and build from scratch

I used make clean. I also downloaded the source from GitHub to a different directory and tried to build it from scratch with

WHISPER_CUDA=1 make -j

with the same result.

Building it with plain make instead does work; there are, however, two warnings:

cc -Iggml/include -Iggml/src -Iinclude -Isrc -Iexamples -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -fopenmp -Wdouble-promotion     -c ggml/src/ggml-quants.c -o ggml/src/ggml-quants.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Iexamples -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP  -c src/whisper.cpp -o src/whisper.o
src/whisper.cpp: In function ‘ggml_backend* whisper_backend_init_gpu(const whisper_context_params&)’:
src/whisper.cpp:1234:79: warning: unused parameter ‘params’ [-Wunused-parameter]
 1234 | static ggml_backend_t whisper_backend_init_gpu(const whisper_context_params & params) {
      |                                                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Iexamples -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP  -c examples/common.cpp -o examples/common.o
In file included from examples/common.cpp:8:
examples/dr_wav.h:3643:24: warning: no previous declaration for ‘drwav_bool32 drwav_seek_to_first_pcm_frame(drwav*)’ [-Wmissing-declarations]
 3643 | DRWAV_API drwav_bool32 drwav_seek_to_first_pcm_frame(drwav* pWav)
      |                        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggerganov commented 4 months ago

Ah, I forgot to update the Makefile build. Should be fixed now with 9f7f36d4c95356bff1c287beab26b6a4538e2230

slaren commented 4 months ago

The Makefile is still using GGML_CUDA_USE_GRAPHS.

MathiasSchindler commented 4 months ago

Ah, I forgot to update the Makefile build. Should be fixed now with 9f7f36d

On the plus side, whisper.cpp now compiles again, thank you.

On the negative side, running it works for the first few seconds and then results in a CUDA error: invalid argument:

$ ./main -m models/ggml-large-v3.bin samples/bundestag-svea.wav 
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
whisper_model_load:    CUDA0 total size =  3094.36 MB
whisper_model_load: model size    = 3094.36 MB
whisper_backend_init_gpu: using CUDA backend
whisper_mel_init: n_len = 6000, n_len_org = 6000, n_mel = 128
whisper_init_state: kv self size  =  251.66 MB
whisper_init_state: kv cross size =  251.66 MB
whisper_init_state: kv pad  size  =    7.86 MB
whisper_init_state: compute buffer (conv)   =   36.13 MB
whisper_init_state: compute buffer (encode) =  926.53 MB
whisper_init_state: compute buffer (cross)  =    9.25 MB
whisper_init_state: compute buffer (decode) =  215.82 MB

system_info: n_threads = 4 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0

main: processing 'samples/bundestag-svea.wav' (108898656 samples, 6806.2 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

whisper_mel_init: n_len = 683616, n_len_org = 680616, n_mel = 128

[00:00:00.240 --> 00:00:08.680]   So, I welcome everyone to our 57th session of the Digital Committee to the public hearing.
[00:00:08.680 --> 00:00:13.720]   Today we have set up a single agenda item.
[00:00:13.720 --> 00:00:19.440]   I'll start with the formal ones for those who are sitting.
[00:00:19.440 --> 00:00:25.480]   So it's about the bill of the federal government, namely the draft of a bill for the implementation
CUDA error: invalid argument
  current device: 0, in function ggml_backend_cuda_graph_compute at ggml/src/ggml-cuda.cu:2664
  cudaGraphKernelNodeSetParams(cuda_ctx->cuda_graph->nodes[i], &cuda_ctx->cuda_graph->params[i])
GGML_ASSERT: ggml/src/ggml-cuda.cu:100: !"CUDA error"
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)
ggerganov commented 4 months ago

Should be fixed now

MathiasSchindler commented 4 months ago

Should be fixed now

Works like a charm now. Thank you. I believe the error is now gone.