Whisper crashes calling ggml_init() with CUDA enabled

CassinianSoftware commented 1 month ago

I'm running Visual Studio 2022 (latest update) and CUDA 12.6 on a Dell T5600 with a pair of GeForce 1050 Ti GPUs (which I realize are old Pascal chips) and Windows 10 (latest update). I compiled WhisperCpp, ggml, and SDL2 without issue (as static libs) and tested using the command.cpp demo console app.

Fun app. Worked fine, but performance was sluggish. So I set GGML_USE_CUDA and recompiled with CUDA and cuBLAS in a bid to improve performance. After a bit of trial-and-error getting everything to compile and link, I was able to test again with command.exe. Unfortunately, whisper.cpp is now crashing at "model.ctx = ggml_init(params);" at around line 1620. Execution never gets to "if (!model.ctx)" so the error "ggml_init() failed" is not displayed.

It seems like an issue with memory allocation, specifically the value of "n_tensors * ggml_tensor_overhead()" in "params", but I'm not sure about the value of "n_tensors" because it contains hard-coded values (i.e. 10 + 15 + 15 * n_audio_layer + 24 * n_text_layer). What do 10, 15, 15, and 24 represent? The proper allocation of memory with ggml_init() seems important, but this appears to be an odd way to calculate it.
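For a sense of scale, the quoted expression can be evaluated by hand. This is a minimal sketch, assuming 6 audio layers and 6 text layers as illustrative inputs and a placeholder per-tensor overhead; the helper names are hypothetical, and the real overhead comes from ggml_tensor_overhead(), which is not reproduced here. It shows what the sum comes to, not what the constants count:

```cpp
#include <cassert>
#include <cstddef>

// The expression quoted above, evaluated for given layer counts.
constexpr std::size_t n_tensors_for(std::size_t n_audio_layer, std::size_t n_text_layer) {
    return 10 + 15 + 15 * n_audio_layer + 24 * n_text_layer;
}

// Bytes requested from ggml_init(), assuming mem_size = n_tensors * overhead.
// `overhead_bytes` is a hypothetical stand-in for ggml_tensor_overhead().
inline std::size_t metadata_bytes(std::size_t n_tensors, std::size_t overhead_bytes) {
    assert(overhead_bytes > 0);
    return n_tensors * overhead_bytes;
}

// 10 + 15 + 15*6 + 24*6 = 25 + 90 + 144 = 259
static_assert(n_tensors_for(6, 6) == 259, "worked arithmetic for 6 + 6 layers");
```

With only a few hundred tensors, the request stays small for any plausible per-tensor overhead, which suggests the size calculation alone is unlikely to exhaust memory.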

Or, am I chasing the wrong problem? Any suggestions would be most appreciated. Thanks!

UPDATE: I've been able to confirm that ggml is crashing WhisperCpp in ggml.c at this line: "float f = ggml_table_f32_f16[i] = GGML_COMPUTE_FP16_TO_FP32(u.fp16);"

...which is in the function ggml_init(). That function starts at around line 3469 of ggml.c, and the offending line is at around line 3500. Not sure why this would be an issue when CUDA is enabled but not when it is disabled.
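The quoted line looks like one step of a loop that fills a half-precision-to-float lookup table. The following is a minimal sketch of that idea, assuming one table entry per 16-bit pattern and using a portable software conversion; the function names are hypothetical and this is not ggml's actual code:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Portable conversion of an IEEE 754 binary16 bit pattern to float.
// Layout: 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits.
inline float fp16_to_fp32(std::uint16_t h) {
    const bool negative = (h & 0x8000u) != 0;
    const int  exponent = (h >> 10) & 0x1F;
    const int  mantissa = h & 0x3FF;

    float value;
    if (exponent == 0) {
        // Zero or subnormal: mantissa * 2^-24.
        value = std::ldexp(static_cast<float>(mantissa), -24);
    } else if (exponent == 0x1F) {
        // All-ones exponent: infinity (mantissa == 0) or NaN.
        value = mantissa != 0 ? std::numeric_limits<float>::quiet_NaN()
                              : std::numeric_limits<float>::infinity();
    } else {
        // Normal: (1024 + mantissa) * 2^(exponent - 15 - 10).
        value = std::ldexp(static_cast<float>(mantissa | 0x400), exponent - 25);
    }
    return negative ? -value : value;
}

// Precompute the float value of every possible 16-bit pattern.
inline std::vector<float> build_fp16_table() {
    std::vector<float> table(1u << 16);
    for (std::uint32_t i = 0; i < table.size(); ++i) {
        table[i] = fp16_to_fp32(static_cast<std::uint16_t>(i));
    }
    assert(table.size() == 65536);
    return table;
}
```

A conversion written this way uses no special CPU instructions, so it cannot fault on older hardware. If the macro instead expands to a hardware intrinsic, the very first loop iteration depends on the CPU supporting that instruction.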

ggerganov commented 1 month ago

Hm, not sure. I just tested the latest master on my Linux CUDA box and all seems good:

$ make clean && GGML_CUDA=1 make -j && ./main -m models/ggml-base.en.bin -f ./samples/jfk.wav

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes
whisper_model_load:    CUDA0 total size =   147.37 MB
whisper_model_load: model size    =  147.37 MB
whisper_backend_init_gpu: using CUDA backend
whisper_init_state: kv self size  =   18.87 MB
whisper_init_state: kv cross size =   18.87 MB
whisper_init_state: kv pad  size  =    3.15 MB
whisper_init_state: compute buffer (conv)   =   16.26 MB
whisper_init_state: compute buffer (encode) =  131.94 MB
whisper_init_state: compute buffer (cross)  =    4.65 MB
whisper_init_state: compute buffer (decode) =   98.19 MB

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | CANN = 0

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

whisper_print_timings:     load time =   339.47 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    12.37 ms
whisper_print_timings:   sample time =    34.35 ms /   131 runs (    0.26 ms per run)
whisper_print_timings:   encode time =    54.12 ms /     1 runs (   54.12 ms per run)
whisper_print_timings:   decode time =     9.23 ms /     2 runs (    4.62 ms per run)
whisper_print_timings:   batchd time =    59.30 ms /   125 runs (    0.47 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =   514.95 ms

CassinianSoftware commented 1 month ago

Thanks for checking! I think I have it fixed. My problem appears to have been with the x64 intrinsic '_mm_cvtss_f32()' not with CUDA. When I reconfigured my VC++ compiler options for CUDA, I appear to have inadvertently caused the definition of your 'GGML_COMPUTE_FP16_TO_FP32' to change. It should have been:

'ggml_compute_fp16_to_fp32(x)'

..but instead, switched to this:

'_mm_cvtss_f32(_mm_cvtph_ps(_mm_cvtsi32_si128(x)))'

..because my enhanced instruction set option was incorrectly set for AVX2 (i.e. /arch:AVX2). My bad. This all appears to have been related to the Windows 10 SDK and completely unrelated to WhisperCpp or CUDA. When I switched /arch from AVX2 to SSE2, it appeared to work fine. I suspect the problem arose because my Xeon CPUs predate the release of AVX2.
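One way to check a suspicion like this is to ask the CPU which extensions it reports. `_mm_cvtph_ps` belongs to the F16C extension, and executing it on a processor without F16C raises an illegal-instruction fault. The sketch below is a simplified check using CPUID, with hypothetical helper names; it reads only the CPU feature bits and, as an assumption made for brevity, skips the OS-support check (OSXSAVE/XGETBV) that a complete AVX test would include:

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
  #include <intrin.h>
  #define SKETCH_X86_MSVC 1
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
  #include <cpuid.h>
  #define SKETCH_X86_GNU 1
#endif

struct CpuFeatures { bool avx; bool f16c; bool avx2; };

constexpr bool bit_set(std::uint32_t reg, unsigned bit) {
    return ((reg >> bit) & 1u) != 0;
}

// Fills regs with EAX, EBX, ECX, EDX for the given leaf and subleaf.
// Returns false when the leaf is unavailable or the target is not x86.
inline bool query_cpuid(std::uint32_t leaf, std::uint32_t subleaf, std::uint32_t regs[4]) {
    assert(regs != nullptr);
#if defined(SKETCH_X86_MSVC)
    int r[4];
    __cpuid(r, 0);                                   // EAX = highest supported leaf
    if (static_cast<std::uint32_t>(r[0]) < leaf) return false;
    __cpuidex(r, static_cast<int>(leaf), static_cast<int>(subleaf));
    for (int i = 0; i < 4; ++i) regs[i] = static_cast<std::uint32_t>(r[i]);
    return true;
#elif defined(SKETCH_X86_GNU)
    unsigned a = 0, b = 0, c = 0, d = 0;
    if (!__get_cpuid_count(leaf, subleaf, &a, &b, &c, &d)) return false;
    regs[0] = a; regs[1] = b; regs[2] = c; regs[3] = d;
    return true;
#else
    (void)leaf; (void)subleaf;
    return false;
#endif
}

inline CpuFeatures detect_cpu_features() {
    CpuFeatures f{false, false, false};
    std::uint32_t r[4] = {0, 0, 0, 0};
    if (query_cpuid(1, 0, r)) {
        f.avx  = bit_set(r[2], 28);  // leaf 1, ECX bit 28
        f.f16c = bit_set(r[2], 29);  // leaf 1, ECX bit 29
    }
    if (query_cpuid(7, 0, r)) {
        f.avx2 = bit_set(r[1], 5);   // leaf 7 subleaf 0, EBX bit 5
    }
    return f;
}
```

Calling detect_cpu_features() at startup and printing the three flags would show whether the machine reports F16C and AVX2. If either is missing, a binary built with /arch:AVX2 may fault on the first such instruction it executes.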

I should have traced this further on my end before posting it as an issue. But thanks again for taking a look. This is a very nice project, and I suspect that quite a bit of work went into it.

ggerganov / whisper.cpp

Whisper crashes calling ggml_init() with CUDA enabled #2421