ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++

Running on GPU is slower. #1540

Open 100tomer opened 10 months ago

100tomer commented 10 months ago

Hi. I don't know why, but when I tried the new release on both an M1 Pro and an M2 Pro, it is much slower than before. I do see that it now uses 100% of the GPU, but compared to the CPU it takes more time: my sample audio takes 3 minutes on the GPU and a little over 1 minute on the CPU.
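
For anyone reproducing this comparison, a minimal sketch (model and audio paths are placeholders; on Apple Silicon, v1.5.0 uses the Metal GPU backend by default, and the -ng/--no-gpu flag is assumed to be present in this build):

# GPU run (the default in v1.5.0 on Apple Silicon)
./main -m models/ggml-medium.bin -f samples/audio.wav

# CPU-only run for comparison (assumes -ng/--no-gpu is available)
./main -m models/ggml-medium.bin -f samples/audio.wav -ng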

ggerganov commented 10 months ago

The old release was using -bs 1. The new release uses -bs 5. Try matching it.
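
For reference, beam size is set with the -bs flag of the main example, so matching the old default looks like this (model and audio paths are placeholders):

# force the old default beam size on the new release
./main -m models/ggml-large-v3.bin -f sample.wav -bs 1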

100tomer commented 10 months ago

> The old release was using -bs 1. The new release uses -bs 5. Try matching it.

Tried that, but nothing changed. Also, before I was using the defaults for both tests. My new test on M1 Pro: CPU with the default beam size and 8/10 threads takes 1:22; GPU with beam size 5 and 8/10 threads takes 2:32.

Also, does the thread count matter when running on the GPU?

Alvarocda commented 10 months ago

I also noticed the slowdown when using the GPU.

I have a 17-minute audio file that I always use for testing. In version 1.4.0, transcription took 15 minutes. In version 1.4.3 (BETA), it took 13 minutes. But in version 1.5.0, the transcription time increased to 20 minutes.

Same hardware; I tried running with the -bs 1 parameter, but the performance still didn't improve.

These tests were all performed on the same computer.

Alvarocda commented 10 months ago

I don't know if this helps, but here are the output logs of the transcriptions performed with versions 1.5.0 and 1.4.3.

Version 1.5.0

./main -f youtube.wav -l portuguese -m ../../ggml-large-v3.bin -bs 1 -debug      
whisper_init_from_file_with_params_no_state: loading model from '../../ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA T1000 8GB, compute capability 7.5
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =  3117.87 MB
whisper_model_load: model size    = 3117.39 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   32.36 MB
whisper_init_state: compute buffer (encode) =  212.36 MB
whisper_init_state: compute buffer (cross)  =    9.32 MB
whisper_init_state: compute buffer (decode) =   99.17 MB

system_info: n_threads = 4 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'youtube.wav' (960006 samples, 60.0 sec), 4 threads, 1 processors, 1 beams + best of 5, lang = portuguese, task = transcribe, timestamps = 1 ...

whisper_print_timings:     load time =  1199.26 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    48.17 ms
whisper_print_timings:   sample time =   153.65 ms /     1 runs (  153.65 ms per run)
whisper_print_timings:   encode time = 21681.65 ms /     3 runs ( 7227.22 ms per run)
whisper_print_timings:   decode time = 40703.03 ms /   342 runs (  119.01 ms per run)
whisper_print_timings:   batchd time =  2076.69 ms /     6 runs (  346.12 ms per run)
whisper_print_timings:   prompt time =  2319.00 ms /   176 runs (   13.18 ms per run)
whisper_print_timings:    total time = 68187.45 ms

Version 1.4.3

 ./main -f youtube.wav -l portuguese -m ../ggml-large-v3.bin -debug               
whisper_init_from_file_with_params_no_state: loading model from '../ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
whisper_model_load: model ctx     = 2951.63 MB
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA T1000 8GB, compute capability 7.5
whisper_model_load: model size    = 2951.01 MB
whisper_init_state: kv self size  =   70.00 MB
whisper_init_state: kv cross size =  234.38 MB
whisper_init_state: compute buffer (conv)   =   41.85 MB
whisper_init_state: compute buffer (encode) =  202.52 MB
whisper_init_state: compute buffer (cross)  =    8.89 MB
whisper_init_state: compute buffer (decode) =   59.40 MB

system_info: n_threads = 4 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 0 | 

main: processing 'youtube.wav' (960006 samples, 60.0 sec), 4 threads, 1 processors, lang = portuguese, task = transcribe, timestamps = 1 ...

whisper_print_timings:     load time =  1555.60 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    48.63 ms
whisper_print_timings:   sample time =   140.16 ms /   345 runs (    0.41 ms per run)
whisper_print_timings:   encode time = 31458.67 ms /     3 runs (10486.22 ms per run)
whisper_print_timings:   decode time = 21280.85 ms /   342 runs (   62.22 ms per run)
whisper_print_timings:   prompt time =  1931.79 ms /     3 runs (  643.93 ms per run)
whisper_print_timings:    total time = 56584.43 ms
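
(Comparing the per-run timings above: v1.5.0 cuts the encode time per run from 10486.22 ms to 7227.22 ms, but roughly doubles the decode time per run, from 62.22 ms to 119.01 ms over the same 342 runs; that decode regression is what makes the overall transcription slower.)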

ggerganov commented 9 months ago

@Alvarocda Could you please run the same test using https://github.com/ggerganov/whisper.cpp/pull/1559 and post the output?
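
One way to build an unmerged PR locally for such a test, as a sketch (the pull/<id>/head ref is standard GitHub; WHISPER_CUBLAS=1 was the CUDA build flag at the time):

# fetch the PR branch into a local branch and rebuild
git fetch origin pull/1559/head:pr-1559
git checkout pr-1559
make clean
WHISPER_CUBLAS=1 make -j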

Alvarocda commented 9 months ago

> @Alvarocda Could you please run the same test using #1559 and post the output?

WOW, now it's much faster!

 ./main -f youtube.wav -l portuguese -m ../../ggml-large-v3.bin -bs 1 -debug
whisper_init_from_file_with_params_no_state: loading model from '../../ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA T1000 8GB, compute capability 7.5
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =  3117.87 MB
whisper_model_load: model size    = 3117.39 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   32.36 MB
whisper_init_state: compute buffer (encode) =  212.36 MB
whisper_init_state: compute buffer (cross)  =    9.32 MB
whisper_init_state: compute buffer (decode) =   99.17 MB

system_info: n_threads = 4 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'youtube.wav' (960006 samples, 60.0 sec), 4 threads, 1 processors, 1 beams + best of 5, lang = portuguese, task = transcribe, timestamps = 1 ...

whisper_print_timings:     load time =  1181.33 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    49.89 ms
whisper_print_timings:   sample time =   142.15 ms /     1 runs (  142.15 ms per run)
whisper_print_timings:   encode time = 22455.02 ms /     3 runs ( 7485.01 ms per run)
whisper_print_timings:   decode time = 11801.63 ms /   342 runs (   34.51 ms per run)
whisper_print_timings:   batchd time =  2177.27 ms /     6 runs (  362.88 ms per run)
whisper_print_timings:   prompt time =  2268.01 ms /   176 runs (   12.89 ms per run)
whisper_print_timings:    total time = 40081.00 ms
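
(Comparing with the earlier v1.5.0 log, same -bs 1 settings: decode time per run drops from 119.01 ms to 34.51 ms, about 3.4x faster, which accounts for nearly all of the total-time reduction from 68187.45 ms to 40081.00 ms.)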

> I have a 17-minute audio file that I always use for testing. In version 1.4.0, transcription took 15 minutes. In version 1.4.3 (BETA), it took 13 minutes. But in version 1.5.0, the transcription time increased to 20 minutes.

Now that same 17-minute audio is transcribed in 10 minutes.

100tomer commented 9 months ago

Wait, how did you solve it all of a sudden?