ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

It seems that there is no performance gain utilizing Core ML #2057

Open MichelBahl opened 7 months ago

MichelBahl commented 7 months ago

I think Core ML is set up correctly:

Start whisper.cpp with:

./main --language de -t 10 -m models/ggml-medium.bin -f /Users/michaelbahl/Downloads/testcast.wav

whisper_init_state: loading Core ML model from 'models/ggml-medium-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     6.78 MiB, ( 1738.41 / 49152.00)
whisper_init_state: compute buffer (conv)   =    8.81 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     5.86 MiB, ( 1744.27 / 49152.00)
whisper_init_state: compute buffer (cross)  =    7.85 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   130.83 MiB, ( 1875.09 / 49152.00)
whisper_init_state: compute buffer (decode) =  138.87 MB
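
For reference, these are the standard Core ML setup steps from the whisper.cpp README (a sketch; I'm assuming the Makefile build here, since the binary is ./main):

# dependencies for the conversion script
pip install ane_transformers openai-whisper coremltools

# generate models/ggml-medium-encoder.mlmodelc
./models/generate-coreml-model.sh medium

# rebuild with Core ML support
make clean
WHISPER_COREML=1 make -j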

system_info: n_threads = 10 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 1 | OPENVINO = 0

main: processing '/Users/michaelbahl/Downloads/testcast.wav' (8126607 samples, 507.9 sec), 10 threads, 1 processors, 5 beams + best of 5, lang = de, task = transcribe, timestamps = 1 ...

Runtime (COREML):

whisper_print_timings:     load time =   442.59 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =   140.54 ms
whisper_print_timings:   sample time = 13079.59 ms / 12370 runs (    1.06 ms per run)
whisper_print_timings:   encode time =  6931.83 ms /    21 runs (  330.09 ms per run)
whisper_print_timings:   decode time =   273.79 ms /    27 runs (   10.14 ms per run)
whisper_print_timings:   batchd time = 52941.25 ms / 12239 runs (    4.33 ms per run)
whisper_print_timings:   prompt time =  1136.64 ms /  4434 runs (    0.26 ms per run)
whisper_print_timings:    total time = 75668.75 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating

Runtime (normal):

whisper_print_timings:     load time =   548.92 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   144.93 ms
whisper_print_timings:   sample time = 12857.83 ms / 12239 runs (    1.05 ms per run)
whisper_print_timings:   encode time =  5827.67 ms /    21 runs (  277.51 ms per run)
whisper_print_timings:   decode time =   572.82 ms /    58 runs (    9.88 ms per run)
whisper_print_timings:   batchd time = 52036.77 ms / 12079 runs (    4.31 ms per run)
whisper_print_timings:   prompt time =  1132.30 ms /  4434 runs (    0.26 ms per run)
whisper_print_timings:    total time = 73148.27 ms

Did I miss something for a faster transcription? If anything, the per-run encode time is higher with Core ML here (330 ms vs. 278 ms).

ggerganov commented 7 months ago

Depending on your hardware (GPU cores / ANE cores), Core ML might or might not be faster:

https://github.com/ggerganov/whisper.cpp/discussions/1722#discussioncomment-8011884
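
To compare just the encoder (the only part that Core ML offloads), a quick check is the bundled bench example - a sketch, assuming the Makefile build:

# encoder-only benchmark; run once with the Core ML model present and once without it
make bench
./bench -m models/ggml-medium.bin -t 10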

ggerganov commented 7 months ago

Also, try to generate ANE-optimized Core ML models - this can result in an extra improvement:

https://github.com/ggerganov/whisper.cpp/pull/1716
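
A sketch of regenerating the encoder after updating to a version that includes that PR (assuming the conversion script picks up the ANE optimizations; the exact options may differ):

# re-run the conversion so the new encoder replaces the old mlmodelc
./models/generate-coreml-model.sh medium

Note that the first run after regenerating recompiles the model on-device (as the "first run on a device may take a while" log line indicates), so time a second run.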