What kind of performance can we expect?

genglinxiao commented 5 months ago

I'm experimenting with streaming mode on an M2 MacBook Air and found that roughly 1/3 of the speech is not recognized. Is that expected, do I need more RAM, or did something else go wrong? I tried both the medium and large_v3 models. Here's one of the commands and its initial output:

```
./stream --model models/ggml-large-v3.bin --language zh --step 0 --length 4000
init: found 2 capture devices:
init: - Capture device #0: 'MacBook Air麦克风'
init: - Capture device #1: 'Microsoft Teams Audio'
init: attempt to open default capture device ...
init: obtained spec for input device (SDL Id = 2):
init: - sample rate: 16000
init: - format: 33056 (required: 33056)
init: - channels: 1 (required: 1)
init: - samples per frame: 1024
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51866
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 128
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs = 100
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/linxiaogeng/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 17179.89 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 2951.02 MiB, ( 2952.89 / 16384.02)
whisper_model_load: Metal total size = 3094.36 MB
whisper_model_load: model size = 3094.36 MB
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/linxiaogeng/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 17179.89 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 210.00 MiB, ( 3163.89 / 16384.02)
whisper_init_state: kv self size = 220.20 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 234.38 MiB, ( 3398.27 / 16384.02)
whisper_init_state: kv cross size = 245.76 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 32.97 MiB, ( 3431.23 / 16384.02)
whisper_init_state: compute buffer (conv) = 36.26 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 889.44 MiB, ( 4320.67 / 16384.02)
whisper_init_state: compute buffer (encode) = 934.34 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 7.33 MiB, ( 4328.00 / 16384.02)
whisper_init_state: compute buffer (cross) = 9.38 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 197.95 MiB, ( 4525.95 / 16384.02)
whisper_init_state: compute buffer (decode) = 209.26 MB

main: processing 0 samples (step = 0.0 sec / len = 4.0 sec / keep = 0.0 sec), 4 threads, lang = zh, task = transcribe, timestamps = 1 ...
main: using VAD, will transcribe on speech activity

[Start speaking]
```
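On the "do I need more RAM" part of the question, the allocation lines in the log above can be tallied with a short script. This is only a sketch: it assumes the parenthesized pair on each `allocated buffer` line is the running total followed by the working-set limit in MiB (the second figure, 16384.02, matches the reported `recommendedMaxWorkingSetSize` of 17179.89 MB once converted to MiB). The names `LOG`, `PATTERN`, and `summarize` are hypothetical and not part of whisper.cpp.

```python
import re

# Allocation lines copied from the log above (Metal backend, Apple M2).
LOG = """\
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 2951.02 MiB, ( 2952.89 / 16384.02)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 210.00 MiB, ( 3163.89 / 16384.02)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 234.38 MiB, ( 3398.27 / 16384.02)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 32.97 MiB, ( 3431.23 / 16384.02)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 889.44 MiB, ( 4320.67 / 16384.02)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 7.33 MiB, ( 4328.00 / 16384.02)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 197.95 MiB, ( 4525.95 / 16384.02)
"""

# Captures: buffer size, running total, limit (assumed meaning of the pair).
PATTERN = re.compile(r"size =\s*([\d.]+) MiB, \(\s*([\d.]+) /\s*([\d.]+)\)")


def summarize(log: str):
    """Return (sum of buffer sizes, last running total, limit), all in MiB."""
    sizes = []
    used = limit = 0.0
    for match in PATTERN.finditer(log):
        sizes.append(float(match.group(1)))
        # Keep the most recent running total and limit seen in the log.
        used, limit = float(match.group(2)), float(match.group(3))
    return sum(sizes), used, limit


total, used, limit = summarize(LOG)
print(f"buffers: {total:.2f} MiB, in use: {used:.2f} / {limit:.2f} MiB ({used / limit:.0%})")
```

If that reading of the log is right, the large-v3 run sits well under the limit the Metal backend reports, which suggests memory is not the constraint here. It says nothing about why speech is being missed.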

jensdraht1999 commented 2 months ago

@genglinxiao I think the large v3 model has some kind of bug that keeps it from working properly and that can't be fixed. The medium model should be okay, but it is far from perfect and never received a v2 version, so its performance won't be great either. I would try large v1 and v2. Please close this issue.

genglinxiao commented 2 months ago

Thanks. I'm closing this issue now.

ggerganov / whisper.cpp, issue #2157