Example transcribes to garbage on Intel Mac

pfrancks commented 10 months ago

I followed the instructions in the README, but I am unable to transcribe the jfk example. Instead of the expected text, I get the following as the transcription: I..-.,-?.,,.......,.... and,,- as AING-,.,... of..,.....,...- on a.. the ...?...,.,...-,. a......,.,.......,...,,,-,. of,?-,,.,D..,.-,.,. a......?? of.....,,-,?..,..?,...,,..., (..... H why...,-..,.'s, B-.. of. to-,.....,,..,,..,.?,,...... on,,?,-..-...,. "

./main -nt -f samples/jfk.wav
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load: model ctx     =  140.66 MB
whisper_model_load: model size    =  140.54 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB
whisper_init_state: compute buffer (conv)   =   18.50 MB
whisper_init_state: compute buffer (encode) =   81.95 MB
whisper_init_state: compute buffer (cross)  =    4.49 MB
whisper_init_state: compute buffer (decode) =   24.70 MB
whisper_init_state: Metal context initialized
whisper_init_state: max tensor size =    50.65 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 1 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 0 ...

 I..-.,-?.,,.......,.... and,,- as AING-,.,... of..,.....,...- on a.. the ...?...,.,...-,. a......,.,.......,...,,,-,. of,?-,,.,D..,.-,.,. a......?? of.....,,-,?..,..?,...,,..., (..... H why...,-..,.'s, B-.. of. to-,.....,,..,,..,.?,,...... on,,?,-..-...,. "

whisper_print_timings:     load time =   237.15 ms
whisper_print_timings:     fallbacks =   2 p /   0 h
whisper_print_timings:      mel time =    28.72 ms
whisper_print_timings:   sample time =  1480.31 ms /  1100 runs (    1.35 ms per run)
whisper_print_timings:   encode time =  3316.78 ms /     1 runs ( 3316.78 ms per run)
whisper_print_timings:   decode time =  9195.51 ms /  1098 runs (    8.37 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 14418.79 ms

Here is the build log:

make
sysctl: unknown oid 'hw.optional.arm64'
I whisper.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  i386
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_ACCELERATE -DGGML_USE_METAL
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_METAL
I LDFLAGS:   -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
I CC:       Apple clang version 15.0.0 (clang-1500.0.40.1)
I CXX:      Apple clang version 15.0.0 (clang-1500.0.40.1)

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_ACCELERATE -DGGML_USE_METAL   -c ggml.c -o ggml.o
cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_ACCELERATE -DGGML_USE_METAL   -c ggml-alloc.c -o ggml-alloc.o
cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_ACCELERATE -DGGML_USE_METAL   -c ggml-backend.c -o ggml-backend.o
cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_ACCELERATE -DGGML_USE_METAL   -c ggml-quants.c -o ggml-quants.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_METAL -c whisper.cpp -o whisper.o
cc -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_ACCELERATE -DGGML_USE_METAL -c ggml-metal.m -o ggml-metal.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_METAL examples/main/main.cpp examples/common.cpp examples/common-ggml.cpp ggml.o ggml-alloc.o ggml-backend.o ggml-quants.o whisper.o ggml-metal.o -o main  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_METAL examples/bench/bench.cpp ggml.o ggml-alloc.o ggml-backend.o ggml-quants.o whisper.o ggml-metal.o -o bench  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -pthread -mavx -mavx2 -mfma -mf16c -msse3 -mssse3 -DGGML_USE_METAL examples/quantize/quantize.cpp examples/common.cpp examples/common-ggml.cpp ggml.o ggml-alloc.o ggml-backend.o ggml-quants.o whisper.o ggml-metal.o -o quantize  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
./main -h

usage: ./main [options] file0.wav file1.wav ...

options:
  -h,        --help              [default] show this help message and exit
  -t N,      --threads N         [4      ] number of threads to use during computation
  -p N,      --processors N      [1      ] number of processors to use during computation
  -ot N,     --offset-t N        [0      ] time offset in milliseconds
  -on N,     --offset-n N        [0      ] segment index offset
  -d  N,     --duration N        [0      ] duration of audio to process in milliseconds
  -mc N,     --max-context N     [-1     ] maximum number of text context tokens to store
  -ml N,     --max-len N         [0      ] maximum segment length in characters
  -sow,      --split-on-word     [false  ] split on word rather than on token
  -bo N,     --best-of N         [2      ] number of best candidates to keep
  -bs N,     --beam-size N       [-1     ] beam size for beam search
  -wt N,     --word-thold N      [0.01   ] word timestamp probability threshold
  -et N,     --entropy-thold N   [2.40   ] entropy threshold for decoder fail
  -lpt N,    --logprob-thold N   [-1.00  ] log probability threshold for decoder fail
  -debug,    --debug-mode        [false  ] enable debug mode (eg. dump log_mel)
  -tr,       --translate         [false  ] translate from source language to english
  -di,       --diarize           [false  ] stereo audio diarization
  -tdrz,     --tinydiarize       [false  ] enable tinydiarize (requires a tdrz model)
  -nf,       --no-fallback       [false  ] do not use temperature fallback while decoding
  -otxt,     --output-txt        [false  ] output result in a text file
  -ovtt,     --output-vtt        [false  ] output result in a vtt file
  -osrt,     --output-srt        [false  ] output result in a srt file
  -olrc,     --output-lrc        [false  ] output result in a lrc file
  -owts,     --output-words      [false  ] output script for generating karaoke video
  -fp,       --font-path         [/System/Library/Fonts/Supplemental/Courier New Bold.ttf] path to a monospace font for karaoke video
  -ocsv,     --output-csv        [false  ] output result in a CSV file
  -oj,       --output-json       [false  ] output result in a JSON file
  -ojf,      --output-json-full  [false  ] include more information in the JSON file
  -of FNAME, --output-file FNAME [       ] output file path (without file extension)
  -ps,       --print-special     [false  ] print special tokens
  -pc,       --print-colors      [false  ] print colors
  -pp,       --print-progress    [false  ] print progress
  -nt,       --no-timestamps     [false  ] do not print timestamps
  -l LANG,   --language LANG     [en     ] spoken language ('auto' for auto-detect)
  -dl,       --detect-language   [false  ] exit after automatically detecting language
             --prompt PROMPT     [       ] initial prompt
  -m FNAME,  --model FNAME       [models/ggml-base.en.bin] model path
  -f FNAME,  --file FNAME        [       ] input WAV file path
  -oved D,   --ov-e-device DNAME [CPU    ] the OpenVINO device used for encode inference
  -ls,       --log-score         [false  ] log best decoder scores of tokens
  -ng,       --no-gpu            [false  ] disable GPU

bobqianic commented 10 months ago

Could you test what happens if you remove the -nt flag?

ggerganov commented 10 months ago

On Intel Macs, you have to run without the GPU. Try adding the -ng flag.

pfrancks commented 10 months ago

I tried both -ng and -nt, and neither helped. However, I think this must be related to Metal: whisper.cpp works (though very slowly) if I build with

 WHISPER_NO_METAL=1 make

ggerganov commented 10 months ago

Hm, the -ng option should be equivalent to building with WHISPER_NO_METAL=1

It works on my 2013 Intel MacBook

aureleq commented 10 months ago

@pfrancks The -ng flag also works fine on my 2020 Intel MacBook Pro:

./main -ng -m models/ggml-base.en.bin -f samples/jfk.wav
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU buffer size =   149.41 MB
whisper_model_load: model size    =  149.32 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB
whisper_init_state: compute buffer (conv)   =   14.11 MB
whisper_init_state: compute buffer (encode) =   81.95 MB
whisper_init_state: compute buffer (cross)  =    4.49 MB
whisper_init_state: compute buffer (decode) =   24.70 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 1 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
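
The -ng workaround discussed in this thread could be wrapped in a small shell helper. This is only a rough sketch: the function name whisper_gpu_flag is hypothetical, and the rule "x86_64 means pass -ng" is an assumption taken from the comments above, not from the whisper.cpp documentation. It uses the machine name reported by uname -m (x86_64 in the build log above) to decide whether to disable the GPU.

```shell
#!/bin/sh
# Hypothetical helper: print the GPU-related flag to pass to ./main
# for a given machine name (as reported by `uname -m`).
# Assumption from this thread: Intel (x86_64) Macs should run with -ng.
whisper_gpu_flag() {
  case "$1" in
    x86_64) printf '%s\n' "-ng" ;;
    *)      printf '%s\n' "" ;;
  esac
}

flag="$(whisper_gpu_flag "$(uname -m)")"
# Print the command instead of running it, so it can be reviewed first.
# $flag is intentionally unquoted so an empty value adds no argument.
echo ./main $flag -m models/ggml-base.en.bin -f samples/jfk.wav
```

If -ng makes no difference, as pfrancks reported, the other option mentioned in the thread is to rebuild with WHISPER_NO_METAL=1 make.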

ggerganov / whisper.cpp

Example transcribes to garbage on Intel Mac #1481