ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

talk-llama does not recognize quantized llama models (unrecognized tensor type 14) #1085

Open kallewoof opened 1 year ago

kallewoof commented 1 year ago
$ cd ../llama.cpp/
$ ./quantize ../llm/CalderaAI_30B-Lazarus/ggml-model-f16.bin ../llm/CalderaAI_30B-Lazarus/ggml-model-q4_1.bin 3 | tail
main: build = 796 (31cfbb1)
main: quantizing '../llm/CalderaAI_30B-Lazarus/ggml-model-f16.bin' to '../llm/CalderaAI_30B-Lazarus/ggml-model-q4_1.bin' as Q4_1
llama.cpp: loading model from ../llm/CalderaAI_30B-Lazarus/ggml-model-f16.bin
llama.cpp: saving model to ../llm/CalderaAI_30B-Lazarus/ggml-model-q4_1.bin
[ 540/ 543]     layers.59.feed_forward.w1.weight -     6656 x 17920, type =    f16, quantizing .. size =   227.50 MB ->    71.09 MB | hist: 0.040 0.025 0.037 0.051 0.067 0.083 0.095 0.102 0.102 0.095 0.082 0.067 0.051 0.037 0.025 0.040
[ 541/ 543]     layers.59.feed_forward.w2.weight -    17920 x  6656, type =    f16, quantizing .. size =   227.50 MB ->    71.09 MB | hist: 0.040 0.024 0.036 0.050 0.067 0.083 0.096 0.104 0.104 0.096 0.083 0.067 0.050 0.036 0.025 0.040
[ 542/ 543]     layers.59.feed_forward.w3.weight -     6656 x 17920, type =    f16, quantizing .. size =   227.50 MB ->    71.09 MB | hist: 0.040 0.025 0.037 0.051 0.067 0.083 0.095 0.102 0.102 0.095 0.083 0.067 0.051 0.037 0.025 0.040
[ 543/ 543]            layers.59.ffn_norm.weight -             6656, type =    f32, size =    0.025 MB
llama_model_quantize_internal: model size  = 62045.57 MB
llama_model_quantize_internal: quant size  = 19431.03 MB
llama_model_quantize_internal: hist: 0.040 0.025 0.037 0.051 0.067 0.083 0.095 0.102 0.102 0.095 0.082 0.067 0.051 0.037 0.025 0.040

main: quantize time = 48014.85 ms
main:    total time = 48014.85 ms
$ cd ../whisper.cpp
$ ./talk-llama -mw models/ggml-base.en.bin -ml ~/workspace/llm/CalderaAI_30B-Lazarus/ggml-model-q4_1.bin -p "Fluffy" -t 8
whisper_init_from_file_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2
whisper_model_load: mem required  =  310.00 MB (+    6.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx     =  140.66 MB
whisper_model_load: model size    =  140.54 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB
whisper_init_state: loading Core ML model from 'models/ggml-base.en-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded
llama.cpp: loading model from /Users/user/workspace/llm/CalderaAI_30B-Lazarus/ggml-model-q4_1.bin
error loading model: unrecognized tensor type 14

llama_init_from_file: failed to load model

main: processing, 8 threads, lang = en, task = transcribe, timestamps = 0 ...

init: found 2 capture devices:
init:    - Capture device #0: 'MacBook Pro Microphone'
init:    - Capture device #1: 'Microsoft Teams Audio'
init: attempt to open default capture device ...
init: obtained spec for input device (SDL Id = 2):
init:     - sample rate:       16000
init:     - format:            33056 (required: 33056)
init:     - channels:          1 (required: 1)
init:     - samples per frame: 1024
[1]    39483 segmentation fault  ./talk-llama -mw models/ggml-base.en.bin -ml  -p "Fluffy" -t 8

This

error loading model: unrecognized tensor type 14

is from

https://github.com/ggerganov/whisper.cpp/blob/4774d2feb01a772a15de81ffc34b34a1f294f020/examples/talk-llama/llama.cpp#L488-L498

despite my passing type 3 (Q4_1) to ./quantize above. Type 14 corresponds to Q4_K_S.

If I use a Q5_K_M model, it complains about tensor type 13, even though Q5_K_M is actually 17, so I think there is a file format issue going on.

Probably unrelated: on a 16-inch MacBook Pro M1 running Ventura, I had to manually modify the c++ line for talk-llama to

c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_DARWIN_C_SOURCE -pthread examples/talk-llama/talk-llama.cpp examples/talk-llama/llama.cpp examples/common.cpp examples/common-ggml.cpp examples/common-sdl.cpp ggml.o whisper.o -o talk-llama `sdl2-config --cflags --libs` -lobjc -framework Cocoa -framework Accelerate -framework CoreML whisper-encoder.o whisper-encoder-impl.o

(i.e. I added -lobjc -framework Cocoa -framework CoreML whisper-encoder.o whisper-encoder-impl.o) in order for it to link. As noted, I don't think this is related to the issue above.

petterreinholdtsen commented 7 months ago

I suspect this issue is related to issue #1186 and would be solved by updating llama.cpp. According to https://github.com/ggerganov/whisper.cpp/commits/master/examples/talk-llama/llama.cpp, the bundled llama.cpp has been updated several times since this issue was reported, so perhaps it is already fixed.
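A quick way to test that suggestion might be (a sketch; the model path is the one from this thread, and the Makefile target name is assumed to still be talk-llama):

```shell
# Pull the latest whisper.cpp (which vendors a newer llama.cpp under
# examples/talk-llama/), rebuild talk-llama, and retry the same model.
git -C whisper.cpp pull origin master
make -C whisper.cpp talk-llama
./whisper.cpp/talk-llama -mw models/ggml-base.en.bin \
    -ml ~/workspace/llm/CalderaAI_30B-Lazarus/ggml-model-q4_1.bin \
    -p "Fluffy" -t 8
```

If the bundled llama.cpp still predates the model's file format, re-quantizing the f16 file with the same llama.cpp revision that talk-llama bundles should also sidestep the mismatch.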