ggerganov / llama.cpp

LLM inference in C/C++
MIT License

(M1) Running with -ngl 1 gives error: GGML_ASSERT: ggml-metal.m:539: false && "not implemented" #1697

Closed theodorDiaconu closed 1 year ago

theodorDiaconu commented 1 year ago

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

 I believe the meaning of life is toGGML_ASSERT: ggml-metal.m:539: false && "not implemented"
[1]    6869 abort      sudo ./main -m ~/Downloads/Wizard-Vicuna-13B-Uncensored.ggmlv3.q5_1.bin -p

theodorDiaconu commented 1 year ago
$ ./main -m ~/Downloads/guanaco-7B.ggmlv3.q5_1.bin -p "I believe the meaning of life is " --ignore-eos -ngl 1
main: build = 614 (d1f563a)
main: seed  = 1685951954
llama.cpp: loading model from /Users/theodor/Downloads/guanaco-7B.ggmlv3.q5_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 1979.59 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/theodor/Projects/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x119e75770
ggml_metal_init: loaded kernel_mul                            0x119e81dd0
ggml_metal_init: loaded kernel_mul_row                        0x119e834a0
ggml_metal_init: loaded kernel_scale                          0x119e837b0
ggml_metal_init: loaded kernel_silu                           0x119e84020
ggml_metal_init: loaded kernel_relu                           0x119e824c0
ggml_metal_init: loaded kernel_soft_max                       0x119e850c0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x12b11fba0
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x12b11f410
ggml_metal_init: loaded kernel_rms_norm                       0x12b1201d0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x12b120d40
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x12b121eb0
ggml_metal_init: loaded kernel_rope                           0x12b0045d0
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x12b005710
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x12b005d40
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  4820.95 MB
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =   768.00 MB
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   258.00 MB
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   512.00 MB
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

 I believe the meaning of life is 4GGML_ASSERT: ggml-metal.m:539: false && "not implemented"
[1]    8289 abort      ./main -m ~/Downloads/guanaco-7B.ggmlv3.q5_1.bin -p  --ignore-eos -ngl 1

theodorDiaconu commented 1 year ago

This is the build:

$ LLAMA_METAL=1 make
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL
I LDFLAGS:  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL -c llama.cpp -o llama.o
llama.cpp:1108:19: warning: unused variable 'n_gpu' [-Wunused-variable]
    const int n_gpu = std::min(n_gpu_layers, int(hparams.n_layer));
                  ^
1 warning generated.
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL -c examples/common.cpp -o common.o
cc  -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml-metal.m -o ggml-metal.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/main/main.cpp ggml.o llama.o common.o ggml-metal.o -o main -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders

==== Run ./main -h for help. ====

c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/quantize/quantize.cpp ggml.o llama.o ggml-metal.o -o quantize -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/quantize-stats/quantize-stats.cpp ggml.o llama.o ggml-metal.o -o quantize-stats -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/perplexity/perplexity.cpp ggml.o llama.o common.o ggml-metal.o -o perplexity -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/embedding/embedding.cpp ggml.o llama.o common.o ggml-metal.o -o embedding -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL pocs/vdot/vdot.cpp ggml.o ggml-metal.o -o vdot -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders

ymcui commented 1 year ago

This is because Metal support is currently only available for Q4_0 quantization. From the PR:

"This PR implements support only for Q4_0, but all other quantizations can easily be added in the future"

ref. https://github.com/ggerganov/llama.cpp/pull/1642

theodorDiaconu commented 1 year ago

Missed that. Sorry!
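For anyone else landing here: since only Q4_0 is supported on Metal at this point, either keep inference on the CPU or point -m at a Q4_0 file. A sketch using the paths from this thread (the q4_0 file name is hypothetical):

```shell
# Run fully on CPU (drop -ngl 1) -- the Metal path is never entered:
./main -m ~/Downloads/guanaco-7B.ggmlv3.q5_1.bin -p "I believe the meaning of life is " --ignore-eos

# Or offload to Metal with a Q4_0 model:
./main -m ~/Downloads/guanaco-7B.ggmlv3.q4_0.bin -p "I believe the meaning of life is " --ignore-eos -ngl 1
```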