theodorDiaconu closed this issue 1 year ago
$ ./main -m ~/Downloads/guanaco-7B.ggmlv3.q5_1.bin -p "I believe the meaning of life is " --ignore-eos -ngl 1
main: build = 614 (d1f563a)
main: seed = 1685951954
llama.cpp: loading model from /Users/theodor/Downloads/guanaco-7B.ggmlv3.q5_1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 1979.59 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size = 256.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/theodor/Projects/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x119e75770
ggml_metal_init: loaded kernel_mul 0x119e81dd0
ggml_metal_init: loaded kernel_mul_row 0x119e834a0
ggml_metal_init: loaded kernel_scale 0x119e837b0
ggml_metal_init: loaded kernel_silu 0x119e84020
ggml_metal_init: loaded kernel_relu 0x119e824c0
ggml_metal_init: loaded kernel_soft_max 0x119e850c0
ggml_metal_init: loaded kernel_diag_mask_inf 0x12b11fba0
ggml_metal_init: loaded kernel_get_rows_q4_0 0x12b11f410
ggml_metal_init: loaded kernel_rms_norm 0x12b1201d0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x12b120d40
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x12b121eb0
ggml_metal_init: loaded kernel_rope 0x12b0045d0
ggml_metal_init: loaded kernel_cpy_f32_f16 0x12b005710
ggml_metal_init: loaded kernel_cpy_f32_f32 0x12b005d40
ggml_metal_add_buffer: allocated 'data ' buffer, size = 4820.95 MB
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 768.00 MB
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 258.00 MB
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 512.00 MB
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
I believe the meaning of life is 4GGML_ASSERT: ggml-metal.m:539: false && "not implemented"
[1] 8289 abort ./main -m ~/Downloads/guanaco-7B.ggmlv3.q5_1.bin -p --ignore-eos -ngl 1
This is the build:
$ LLAMA_METAL=1 make
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL
I LDFLAGS: -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL -c llama.cpp -o llama.o
llama.cpp:1108:19: warning: unused variable 'n_gpu' [-Wunused-variable]
    const int n_gpu = std::min(n_gpu_layers, int(hparams.n_layer));
                  ^
1 warning generated.
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL -c examples/common.cpp -o common.o
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml-metal.m -o ggml-metal.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/main/main.cpp ggml.o llama.o common.o ggml-metal.o -o main -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
==== Run ./main -h for help. ====
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/quantize/quantize.cpp ggml.o llama.o ggml-metal.o -o quantize -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/quantize-stats/quantize-stats.cpp ggml.o llama.o ggml-metal.o -o quantize-stats -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/perplexity/perplexity.cpp ggml.o llama.o common.o ggml-metal.o -o perplexity -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/embedding/embedding.cpp ggml.o llama.o common.o ggml-metal.o -o embedding -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL pocs/vdot/vdot.cpp ggml.o ggml-metal.o -o vdot -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
Because Metal support is only available for Q4_0 for now.
This PR implements support only for Q4_0, but all other quantizations can easily be added in the future.
Missed that. Sry!
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
I believe the meaning of life is toGGML_ASSERT: ggml-metal.m:539: false && "not implemented"
[1] 6869 abort sudo ./main -m ~/Downloads/Wizard-Vicuna-13B-Uncensored.ggmlv3.q5_1.bin -p