Closed: leedrake5 closed this issue 1 year ago
I was able to patch this locally:
# Load the library
def _load_shared_library(lib_base_name: str):
# Determine the file extension based on the platform
if sys.platform.startswith("linux"):
lib_ext = ".so"
elif sys.platform == "darwin":
lib_ext = ".dylib" # <<< Was also ".so"
However, I still seem to get crashes trying to load models subsequently:
llama_model_load_internal: mem required = 2532.67 MB (+ 3124.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 3120.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '(null)'
ggml_metal_init: error: Error Domain=NSCocoaErrorDomain Code=258 "The file name is invalid."
/Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/arrow/cpp/src/arrow/filesystem/s3fs.cc:2598: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit
Edit: Best I can tell, it's failing to load the Metal shader for some reason, and it seems like that's supposed to be embedded into the dylib somehow?
// read the source from "ggml-metal.metal" into a string and use newLibraryWithSource
{
NSError * error = nil;
//NSString * path = [[NSBundle mainBundle] pathForResource:@"../../examples/metal/metal" ofType:@"metal"];
NSString * path = [[NSBundle mainBundle] pathForResource:@"ggml-metal" ofType:@"metal"];
fprintf(stderr, "%s: loading '%s'\n", __func__, [path UTF8String]);
NSString * src = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:&error];
if (error) {
fprintf(stderr, "%s: error: %s\n", __func__, [[error description] UTF8String]);
exit(1);
}
"Fixed" the second issue by copying llama.cpp/ggml-metal.metal
to the same directory as my python binary!
So it seems like the upstream llama.cpp Makefile and CMakeLists disagree about what the extension of the shared library should be. Per this discussion, you can force libllama to be generated with the .so extension instead of .dylib by adding the MODULE keyword here:
add_library(llama MODULE
llama.cpp
llama.h
llama-util.h
)
Not clear to me if this might negatively impact other platforms, but it's enough to make FORCE_CMAKE builds generate the expected libllama.so, rather than a libllama.dylib that the Python loader has trouble finding.
Thanks @zach-brockway, I can successfully get it to load with this bit:
# Load the library
def _load_shared_library(lib_base_name: str):
# Determine the file extension based on the platform
if sys.platform.startswith("linux"):
lib_ext = ".so"
elif sys.platform == "darwin":
lib_ext = ".dylib" # <<< Was also ".so"
This goes into llama_cpp.py in the site-packages folder. However, it still only uses the CPU, not the GPU, even when I copy llama.cpp/ggml-metal.metal to site-packages/llama_cpp. I suspect it's because I don't know where this is supposed to go:
// read the source from "ggml-metal.metal" into a string and use newLibraryWithSource
{
NSError * error = nil;
//NSString * path = [[NSBundle mainBundle] pathForResource:@"../../examples/metal/metal" ofType:@"metal"];
NSString * path = [[NSBundle mainBundle] pathForResource:@"ggml-metal" ofType:@"metal"];
fprintf(stderr, "%s: loading '%s'\n", __func__, [path UTF8String]);
NSString * src = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:&error];
if (error) {
fprintf(stderr, "%s: error: %s\n", __func__, [[error description] UTF8String]);
exit(1);
}
Also, where does this go?
add_library(llama MODULE
llama.cpp
llama.h
llama-util.h
)
That was a modification I had to make to vendor/llama.cpp/CMakeLists.txt (line 412). The workarounds are starting to pile up at every level! But I'm pretty sure llama.cpp will in fact want to take a fix for this upstream. Either their static Makefile should output .dylib, or their CMakeLists.txt should output .so; no good reason for the current discrepancy.
Many thanks, but I'm worried this may be a dead end. If I use this version in CMakeLists.txt:
add_library(llama MODULE
llama.cpp
llama.h
llama-util.h
)
I get this error:
CMake Error at CMakeLists.txt:418 (target_include_directories):
Cannot specify include directories for target "llama" which is not built by
this project.
CMake Error at tests/CMakeLists.txt:4 (target_link_libraries):
Target "llama" of type MODULE_LIBRARY may not be linked into another
target. One may link only to INTERFACE, OBJECT, STATIC or SHARED
libraries, or to executables with the ENABLE_EXPORTS property set.
Call Stack (most recent call first):
tests/CMakeLists.txt:9 (llama_add_test)
If I try to make it more explicit:
add_library(llama.so
llama.cpp
llama.h
llama-util.h
)
I get the same error. I really appreciate your help trying to work around this, but I think you are right, this needs to happen upstream. It works fine from the command line, but interfacing with it through the Python package is very difficult.
Sorry for the slow reply, I should be able to get access to an M1 tonight and get this sorted, cheers.
@abetlen sounds awesome. Please let me know if you're having issues and I'll let you ssh into one of mine :)
@abetlen hey any updates? This would be an amazing update!
"Fixed" the second issue by copying
llama.cpp/ggml-metal.metal
to the same directory as my python binary!
I tried copying ggml-metal.metal to multiple locations but still got the "file name is invalid" error. Eventually, I "fixed" it by hardcoding the absolute path in vendor/llama.cpp/ggml-metal.m around line 101:
//NSString * path = [[NSBundle mainBundle] pathForResource:@"ggml-metal" ofType:@"metal"];
NSString * path = @"/path/to/vendor/llama.cpp/ggml-metal.metal";
Then I recompiled it.
I added an option to llama_cpp.py to accept both .so and .dylib extensions on macOS.
"Fixed" the second issue by copying
llama.cpp/ggml-metal.metal
to the same directory as my python binary!
@zach-brockway can you expand on this?
@abetlen how should we install llama-cpp-python to make it work with LLAMA_METAL?
Sure! So the ggml_metal_init errors I was receiving when attempting to load a model (loading '(null)' / error: Error Domain=NSCocoaErrorDomain Code=258 "The file name is invalid.") turned out to be attributable to the llama.cpp code I quoted in the edit to my first comment, where it tries to locate the ggml-metal.metal shader file using NSBundle pathForResource:ofType:.
To work around this, I ended up running the equivalent of the following command: cp vendor/llama.cpp/ggml-metal.metal $(dirname $(which python)) (the destination, in my case, was something like /opt/homebrew/Caskroom/miniconda/base/envs/mycondaenv/bin).
It seems like better upstream fixes might be something like having the shared library look alongside where it's located on disk, or ideally even embedding the shader into the dylib at build time somehow (since if the compiled code and shader get out of sync, that can also cause crashes).
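Until there is an upstream fix, a pre-flight check along these lines can at least turn the cryptic NSCocoaErrorDomain error into an actionable message (a sketch; it assumes, per the workaround above, that the shader lookup effectively resolves next to the running Python executable):
import pathlib
import sys

# Where the shader lookup appears to point for a Python process:
# the directory containing the (real) python executable.
expected = pathlib.Path(sys.executable).resolve().parent / "ggml-metal.metal"
if not expected.exists():
    print(f"Warning: ggml-metal.metal not found at {expected}; "
          "Metal init will likely fail with 'The file name is invalid.'")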
"Fixed" the second issue by copying
llama.cpp/ggml-metal.metal
to the same directory as my python binary!@zach-brockway can you expand on this?
Sure! So the
ggml_metal_init
errors I was receiving when attempting to load a model (loading '(null)' / error: Error Domain=NSCocoaErrorDomain Code=258 "The file name is invalid."
) turned out to be attributable to the llama.cpp code I quoted in the edit to my first comment, where it tries to locate theggml-metal.metal
shader file usingNSBundle pathForResource:ofType:
.To work around this, I ended up running the equivalent of the following command:
cp vendor/llama.cpp/ggml-metal.metal $(dirname $(which python))
(the destination, in my case, was something like/opt/homebrew/Caskroom/miniconda/base/envs/mycondaenv/bin
).It seems like better upstream fixes might be something like having the shared library look alongside where it's located on disk, or ideally even embedding the shader into the dylib at build time somehow (since if the compiled code and shader get out of sync, that can also cause crashes).
Thanks, I tried to do the same. In my case my python is located at venv/bin/python, so I copied ggml-metal.metal to venv/bin. It didn't work though. The only way I could make it work was to hardcode the NSString path in ggml-metal.m.
venv is a special case I think; the bin directory just contains symlinks to the underlying Python distribution that was active at the time you created the environment:
$ ls -lha python*
lrwxr-xr-x 1 zach staff 7B May 21 22:33 python -> python3
lrwxr-xr-x 1 zach staff 49B May 21 22:33 python3 -> /opt/homebrew/Caskroom/miniconda/base/bin/python3
lrwxr-xr-x 1 zach staff 7B May 21 22:33 python3.9 -> python3
A workaround might be to use cp vendor/llama.cpp/ggml-metal.metal $(dirname $(realpath $(which python))), where realpath resolves the symlink first.
@zach-brockway I think you're right that this requires a change to how llama.cpp is built as a shared library. I'll try to work on a PR for that but I only have remote access to a Mac so if anyone else is a better cmake ninja and has a mac in front of them I would really appreciate the help.
Trying to build as a shared object as part of another project yields this result. Best to ignore the problem with Python and focus on the core issue. I'll see if I can wrangle a fix.
@abetlen I've spent an hour or so trying different build variants to isolate (and fix?) this issue... so far without success. I have had llama-cpp-python working a couple of times, but haven't yet isolated a reproducible/working install process for macOS Metal.
Just pushed v0.1.62 that includes Metal support, let me know if that works!
@abetlen: installed with CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python and it works:
INFO:Loading 7B...
INFO:llama.cpp weights detected: models/7B/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin
INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models/7B/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size = 1024.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/wojtek/Documents/text-generation-webui/venv/lib/python3.10/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x16a366ec0
ggml_metal_init: loaded kernel_mul 0x16a3674f0
ggml_metal_init: loaded kernel_mul_row 0x16a3678f0
ggml_metal_init: loaded kernel_scale 0x16a367cf0
ggml_metal_init: loaded kernel_silu 0x16a3680f0
ggml_metal_init: loaded kernel_relu 0x16a3684f0
ggml_metal_init: loaded kernel_gelu 0x16a3688f0
ggml_metal_init: loaded kernel_soft_max 0x16a368e80
ggml_metal_init: loaded kernel_diag_mask_inf 0x16a369280
ggml_metal_init: loaded kernel_get_rows_f16 0x1209d5250
ggml_metal_init: loaded kernel_get_rows_q4_0 0x1209ecd30
ggml_metal_init: loaded kernel_get_rows_q4_1 0x1209edb70
ggml_metal_init: loaded kernel_get_rows_q2_k 0x1209ee300
ggml_metal_init: loaded kernel_get_rows_q4_k 0x16a369680
ggml_metal_init: loaded kernel_get_rows_q6_k 0x16a369be0
ggml_metal_init: loaded kernel_rms_norm 0x16a36a170
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x16a36a8b0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x16a36aff0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x1209ee8a0
ggml_metal_init: loaded kernel_mul_mat_q2_k_f32 0x1209ef0d0
ggml_metal_init: loaded kernel_mul_mat_q4_k_f32 0x1209ef670
ggml_metal_init: loaded kernel_mul_mat_q6_k_f32 0x16a36b730
ggml_metal_init: loaded kernel_rope 0x16a36bf00
ggml_metal_init: loaded kernel_cpy_f32_f16 0x16a36c7f0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x1209efc40
ggml_metal_add_buffer: allocated 'data ' buffer, size = 3616.08 MB
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 768.00 MB
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1026.00 MB
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 512.00 MB
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
INFO:Loaded the model in 1.04 seconds.
llama_print_timings: load time = 6887.87 ms
llama_print_timings: sample time = 133.05 ms / 77 runs ( 1.73 ms per token)
llama_print_timings: prompt eval time = 6887.83 ms / 16 tokens ( 430.49 ms per token)
llama_print_timings: eval time = 8762.61 ms / 76 runs ( 115.30 ms per token)
llama_print_timings: total time = 16282.09 ms
Output generated in 16.53 seconds (4.60 tokens/s, 76 tokens, context 16, seed 1703054888)
Llama.generate: prefix-match hit
llama_print_timings: load time = 6887.87 ms
llama_print_timings: sample time = 229.93 ms / 77 runs ( 2.99 ms per token)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token)
llama_print_timings: eval time = 7226.16 ms / 77 runs ( 93.85 ms per token)
llama_print_timings: total time = 8139.00 ms
Output generated in 8.44 seconds (9.01 tokens/s, 76 tokens, context 16, seed 1286945878)
Llama.generate: prefix-match hit
llama_print_timings: load time = 6887.87 ms
llama_print_timings: sample time = 133.86 ms / 77 runs ( 1.74 ms per token)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token)
llama_print_timings: eval time = 7927.91 ms / 77 runs ( 102.96 ms per token)
llama_print_timings: total time = 8573.75 ms
Output generated in 8.84 seconds (8.60 tokens/s, 76 tokens, context 16, seed 708232749)
@abetlen running CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python inside a virtual environment or inside a conda environment doesn't solve the problem - the model still only uses the CPU:
llama.cpp: loading model from /Users/peter/_Git/_GPT/llama.cpp/models/wizardLM-7B.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 5407.72 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size = 256.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
llama_print_timings: load time = 634.15 ms
llama_print_timings: sample time = 229.50 ms / 333 runs ( 0.69 ms per token)
llama_print_timings: prompt eval time = 634.07 ms / 11 tokens ( 57.64 ms per token)
llama_print_timings: eval time = 13948.15 ms / 332 runs ( 42.01 ms per token)
llama_print_timings: total time = 16233.21 ms
Have you updated the library with the latest changes?
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
@ianscrivener Yes, I've updated the library.
The solution was to pass n_gpu_layers=1 into the constructor:
Llama(model_path=llama_path, n_gpu_layers=1)
Without that the model doesn't use GPU. Sorry for the false alarm.
Great... working beautifully now. 🤙 Good work all!! 🏆 Many thanks 🙏
Set n_gpu_layers=1000 to move all LLM layers to the GPU. Only reduce this number to less than the number of layers the LLM has if you are running low on GPU memory.
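Putting the two tips above together, a minimal sketch (the model path, prompt, and settings here are placeholders, not a definitive configuration):
from llama_cpp import Llama

# Offload as many layers as possible to the Metal GPU; reduce n_gpu_layers
# only if GPU memory runs low.
llm = Llama(
    model_path="./llama-2-13b-chat.ggmlv3.q4_0.bin",  # placeholder path
    n_ctx=2048,
    n_gpu_layers=1000,
)
out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])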
I see that MPS is being used:
llama_init_from_file: kv self size = 6093.75 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '~/.pyenv/versions/mambaforge/envs/gptwizards/lib/python3.10/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x12ee8d010
ggml_metal_init: loaded kernel_mul 0x12ee8d270
ggml_metal_init: loaded kernel_mul_row 0x12ee8d4d0
ggml_metal_init: loaded kernel_scale 0x12ee8d730
ggml_metal_init: loaded kernel_silu 0x12ee8d990
ggml_metal_init: loaded kernel_relu 0x12ee8dbf0
ggml_metal_init: loaded kernel_gelu 0x12ee8de50
ggml_metal_init: loaded kernel_soft_max 0x12ee8e0b0
ggml_metal_init: loaded kernel_diag_mask_inf 0x12ee8e310
ggml_metal_init: loaded kernel_get_rows_f16 0x12ee8e570
ggml_metal_init: loaded kernel_get_rows_q4_0 0x12ee8e7d0
ggml_metal_init: loaded kernel_get_rows_q4_1 0x12ee8ea30
ggml_metal_init: loaded kernel_get_rows_q2_k 0x12ee8ec90
ggml_metal_init: loaded kernel_get_rows_q3_k 0x12ee8eef0
ggml_metal_init: loaded kernel_get_rows_q4_k 0x12ee8f150
ggml_metal_init: loaded kernel_get_rows_q5_k 0x12ee8f3b0
ggml_metal_init: loaded kernel_get_rows_q6_k 0x12ee8f610
ggml_metal_init: loaded kernel_rms_norm 0x12ee8f870
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x12ee8fe10
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x14b337050
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x14b337470
ggml_metal_init: loaded kernel_mul_mat_q2_k_f32 0x14b337890
ggml_metal_init: loaded kernel_mul_mat_q3_k_f32 0x14b337cd0
ggml_metal_init: loaded kernel_mul_mat_q4_k_f32 0x14b338390
ggml_metal_init: loaded kernel_mul_mat_q5_k_f32 0x14b3388d0
ggml_metal_init: loaded kernel_mul_mat_q6_k_f32 0x14b338e10
ggml_metal_init: loaded kernel_rope 0x14b339560
ggml_metal_init: loaded kernel_cpy_f32_f16 0x14b339e50
ggml_metal_init: loaded kernel_cpy_f32_f32 0x14b33a540
ggml_metal_add_buffer: allocated 'data ' buffer, size = 14912.78 MB
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1280.00 MB
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 6095.75 MB
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 512.00 MB
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
However, in Activity Monitor the GPU usage is 0%; can someone advise, please?
My tests with llama2 7B, 13B and 70B models on my Mac M1 with 64GB RAM are here: https://github.com/ggerganov/llama.cpp/issues/2508#issuecomment-1681658567
Summary of results:
- llama-2-7b-chat.ggmlv3.q8_0.bin, llama-2-13b-chat.ggmlv3.q8_0.bin and llama-2-70b-chat.ggmlv3.q4_0.bin work with CPU (do not forget the parameter n_gqa = 8 for the 70B model).
- llama-2-7b-chat.ggmlv3.q4_0.bin and llama-2-13b-chat.ggmlv3.q4_0.bin work with GPU Metal.
- llama-2-13b-chat.ggmlv3.q8_0.bin and llama-2-70b-chat.ggmlv3.q4_0.bin do not work with GPU.
@karrtikiyer The following code runs on my M1 with 64GB RAM and a 32-core Metal GPU:
Model: llama-2-13b-chat.ggmlv3.q4_0.bin
Installation:
conda create -n llamaM1 python=3.9.16
conda activate llamaM1
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
python testM1llama.py
Working code for M1 metal GPU:
from llama_cpp import Llama
model_path = './llama-2-13b-chat.ggmlv3.q4_0.bin'
lm = Llama(model_path,
n_ctx = 2048,
n_gpu_layers = 600)
output = lm("Provide a Python function that gets input a positive integer and output a list of it prime factors.",
max_tokens = 1000,
stream = True)
for token in output:
print(token['choices'][0]['text'], end='', flush=True)
Code output:
llama.cpp: loading model from ./llama-2-13b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: mem required = 7477.72 MB (+ 1600.00 MB per state)
llama_new_context_with_model: kv self size = 1600.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/miniforge3/envs/llamaM1/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x15503f680
ggml_metal_init: loaded kernel_add_row 0x155041820
ggml_metal_init: loaded kernel_mul 0x123e08390
ggml_metal_init: loaded kernel_mul_row 0x123e089b0
ggml_metal_init: loaded kernel_scale 0x123e09d80
ggml_metal_init: loaded kernel_silu 0x123e0a410
ggml_metal_init: loaded kernel_relu 0x123e092d0
ggml_metal_init: loaded kernel_gelu 0x123e09530
ggml_metal_init: loaded kernel_soft_max 0x123e0b7d0
ggml_metal_init: loaded kernel_diag_mask_inf 0x1551795c0
ggml_metal_init: loaded kernel_get_rows_f16 0x155179980
ggml_metal_init: loaded kernel_get_rows_q4_0 0x15517ae20
ggml_metal_init: loaded kernel_get_rows_q4_1 0x155179be0
ggml_metal_init: loaded kernel_get_rows_q2_K 0x15517be50
ggml_metal_init: loaded kernel_get_rows_q3_K 0x153fa2b90
ggml_metal_init: loaded kernel_get_rows_q4_K 0x153fa3770
ggml_metal_init: loaded kernel_get_rows_q5_K 0x153fa8760
ggml_metal_init: loaded kernel_get_rows_q6_K 0x153fa8e50
ggml_metal_init: loaded kernel_rms_norm 0x153fa9540
ggml_metal_init: loaded kernel_norm 0x153fa9c70
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x153faa400
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x153faac80
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x153fab490
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x153fac590
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x153facd20
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x153fad4e0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x153fadc80
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x153fae9c0
ggml_metal_init: loaded kernel_rope 0x153faef00
ggml_metal_init: loaded kernel_alibi_f32 0x155040e90
ggml_metal_init: loaded kernel_cpy_f32_f16 0x155041d40
ggml_metal_init: loaded kernel_cpy_f32_f32 0x155042310
ggml_metal_init: loaded kernel_cpy_f16_f16 0x1550434c0
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 87.89 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB, ( 6984.52 / 49152.00)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 12.00 MB, ( 6996.52 / 49152.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1602.00 MB, ( 8598.52 / 49152.00)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 290.00 MB, ( 8888.52 / 49152.00)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 192.00 MB, ( 9080.52 / 49152.00)
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
Example:
>>> factor(5)
[1, 5]
>>> factor(25)
[3, 5]
Note: The input integer is always positive, so you can assume that the input is a non-negative integer.
Here are some hints to help you write this function:
* A prime number is a positive integer that is divisible only by itself and 1.
* You can use the built-in `isprime` function from the `math.gcd` module to check if an integer is prime.
* You can use a loop to iterate over the range of possible divisors (2 to n/2, where n is the input integer) and check if each one is a factor.
* If you find a prime factor, you can add it to the list of factors and continue iterating until you have found all the prime factors.
def factor(n):
if n == 1:
return [1]
# recursive case: if n is not 1, find its prime factors and return a list of factors
factors = []
for i in range(2, int(n/2) + 1):
if n % i == 0:
factors.append(i)
n = n // i
while n % i == 0:
factors.append(i)
n = n // i
# check if n is prime, if it is, add it to the list of factors
if not any(x > 1 for x in factors):
factors.append(n)
return factors
This function uses a loop to iterate over the range of possible divisors (2 to n/2) and checks if each one is a factor. If a prime factor is found, it is added to the list of factors and the iteration continues until all prime factors are found. The function also checks if the input integer is prime, and if so, it adds it to the list of factors.
Here's an example of how the function works:
>>> factor(5)
[1, 5]
The function starts by checking if 5 is prime. Since it is not prime (5 % 2 == 0), it iterates over the range of possible divisors (2 to 5/2 + 1 = 3). It finds that 5 is divisible by 3, so it adds 3 to the list of factors and continues iterating until all prime factors are found. The final list of factors is [1, 3, 5].
Note that this function assumes that the input integer is non-negative. If you need to handle negative integers as well, you can modify the function accordingly
llama_print_timings: load time = 5757.52 ms
llama_print_timings: sample time = 1421.70 ms / 566 runs ( 2.51 ms per token, 398.12 tokens per second)
llama_print_timings: prompt eval time = 5757.49 ms / 22 tokens ( 261.70 ms per token, 3.82 tokens per second)
llama_print_timings: eval time = 22935.47 ms / 565 runs ( 40.59 ms per token, 24.63 tokens per second)
llama_print_timings: total time = 32983.57 ms
.
ggml_metal_free: deallocating
Here's the documentation for installing on macOS: https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md
It seems you missed the PyTorch step...
I have a MacBook M1 Pro with 16 GB RAM. I am trying to run the model on the GPU using the line below, and it works fine.
LLAMA_METAL=1 make -j && ./main -m ./models/llama-2-13b-chat.Q4_0.gguf -p "I believe the meaning of life is" --ignore-eos -n 64 -ngl 1
However, when trying to run it using the code below, I received the following error:
from llama_cpp import Llama
model_path = '/Users/asq/llama.cpp/models/llama-2-13b-chat.Q4_0.gguf'
lm = Llama(model_path,
n_ctx = 2048,
n_gpu_layers = 1)
output = lm("Give me a list of famous mathematicians between born from 1800 to 2000.",
max_tokens = 1000,
stream = True)
for token in output:
print(token['choices'][0]['text'], end='', flush=True)
Below is the error; can anyone advise?
^
program_source:2349:66: error: explicit instantiation of 'kernel_mul_mm' does not refer to a function template, variable template, member function, member class, or static data member
template [[host_name("kernel_mul_mm_q2_K_f32")]] kernel mat_mm_t kernel_mul_mm<block_q2_K, QK_NL, dequantize_q2_K>;
^
program_source:2350:66: error: explicit instantiation of 'kernel_mul_mm' does not refer to a function template, variable template, member function, member class, or static data member
template [[host_name("kernel_mul_mm_q3_K_f32")]] kernel mat_mm_t kernel_mul_mm<block_q3_K, QK_NL, dequantize_q3_K>;
^
program_source:2351:66: error: explicit instantiation of 'kernel_mul_mm' does not refer to a function template, variable template, member function, member class, or static data member
template [[host_name("kernel_mul_mm_q4_K_f32")]] kernel mat_mm_t kernel_mul_mm<block_q4_K, QK_NL, dequantize_q4_K>;
^
program_source:2352:66: error: explicit instantiation of 'kernel_mul_mm' does not refer to a function template, variable template, member function, member class, or static data member
template [[host_name("kernel_mul_mm_q5_K_f32")]] kernel mat_mm_t kernel_mul_mm<block_q5_K, QK_NL, dequantize_q5_K>;
^
program_source:2353:66: error: explicit instantiation of 'kernel_mul_mm' does not refer to a function template, variable template, member function, member class, or static data member
template [[host_name("kernel_mul_mm_q6_K_f32")]] kernel mat_mm_t kernel_mul_mm<block_q6_K, QK_NL, dequantize_q6_K>;
^
}
llama_new_context_with_model: ggml_metal_init() failed
Traceback (most recent call last):
File "/Users/asq/Documents/ML/Llama2-Chatbot-main/testMetal.py", line 4, in <module>
lm = Llama(model_path,
File "/Users/asq/opt/anaconda3/envs/llama/lib/python3.9/site-packages/llama_cpp/llama.py", line 350, in __init__
assert self.ctx is not None
AssertionError
@ahmed-man3, I just tested your Python code. Works fine with llama-2-7b-chat.ggmlv3.q6_K.gguf on my MacBook M2 Pro with 16 GB RAM.
Thoughts: (1) Make sure you (force) pull and install the latest llama-cpp-python (and hence llama.cpp), i.e.
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --no-cache-dir llama-cpp-python
pip install 'llama-cpp-python[server]'
(2) Did you download the .gguf model, or convert it yourself? To rule out a problem with the model, I use GGUF models from TheBloke - I have had issues with models from others.
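If it helps, one quick way to confirm which version actually got installed after the forced upgrade (a sketch using only the standard library; the name passed in is the PyPI distribution name):
import importlib.metadata

# Print the installed llama-cpp-python version to confirm the upgrade took effect.
print(importlib.metadata.version("llama-cpp-python"))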
@ianscrivener Thank you for your prompt support. The required installation (1) has been completed as suggested. For the model, yes, it has been downloaded as .gguf from TheBloke using this link: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/tree/main
Unsure - perhaps try:
1. move the gguf next to the python file (I have had issues with absolute paths)
2. see if the 13B model works with CPU only in llama-cpp-python
3. try llama-cpp-python with ctx 1096
4. try a different model - maybe llama-2-7b-chat.ggmlv3.q6_K.gguf
5. try a different python version - I'm using 3.10.12
Very much appreciate your support. It works fine now after applying #1 and #5. Thank you.
Good to hear. 🏆
Why are we giving "LLAMA_METAL=1"?
Previously LLAMA_METAL=1 was required when building for macOS with Metal... but now Metal is enabled by default. "To disable the Metal build at compile time use the LLAMA_NO_METAL=1 flag or the LLAMA_METAL=OFF cmake option."
Thank you so much
The main llama.cpp has been updated to support GPUs on Macs with the following flag (tested on my system):
LLAMA_METAL=1 make -j && ./main -m /Downloads/guanaco-65B.ggmlv3.q4_0.bin -p "I believe the meaning of life is" --ignore-eos -n 64 -ngl 1
It looks like the following flag needs to be added to the CMake options:
CMAKE_ARGS="LLAMA_METAL=1" FORCE_CMAKE=1 pip install -e .
While it appears that it installs successfully, the library cannot be loaded. This happens regardless of whether the GitHub repo or PyPI is used.
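A small diagnostic that may help narrow down the "library cannot be loaded" failure (a sketch; it assumes the build places libllama.* and ggml-metal.metal inside the installed llama_cpp package directory, which the ggml_metal_init log paths earlier in the thread suggest). It locates the package without importing it, so it still runs even when loading the shared library fails:
import importlib.util
import pathlib

spec = importlib.util.find_spec("llama_cpp")
pkg_dir = pathlib.Path(spec.origin).parent
# List the compiled shared library/libraries and check for the Metal shader.
print(sorted(p.name for p in pkg_dir.glob("libllama.*")))
print("ggml-metal.metal present:", (pkg_dir / "ggml-metal.metal").exists())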