abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

MPS Support - CMAKE_ARGS="LLAMA_METAL=1" #317

Closed leedrake5 closed 1 year ago

leedrake5 commented 1 year ago

The main llama.cpp repo has been updated to support GPUs on Macs with the following flag (tested on my system):

LLAMA_METAL=1 make -j && ./main -m /Downloads/guanaco-65B.ggmlv3.q4_0.bin -p "I believe the meaning of life is" --ignore-eos -n 64 -ngl 1

It looks like the following flag needs to be added to the CMake options:

CMAKE_ARGS="LLAMA_METAL=1" FORCE_CMAKE=1 pip install -e .

While it appears that it installs successfully, the library cannot be loaded.

>>> from llama_cpp import Llama, LlamaCache
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/llama_cpp/__init__.py", line 1, in <module>
    from .llama_cpp import *
  File "/opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/llama_cpp/llama_cpp.py", line 73, in <module>
    _lib = _load_shared_library(_lib_base_name)
  File "/opt/homebrew/Caskroom/mambaforge/base/lib/python3.10/site-packages/llama_cpp/llama_cpp.py", line 64, in _load_shared_library
    raise FileNotFoundError(
FileNotFoundError: Shared library with base name 'llama' not found

This happens regardless of whether I install from the GitHub repo or from PyPI.
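
A quick way to confirm what the wheel actually installed, without triggering the failing import (illustrative snippet; it just lists the package directory):

import glob
import importlib.util
import os

spec = importlib.util.find_spec("llama_cpp")          # locates the package without executing __init__.py
pkg_dir = os.path.dirname(spec.origin)
print(pkg_dir)                                        # where pip put the package
print(glob.glob(os.path.join(pkg_dir, "libllama*")))  # any libllama.so / libllama.dylib that was bundled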

zach-brockway commented 1 year ago

I was able to patch this locally:

# Load the library
def _load_shared_library(lib_base_name: str):
    # Determine the file extension based on the platform
    if sys.platform.startswith("linux"):
        lib_ext = ".so"
    elif sys.platform == "darwin":
        lib_ext = ".dylib" # <<< Was also ".so"

However, I still seem to get crashes trying to load models subsequently:

llama_model_load_internal: mem required  = 2532.67 MB (+ 3124.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  = 3120.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '(null)'
ggml_metal_init: error: Error Domain=NSCocoaErrorDomain Code=258 "The file name is invalid."
/Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/arrow/cpp/src/arrow/filesystem/s3fs.cc:2598:  arrow::fs::FinalizeS3 was not called even though S3 was initialized.  This could lead to a segmentation fault at exit

Edit: Best I can tell, it's failing to load the Metal shader for some reason, and it seems like that's supposed to be embedded into the dylib somehow?

    // read the source from "ggml-metal.metal" into a string and use newLibraryWithSource
    {
        NSError * error = nil;

        //NSString * path = [[NSBundle mainBundle] pathForResource:@"../../examples/metal/metal" ofType:@"metal"];
        NSString * path = [[NSBundle mainBundle] pathForResource:@"ggml-metal" ofType:@"metal"];
        fprintf(stderr, "%s: loading '%s'\n", __func__, [path UTF8String]);

        NSString * src  = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:&error];
        if (error) {
            fprintf(stderr, "%s: error: %s\n", __func__, [[error description] UTF8String]);
            exit(1);
        }
zach-brockway commented 1 year ago

"Fixed" the second issue by copying llama.cpp/ggml-metal.metal to the same directory as my python binary!

zach-brockway commented 1 year ago

So it seems like the upstream llama.cpp Makefile and CMakeLists disagree about what the extension of the shared library should be. Per this discussion, you can force libllama to be generated with the .so extension instead of .dylib by adding the MODULE keyword here:

add_library(llama MODULE
            llama.cpp
            llama.h
            llama-util.h
            )

It's not clear to me whether this might negatively impact other platforms, but it's enough to make FORCE_CMAKE builds generate the expected libllama.so, rather than a libllama.dylib that the Python bindings have trouble finding.

leedrake5 commented 1 year ago

Thanks @zach-brockway, I can successfully get it to load with this bit:

# Load the library
def _load_shared_library(lib_base_name: str):
    # Determine the file extension based on the platform
    if sys.platform.startswith("linux"):
        lib_ext = ".so"
    elif sys.platform == "darwin":
        lib_ext = ".dylib" # <<< Was also ".so"

This goes into llama_cpp.py in the site-packages folder. However, it still only uses the CPU, not the GPU, even when I copy llama.cpp/ggml-metal.metal to site-packages/llama_cpp. I suspect it's because I don't know where this is supposed to go:

    // read the source from "ggml-metal.metal" into a string and use newLibraryWithSource
    {
        NSError * error = nil;

        //NSString * path = [[NSBundle mainBundle] pathForResource:@"../../examples/metal/metal" ofType:@"metal"];
        NSString * path = [[NSBundle mainBundle] pathForResource:@"ggml-metal" ofType:@"metal"];
        fprintf(stderr, "%s: loading '%s'\n", __func__, [path UTF8String]);

        NSString * src  = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:&error];
        if (error) {
            fprintf(stderr, "%s: error: %s\n", __func__, [[error description] UTF8String]);
            exit(1);
        }

Also where does this function go?

add_library(llama MODULE
            llama.cpp
            llama.h
            llama-util.h
            )
zach-brockway commented 1 year ago

Also where does this function go?

add_library(llama MODULE
            llama.cpp
            llama.h
            llama-util.h
            )

That was a modification I had to make to vendor/llama.cpp/CMakeLists.txt (line 412). The workarounds are starting to pile up at every level! But I'm pretty sure llama.cpp will in fact want to take a fix for this upstream. Either their static Makefile should output .dylib, or their CMakeLists.txt should output .so; no good reason for the current discrepancy.

leedrake5 commented 1 year ago

Many thanks, but I'm worried this may be a dead end. If I use this version in CMakeLists.txt:

add_library(llama MODULE
            llama.cpp
            llama.h
            llama-util.h
            )

I get this error:

CMake Error at CMakeLists.txt:418 (target_include_directories):
  Cannot specify include directories for target "llama" which is not built by
  this project.
CMake Error at tests/CMakeLists.txt:4 (target_link_libraries):
  Target "llama" of type MODULE_LIBRARY may not be linked into another
  target.  One may link only to INTERFACE, OBJECT, STATIC or SHARED
  libraries, or to executables with the ENABLE_EXPORTS property set.
Call Stack (most recent call first):
  tests/CMakeLists.txt:9 (llama_add_test)

If I try to make it more explicit:

add_library(llama.so
            llama.cpp
            llama.h
            llama-util.h
            )

I get the same error. I really appreciate your help trying to work around this, but I think you are right: this needs to happen upstream. It works fine from the command line, but interfacing through the Python package makes it very difficult.

abetlen commented 1 year ago

Sorry for the slow reply, I should be able to get access to an M1 tonight and get this sorted, cheers.

mhenrichsen commented 1 year ago

@abetlen sounds awesome. Please let me know if you're having issues and I'll let you ssh into one of mine :)

Jchang4 commented 1 year ago

@abetlen hey any updates? This would be an amazing update!

fungyeung commented 1 year ago

"Fixed" the second issue by copying llama.cpp/ggml-metal.metal to the same directory as my python binary!

I tried copying ggml-metal.metal to multiple locations but still got the "file name is invalid" error. Eventually, I "fixed" it by hardcoding the absolute path in vendor/llama.cpp/ggml-metal.m around line 101:

//NSString * path = [[NSBundle mainBundle] pathForResource:@"ggml-metal" ofType:@"metal"];
NSString * path = @"/path/to/vendor/llama.cpp/ggml-metal.metal";

Then recompile it.

abetlen commented 1 year ago

I added an option to llama_cpp.py to accept both .so and .dylib extensions on macos.
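
Conceptually, the loader now just tries both extensions on macOS. A minimal sketch of the idea (not the exact code that shipped):

import ctypes
import pathlib
import sys

def _load_shared_library(lib_base_name: str):
    if sys.platform.startswith("linux"):
        lib_exts = [".so"]
    elif sys.platform == "darwin":
        # CMake builds produce .dylib, the Makefile produces .so, so accept both
        lib_exts = [".so", ".dylib"]
    else:
        lib_exts = [".dll"]

    base_path = pathlib.Path(__file__).parent
    for ext in lib_exts:
        lib_path = base_path / f"lib{lib_base_name}{ext}"
        if lib_path.exists():
            return ctypes.CDLL(str(lib_path))
    raise FileNotFoundError(f"Shared library with base name '{lib_base_name}' not found")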

"Fixed" the second issue by copying llama.cpp/ggml-metal.metal to the same directory as my python binary!

@zach-brockway can you expand on this?

lucasquinteiro commented 1 year ago

@abetlen how should we install llama-cpp-python to make it work with LLAMA_METAL?

I added an option to llama_cpp.py to accept both .so and .dylib extensions on macos.

zach-brockway commented 1 year ago

"Fixed" the second issue by copying llama.cpp/ggml-metal.metal to the same directory as my python binary!

@zach-brockway can you expand on this?

Sure! So the ggml_metal_init errors I was receiving when attempting to load a model (loading '(null)' / error: Error Domain=NSCocoaErrorDomain Code=258 "The file name is invalid.") turned out to be attributable to the llama.cpp code I quoted in the edit to my first comment, where it tries to locate the ggml-metal.metal shader file using NSBundle pathForResource:ofType:.

To work around this, I ended up running the equivalent of the following command: cp vendor/llama.cpp/ggml-metal.metal $(dirname $(which python)) (the destination, in my case, was something like /opt/homebrew/Caskroom/miniconda/base/envs/mycondaenv/bin).

It seems like better upstream fixes might be something like having the shared library look alongside where it's located on disk, or ideally even embedding the shader into the dylib at build time somehow (since if the compiled code and shader get out of sync, that can also cause crashes).

fungyeung commented 1 year ago

"Fixed" the second issue by copying llama.cpp/ggml-metal.metal to the same directory as my python binary!

@zach-brockway can you expand on this?

Sure! So the ggml_metal_init errors I was receiving when attempting to load a model (loading '(null)' / error: Error Domain=NSCocoaErrorDomain Code=258 "The file name is invalid.") turned out to be attributable to the llama.cpp code I quoted in the edit to my first comment, where it tries to locate the ggml-metal.metal shader file using NSBundle pathForResource:ofType:.

To work around this, I ended up running the equivalent of the following command: cp vendor/llama.cpp/ggml-metal.metal $(dirname $(which python)) (the destination, in my case, was something like /opt/homebrew/Caskroom/miniconda/base/envs/mycondaenv/bin).

It seems like better upstream fixes might be something like having the shared library look alongside where it's located on disk, or ideally even embedding the shader into the dylib at build time somehow (since if the compiled code and shader get out of sync, that can also cause crashes).

Thanks, I tried to do the same. In my case my python is located at venv/bin/python, so I copied ggml-metal.metal to venv/bin. It didn't work, though. The only way I could make it work was to hardcode the NSString path in ggml-metal.m.

zach-brockway commented 1 year ago

Thanks, I tried to do the same. In my case my python is located at venv/bin/python so I copied ggml-metal.metal to venv/bin. It didn't work though. The only way I could make it work is to hardcode the NSString path in ggml-metal.metal.

venv is a special case, I think: the bin directory just contains symlinks to the underlying Python distribution that was active at the time you created the environment:

$ ls -lha python*
lrwxr-xr-x  1 zach  staff     7B May 21 22:33 python -> python3
lrwxr-xr-x  1 zach  staff    49B May 21 22:33 python3 -> /opt/homebrew/Caskroom/miniconda/base/bin/python3
lrwxr-xr-x  1 zach  staff     7B May 21 22:33 python3.9 -> python3

A workaround might be to use cp vendor/llama.cpp/ggml-metal.metal $(dirname $(realpath $(which python))), where realpath resolves the symlink first.
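
If it helps, here is a rough Python equivalent of that copy step, resolving the symlink first (the vendor/llama.cpp path assumes you are running it from a checkout of this repo):

import pathlib
import shutil
import sys

real_python = pathlib.Path(sys.executable).resolve()   # follows the venv symlink chain
shutil.copy("vendor/llama.cpp/ggml-metal.metal", real_python.parent)
print(f"copied shader next to {real_python}")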

abetlen commented 1 year ago

@zach-brockway I think you're right that this requires a change to how llama.cpp is built as a shared library. I'll try to work on a PR for that, but I only have remote access to a Mac, so if anyone else is a better CMake ninja and has a Mac in front of them, I would really appreciate the help.

jacobfriedman commented 1 year ago

Trying to build it as a shared object as part of another project yields this result. Best to ignore the problem with Python and focus on the core issue. I'll see if I can wrangle a fix.

ianscrivener commented 1 year ago

@zach-brockway I think you're right that this requires a change to how llama.cpp is built as a shared library. I'll try to work on a PR for that, but I only have remote access to a Mac, so if anyone else is a better CMake ninja and has a Mac in front of them, I would really appreciate the help.

@abetlen I've spent an hour or so trying different build variants to isolate (and fix?) this issue, so far without success. I have had llama-cpp-python working a couple of times, but haven't yet isolated a reproducible, working install process for macOS Metal.

abetlen commented 1 year ago

Just pushed v0.1.62 that includes Metal support, let me know if that works!

WojtekKowaluk commented 1 year ago

@abetlen: installed with CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python and it works:

INFO:Loading 7B...
INFO:llama.cpp weights detected: models/7B/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin

INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models/7B/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  = 1024.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/wojtek/Documents/text-generation-webui/venv/lib/python3.10/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x16a366ec0
ggml_metal_init: loaded kernel_mul                            0x16a3674f0
ggml_metal_init: loaded kernel_mul_row                        0x16a3678f0
ggml_metal_init: loaded kernel_scale                          0x16a367cf0
ggml_metal_init: loaded kernel_silu                           0x16a3680f0
ggml_metal_init: loaded kernel_relu                           0x16a3684f0
ggml_metal_init: loaded kernel_gelu                           0x16a3688f0
ggml_metal_init: loaded kernel_soft_max                       0x16a368e80
ggml_metal_init: loaded kernel_diag_mask_inf                  0x16a369280
ggml_metal_init: loaded kernel_get_rows_f16                   0x1209d5250
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x1209ecd30
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x1209edb70
ggml_metal_init: loaded kernel_get_rows_q2_k                  0x1209ee300
ggml_metal_init: loaded kernel_get_rows_q4_k                  0x16a369680
ggml_metal_init: loaded kernel_get_rows_q6_k                  0x16a369be0
ggml_metal_init: loaded kernel_rms_norm                       0x16a36a170
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x16a36a8b0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x16a36aff0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x1209ee8a0
ggml_metal_init: loaded kernel_mul_mat_q2_k_f32               0x1209ef0d0
ggml_metal_init: loaded kernel_mul_mat_q4_k_f32               0x1209ef670
ggml_metal_init: loaded kernel_mul_mat_q6_k_f32               0x16a36b730
ggml_metal_init: loaded kernel_rope                           0x16a36bf00
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x16a36c7f0
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x1209efc40
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  3616.08 MB
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =   768.00 MB
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  1026.00 MB
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   512.00 MB
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
INFO:Loaded the model in 1.04 seconds.

llama_print_timings:        load time =  6887.87 ms
llama_print_timings:      sample time =   133.05 ms /    77 runs   (    1.73 ms per token)
llama_print_timings: prompt eval time =  6887.83 ms /    16 tokens (  430.49 ms per token)
llama_print_timings:        eval time =  8762.61 ms /    76 runs   (  115.30 ms per token)
llama_print_timings:       total time = 16282.09 ms
Output generated in 16.53 seconds (4.60 tokens/s, 76 tokens, context 16, seed 1703054888)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  6887.87 ms
llama_print_timings:      sample time =   229.93 ms /    77 runs   (    2.99 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  7226.16 ms /    77 runs   (   93.85 ms per token)
llama_print_timings:       total time =  8139.00 ms
Output generated in 8.44 seconds (9.01 tokens/s, 76 tokens, context 16, seed 1286945878)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  6887.87 ms
llama_print_timings:      sample time =   133.86 ms /    77 runs   (    1.74 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  7927.91 ms /    77 runs   (  102.96 ms per token)
llama_print_timings:       total time =  8573.75 ms
Output generated in 8.84 seconds (8.60 tokens/s, 76 tokens, context 16, seed 708232749)
pgagarinov commented 1 year ago

@abetlen running CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python inside a virtual environment or inside conda environment doesn't solve the problem - the model still only uses CPU:

llama.cpp: loading model from /Users/peter/_Git/_GPT/llama.cpp/models/wizardLM-7B.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.72 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

llama_print_timings:        load time =   634.15 ms
llama_print_timings:      sample time =   229.50 ms /   333 runs   (    0.69 ms per token)
llama_print_timings: prompt eval time =   634.07 ms /    11 tokens (   57.64 ms per token)
llama_print_timings:        eval time = 13948.15 ms /   332 runs   (   42.01 ms per token)
llama_print_timings:       total time = 16233.21 ms
lucasquinteiro commented 1 year ago

@abetlen running CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python inside a virtual environment or inside conda environment doesn't solve the problem - the model still only uses CPU:

llama.cpp: loading model from /Users/peter/_Git/_GPT/llama.cpp/models/wizardLM-7B.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.72 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

llama_print_timings:        load time =   634.15 ms
llama_print_timings:      sample time =   229.50 ms /   333 runs   (    0.69 ms per token)
llama_print_timings: prompt eval time =   634.07 ms /    11 tokens (   57.64 ms per token)
llama_print_timings:        eval time = 13948.15 ms /   332 runs   (   42.01 ms per token)
llama_print_timings:       total time = 16233.21 ms

Have you updated the library with the latest changes?

CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir

pgagarinov commented 1 year ago

@ianscrivener Yes, I've updated the library.

The solution was to pass n_gpu_layers=1 into the constructor:

Llama(model_path=llama_path, n_gpu_layers=1)

Without that the model doesn't use GPU. Sorry for the false alarm.

ianscrivener commented 1 year ago

Great... working beautifully now. 🤙 Good work all!! 🏆 Many thanks 🙏

gjmulder commented 1 year ago

Set n_gpu_layers=1000 to move all LLM layers to the GPU. Only reduce this number to less than the number of layers the LLM has if you are running low on GPU memory.
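
For example (the model path here is just a placeholder):

from llama_cpp import Llama

# A value larger than the model's layer count simply offloads every layer,
# so 1000 effectively means "run the whole model on the GPU".
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", n_gpu_layers=1000)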

karrtikiyer commented 1 year ago

I see that MPS is being used:

llama_init_from_file: kv self size  = 6093.75 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '~/.pyenv/versions/mambaforge/envs/gptwizards/lib/python3.10/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x12ee8d010
ggml_metal_init: loaded kernel_mul                            0x12ee8d270
ggml_metal_init: loaded kernel_mul_row                        0x12ee8d4d0
ggml_metal_init: loaded kernel_scale                          0x12ee8d730
ggml_metal_init: loaded kernel_silu                           0x12ee8d990
ggml_metal_init: loaded kernel_relu                           0x12ee8dbf0
ggml_metal_init: loaded kernel_gelu                           0x12ee8de50
ggml_metal_init: loaded kernel_soft_max                       0x12ee8e0b0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x12ee8e310
ggml_metal_init: loaded kernel_get_rows_f16                   0x12ee8e570
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x12ee8e7d0
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x12ee8ea30
ggml_metal_init: loaded kernel_get_rows_q2_k                  0x12ee8ec90
ggml_metal_init: loaded kernel_get_rows_q3_k                  0x12ee8eef0
ggml_metal_init: loaded kernel_get_rows_q4_k                  0x12ee8f150
ggml_metal_init: loaded kernel_get_rows_q5_k                  0x12ee8f3b0
ggml_metal_init: loaded kernel_get_rows_q6_k                  0x12ee8f610
ggml_metal_init: loaded kernel_rms_norm                       0x12ee8f870
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x12ee8fe10
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x14b337050
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x14b337470
ggml_metal_init: loaded kernel_mul_mat_q2_k_f32               0x14b337890
ggml_metal_init: loaded kernel_mul_mat_q3_k_f32               0x14b337cd0
ggml_metal_init: loaded kernel_mul_mat_q4_k_f32               0x14b338390
ggml_metal_init: loaded kernel_mul_mat_q5_k_f32               0x14b3388d0
ggml_metal_init: loaded kernel_mul_mat_q6_k_f32               0x14b338e10
ggml_metal_init: loaded kernel_rope                           0x14b339560
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x14b339e50
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x14b33a540
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 14912.78 MB
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1280.00 MB
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  6095.75 MB
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   512.00 MB
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

However, in Activity Monitor the GPU usage is 0%. Can someone please advise?

Screenshot 2023-06-16 at 8 26 25 PM
alexshmmy commented 11 months ago

My tests with llama2 7B, 13B, and 70B models on my Mac M1 with 64 GB RAM are here: https://github.com/ggerganov/llama.cpp/issues/2508#issuecomment-1681658567

Summary of results:

alexshmmy commented 11 months ago

@karrtikiyer The following code runs on my M1 with 64 GB RAM and a 32-core Metal GPU:

Screenshot 2023-08-18 at 16 04 46

Model: llama-2-13b-chat.ggmlv3.q4_0.bin

Installation:

conda create -n llamaM1 python=3.9.16
conda activate llamaM1
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
python testM1llama.py

Working code for M1 metal GPU:

from llama_cpp import Llama

model_path = './llama-2-13b-chat.ggmlv3.q4_0.bin'
lm = Llama(model_path,
             n_ctx = 2048,
             n_gpu_layers = 600)

output = lm("Provide a Python function that gets input a positive integer and output a list of it prime factors.",
              max_tokens = 1000, 
              stream = True)

for token in output:
    print(token['choices'][0]['text'], end='', flush=True)

Code output:

llama.cpp: loading model from ./llama-2-13b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.11 MB
llama_model_load_internal: mem required  = 7477.72 MB (+ 1600.00 MB per state)
llama_new_context_with_model: kv self size  = 1600.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/miniforge3/envs/llamaM1/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x15503f680
ggml_metal_init: loaded kernel_add_row                        0x155041820
ggml_metal_init: loaded kernel_mul                            0x123e08390
ggml_metal_init: loaded kernel_mul_row                        0x123e089b0
ggml_metal_init: loaded kernel_scale                          0x123e09d80
ggml_metal_init: loaded kernel_silu                           0x123e0a410
ggml_metal_init: loaded kernel_relu                           0x123e092d0
ggml_metal_init: loaded kernel_gelu                           0x123e09530
ggml_metal_init: loaded kernel_soft_max                       0x123e0b7d0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x1551795c0
ggml_metal_init: loaded kernel_get_rows_f16                   0x155179980
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x15517ae20
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x155179be0
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x15517be50
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x153fa2b90
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x153fa3770
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x153fa8760
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x153fa8e50
ggml_metal_init: loaded kernel_rms_norm                       0x153fa9540
ggml_metal_init: loaded kernel_norm                           0x153fa9c70
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x153faa400
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x153faac80
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x153fab490
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x153fac590
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x153facd20
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x153fad4e0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x153fadc80
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x153fae9c0
ggml_metal_init: loaded kernel_rope                           0x153faef00
ggml_metal_init: loaded kernel_alibi_f32                      0x155040e90
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x155041d40
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x155042310
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x1550434c0
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =    87.89 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  6984.06 MB, ( 6984.52 / 49152.00)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =    12.00 MB, ( 6996.52 / 49152.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  1602.00 MB, ( 8598.52 / 49152.00)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   290.00 MB, ( 8888.52 / 49152.00)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   192.00 MB, ( 9080.52 / 49152.00)
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 

Example:

>>> factor(5)
[1, 5]

>>> factor(25)
[3, 5]

Note: The input integer is always positive, so you can assume that the input is a non-negative integer.

Here are some hints to help you write this function:

* A prime number is a positive integer that is divisible only by itself and 1.
* You can use the built-in `isprime` function from the `math.gcd` module to check if an integer is prime.
* You can use a loop to iterate over the range of possible divisors (2 to n/2, where n is the input integer) and check if each one is a factor.
* If you find a prime factor, you can add it to the list of factors and continue iterating until you have found all the prime factors.

def factor(n):

base case: if n = 1, return [1]

if n == 1:
    return [1]

# recursive case: if n is not 1, find its prime factors and return a list of factors
factors = []
for i in range(2, int(n/2) + 1):
    if n % i == 0:
        factors.append(i)
        n = n // i
        while n % i == 0:
            factors.append(i)
            n = n // i

# check if n is prime, if it is, add it to the list of factors
if not any(x > 1 for x in factors):
    factors.append(n)

return factors

This function uses a loop to iterate over the range of possible divisors (2 to n/2) and checks if each one is a factor. If a prime factor is found, it is added to the list of factors and the iteration continues until all prime factors are found. The function also checks if the input integer is prime, and if so, it adds it to the list of factors.
Here's an example of how the function works:
>>> factor(5)
[1, 5]

The function starts by checking if 5 is prime. Since it is not prime (5 % 2 == 0), it iterates over the range of possible divisors (2 to 5/2 + 1 = 3). It finds that 5 is divisible by 3, so it adds 3 to the list of factors and continues iterating until all prime factors are found. The final list of factors is [1, 3, 5].
Note that this function assumes that the input integer is non-negative. If you need to handle negative integers as well, you can modify the function accordingly

llama_print_timings:        load time =  5757.52 ms
llama_print_timings:      sample time =  1421.70 ms /   566 runs   (    2.51 ms per token,   398.12 tokens per second)
llama_print_timings: prompt eval time =  5757.49 ms /    22 tokens (  261.70 ms per token,     3.82 tokens per second)
llama_print_timings:        eval time = 22935.47 ms /   565 runs   (   40.59 ms per token,    24.63 tokens per second)
llama_print_timings:       total time = 32983.57 ms
.

ggml_metal_free: deallocating
ianscrivener commented 10 months ago

Here's the documentation for installing on macOS: https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md

It seems you missed the PyTorch step...

ahmed-man3 commented 10 months ago

I have a MacBook M1 Pro with 16 GB RAM. I am trying to run the model on the GPU using the command below, and it works fine:

LLAMA_METAL=1 make -j && ./main -m ./models/llama-2-13b-chat.Q4_0.gguf -p "I believe the meaning of life is" --ignore-eos -n 64 -ngl 1

However, when I try to run it using the code below, I receive the error shown further down.

from llama_cpp import Llama

model_path = '/Users/asq/llama.cpp/models/llama-2-13b-chat.Q4_0.gguf'
lm = Llama(model_path,
             n_ctx = 2048,
             n_gpu_layers = 1)

output = lm("Give me a list of famous mathematicians between born from 1800 to 2000.",
              max_tokens = 1000, 
              stream = True)

for token in output:
    print(token['choices'][0]['text'], end='', flush=True)

Below is the error. Can anyone advise?

                                                                 ^
program_source:2349:66: error: explicit instantiation of 'kernel_mul_mm' does not refer to a function template, variable template, member function, member class, or static data member
template [[host_name("kernel_mul_mm_q2_K_f32")]] kernel mat_mm_t kernel_mul_mm<block_q2_K, QK_NL, dequantize_q2_K>;
                                                                 ^
program_source:2350:66: error: explicit instantiation of 'kernel_mul_mm' does not refer to a function template, variable template, member function, member class, or static data member
template [[host_name("kernel_mul_mm_q3_K_f32")]] kernel mat_mm_t kernel_mul_mm<block_q3_K, QK_NL, dequantize_q3_K>;
                                                                 ^
program_source:2351:66: error: explicit instantiation of 'kernel_mul_mm' does not refer to a function template, variable template, member function, member class, or static data member
template [[host_name("kernel_mul_mm_q4_K_f32")]] kernel mat_mm_t kernel_mul_mm<block_q4_K, QK_NL, dequantize_q4_K>;
                                                                 ^
program_source:2352:66: error: explicit instantiation of 'kernel_mul_mm' does not refer to a function template, variable template, member function, member class, or static data member
template [[host_name("kernel_mul_mm_q5_K_f32")]] kernel mat_mm_t kernel_mul_mm<block_q5_K, QK_NL, dequantize_q5_K>;
                                                                 ^
program_source:2353:66: error: explicit instantiation of 'kernel_mul_mm' does not refer to a function template, variable template, member function, member class, or static data member
template [[host_name("kernel_mul_mm_q6_K_f32")]] kernel mat_mm_t kernel_mul_mm<block_q6_K, QK_NL, dequantize_q6_K>;
                                                                 ^
}
llama_new_context_with_model: ggml_metal_init() failed
Traceback (most recent call last):
  File "/Users/asq/Documents/ML/Llama2-Chatbot-main/testMetal.py", line 4, in <module>
    lm = Llama(model_path,
  File "/Users/asq/opt/anaconda3/envs/llama/lib/python3.9/site-packages/llama_cpp/llama.py", line 350, in __init__
    assert self.ctx is not None
AssertionError
ianscrivener commented 10 months ago

@ahmed-man3, I just tested your Python code. It works fine with llama-2-7b-chat.ggmlv3.q6_K.gguf on my MacBook M2 Pro with 16 GB RAM.

Thoughts: (1) Make sure you (force) pull and install the latest llama-cpp-python (and hence llama.cpp), i.e.

CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --no-cache-dir llama-cpp-python
pip install 'llama-cpp-python[server]'

(2) Did you download the .gguf model, or convert it yourself? To rule out a problem with the model, I use GGUF models from TheBloke; I have had issues with models from others.

ahmed-man3 commented 10 months ago

@ianscrivener

Thank you for your prompt support. The installation in (1) has been completed as suggested. As for the model, yes, it was downloaded as .gguf from TheBloke using this link: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/tree/main

ianscrivener commented 10 months ago

Unsure - perhaps try:

1. move the gguf to the python file (I have had issues with absolute paths)
2. see if the 13B model works with CPU only in llama-cpp-python
3. try llama-cpp-python with ctx 1096
4. try a different model - maybe llama-2-7b-chat.ggmlv3.q6_K.gguf
5. try a different python version - I'm using 3.10.12

ahmed-man3 commented 10 months ago

Unsure - perhaps try;

  1. move the gguf to the python file (I have had issues with absolute paths)
  2. see if the 13B model works with CPU only in llama-cpp-python
  3. try llama-cpp-python with ctx 1096
  4. try a different model - maybe llama-2-7b-chat.ggmlv3.q6_K.gguf
  5. try a different python version - I'm using 3.10.12

Your support is very much appreciated. It works fine now after applying #1 and #5. Thank you!

ianscrivener commented 10 months ago

Good to hear. 🏆

shrijayan commented 8 months ago

Why are we giving "LLAMA_METAL=1"?

ianscrivener commented 8 months ago

Previously LLAMA_METAL=1 was required for building for MacOS with Metal... but now Metal is enabled by default.

_"To disable the Metal build at compile time use the LLAMA_NO_METAL=1 flag or the LLAMA_METAL=OFF cmake option"_

shrijayan commented 8 months ago

Previously LLAMA_METAL=1 was required for building for MacOS with Metal... but now Metal is enabled by default.

_"To disable the Metal build at compile time use the LLAMA_NO_METAL=1 flag or the LLAMA_METAL=OFF cmake option"_

Thank you so much