abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

M1 Metal Initialization failing when torch is not imported #437

Open remixer-dec opened 1 year ago

remixer-dec commented 1 year ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

The model is loaded with Metal (MPS) initialization

Current Behavior

Metal initialization fails with an error saying MPS is not supported

Environment and Context

Darwin 21.4.0 Darwin Kernel Version 21.4.0: Fri Mar 18 00:46:32 PDT 2022; root:xnu-8020.101.4~15/RELEASE_ARM64_T6000 arm64
Python 3.10.12
GNU Make 4.3
Apple clang version 13.0.0 (clang-1300.0.29.30)
llama-cpp-python==0.1.67

Failure Information (for bugs)

When I run the minimal code with n_gpu_layers>0 on an M1 without importing PyTorch, Python crashes with an 'MPS not supported' error

Steps to Reproduce

Run this code:

from llama_cpp import Llama
# import torch
model = Llama(
  n_ctx=2000,
  n_gpu_layers=1,
  model_path='PATH/TO/GGML-MODEL.bin',
  seed=0
)

Now uncomment import torch and the bug is gone!
I was building a simple API server and spent a few hours trying to understand why llama_cpp would not work in a plain script.
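For reference, the workaround boils down to import order. The sketch below is a hedged version of it: torch is imported purely for its side effect of initializing MPS state before llama_cpp touches Metal (torch.backends.mps.is_available() is standard PyTorch API; the guarded import is my addition so the snippet degrades gracefully when torch is absent):

```python
# Workaround sketch. Assumption: importing torch initializes Metal/MPS
# state as a side effect, which llama_cpp's Metal backend then relies on.
try:
    import torch  # noqa: F401 -- imported only for its side effect
    mps_ready = torch.backends.mps.is_available()
except ImportError:
    mps_ready = False  # torch not installed; workaround cannot be applied

# Only after the torch import do we construct the model, exactly as in
# the repro above:
#   from llama_cpp import Llama
#   model = Llama(n_ctx=2000, n_gpu_layers=1,
#                 model_path='PATH/TO/GGML-MODEL.bin', seed=0)
print("MPS available:", mps_ready)
```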

Failure Logs

llama.cpp: loading model from /../model.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2000
llama_model_load_internal: n_embd     = 3200
llama_model_load_internal: n_mult     = 240
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 26
llama_model_load_internal: n_rot      = 100
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 8640
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size =    0.06 MB
llama_model_load_internal: mem required  = 2862.72 MB (+  682.00 MB per state)
llama_new_context_with_model: kv self size  =  634.77 MB
ggml_metal_init: allocating
ggml_metal_init: not using MPS
GGML_ASSERT: /private/var/folders/zk/hd0v0z2910x13xv8hq213c600000gn/T/pip-install-xd2ukar4/llama-cpp-python_22514c24c03b4751ada6d41fd839ed53/vendor/llama.cpp/ggml-metal.m:103: false && "MPS not supported"
[1]    16077 abort      python3.10 minimal.py
remixer-dec commented 1 year ago

The same bug appears when running llama_cpp.server, since the server code never imports torch.

HOST=localhost python3.10 -m llama_cpp.server --model ./PATH/TO/MODEL.gguf --n_gpu_layers 1 --n_ctx 2048 

But the error looks different:

llama_new_context_with_model: max tensor size =   205.08 MB
ggml_metal_add_buffer: failed to allocate 'data            ' buffer, size =     0.00 MB
llama_new_context_with_model: failed to add buffer
ggml_metal_free: deallocating

It also prints ggml_metal_init: hasUnifiedMemory = false, whereas with PyTorch imported it prints ggml_metal_init: hasUnifiedMemory = true.

bbernst commented 11 months ago

This was happening to me, and adding import torch fixed it. Some extra info I was looking at before adding the import:

1. Initially everything was working; after running a notebook several times I started getting the failed-to-allocate error.
2. Over a few runs the buffer size seems to add up instead of being overwritten, even though it is the same model.
3. Could there be a missing clear step that importing torch performs, effectively freeing this buffer?

First run:

ggml_metal_add_buffer: allocated 'data            ' buffer, size =  7339.34 MB, ( 7339.84 / 49152.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   489.50 MB, ( 7829.34 / 49152.00)
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =   275.38 MB, ( 8104.72 / 49152.00)

Second run:

ggml_metal_add_buffer: allocated 'data            ' buffer, size =  7339.34 MB, (15445.94 / 49152.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   489.50 MB, (15935.44 / 49152.00)
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =   275.38 MB, (16210.81 / 49152.00)
ggml_metal_free: deallocating

Third run:

ggml_metal_add_buffer: error: failed to allocate 'data            ' buffer, size =     0.00 MB
llama_new_context_with_model: failed to add buffer
ggml_metal_free: deallocating
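If the growth really comes from old contexts never being freed, one notebook-side mitigation to try (my assumption, not a confirmed fix) is to drop the previous Llama object and force a garbage-collection pass before re-loading the model:

```python
import gc

# Hypothetical cleanup between notebook runs. Assumption: the Metal
# buffers from the previous run are only released when the old Llama
# object is garbage-collected, so drop the reference and collect
# explicitly before loading the model again.
model = None              # release the reference held from the previous run
collected = gc.collect()  # force a full collection pass
print("objects collected:", collected)

# ...then re-create the model as usual:
# model = Llama(model_path='PATH/TO/MODEL.gguf', n_gpu_layers=1)
```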
remixer-dec commented 9 months ago

Possibly related: https://github.com/ggerganov/llama.cpp/discussions/3580

remixer-dec commented 7 months ago

Got a bit more info about this (or a similar) issue:

ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/opt/homebrew/lib/python3.10/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: error: Error Domain=MTLLibraryErrorDomain Code=3 "Compiler encountered XPC_ERROR_CONNECTION_INVALID (is the OS shutting down?)" UserInfo={NSLocalizedDescription=Compiler encountered XPC_ERROR_CONNECTION_INVALID (is the OS shutting down?)}
llama_new_context_with_model: failed to initialize Metal backend
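The log above shows GGML_METAL_PATH_RESOURCES = nil, so one thing worth trying (a guess, not a verified fix) is pointing that variable at the directory containing ggml-metal.metal before starting Python, so ggml does not have to compile the shader from the packaged source:

```shell
# Hypothetical mitigation: set GGML_METAL_PATH_RESOURCES to the directory
# that holds ggml-metal.metal (the path below is the one from the log;
# adjust it for your install).
export GGML_METAL_PATH_RESOURCES=/opt/homebrew/lib/python3.10/site-packages/llama_cpp
# then run the script again:
# python3.10 minimal.py
```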