ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: `illegal hardware instruction` when running on M3 mac Sequoia installed with brew #9676

Open Ben-Epstein opened 1 week ago

Ben-Epstein commented 1 week ago

What happened?

When I run llama-cli -m ./Phi-3.5-mini-instruct-Q6_K_L.gguf -p "I believe the meaning of life is" -n 128 I get an error

build: 3829 (44f59b43) with Apple clang version 15.0.0 (clang-1500.3.9.4) for x86_64-apple-darwin23.4.0
main: llama backend init
[1]    36447 illegal hardware instruction  llama-cli -m  -p "I believe the meaning of life is" -n 128

I did a fresh install with Homebrew but got the same crash repeatedly. This seems to be the same as https://github.com/ggerganov/llama.cpp/issues/8065, but that issue was closed, so I wanted an open one to track it.

Name and Version

❯ llama-cli --version
version: 3829 (44f59b43)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for x86_64-apple-darwin23.4.0

What operating system are you seeing the problem on?

Mac

Relevant log output

No additional log output available, even with -v

Ben-Epstein commented 1 week ago

When I install with Nix, it is able to run, but I get a different error, which I don't think should happen, since I am able to run this model using gpt4all without memory issues:

ggml_backend_metal_buffer_type_alloc_buffer: error: failed to allocate buffer, size =  8484.02 MiB
llama_new_context_with_model: failed to allocate compute buffers
ggml_metal_free: deallocating
llama_init_from_gpt_params: failed to create context with model 'Phi-3.5-mini-instruct-IQ2_M.gguf'
main: error: unable to load model

But that's a separate issue.

jpohhhh commented 1 week ago

It looks like two different models are being used, without a source for either, so this will be hard to reproduce.

If that's the only output from llama.cpp, my guess is that you're running on an M3 but llama.cpp was built for Intel -- cf. "x86_64" in "x86_64-apple-darwin23.4.0".
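
One quick way to confirm this (a hedged sketch, assuming the brew-installed `llama-cli` is on `PATH`) is to inspect the binary's architecture with `file`:

```shell
# Report the architecture of the llama-cli binary on PATH. A native
# Apple Silicon build shows "arm64"; "x86_64" means it will run under
# Rosetta 2 and can crash on AVX instructions.
bin=$(command -v llama-cli || true)
if [ -n "$bin" ]; then
  file "$bin"
else
  echo "llama-cli not found on PATH"
fi
```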

ngxson commented 1 week ago

For an M3 Mac, you must build for the ARM target. Otherwise, the x86_64 binary will run under Rosetta, which causes an illegal instruction error whenever it hits an AVX instruction.

llama_new_context_with_model: failed to allocate compute buffers

You need to specify the context length, e.g. -c 4096; otherwise recent builds fall back to the model's full training context, which makes the compute buffers too large to allocate.
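
Putting both suggestions together, a minimal sketch of a native build from source (assuming `git`, `cmake`, and the Xcode command-line tools are installed; the model filename is the one from the original report):

```shell
# Build llama.cpp natively. On Apple Silicon this produces an arm64 binary
# (with Metal enabled by default), avoiding the Rosetta/AVX crash.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Run with an explicit context length so the compute buffers stay small:
./build/bin/llama-cli -m ./Phi-3.5-mini-instruct-Q6_K_L.gguf \
  -p "I believe the meaning of life is" -n 128 -c 4096
```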