[BUG] Segmentation Fault during `NexaVLMInference` Initialization on macOS Ventura

Ysf101 commented 2 weeks ago

Issue Description

I'm encountering a segmentation fault when initializing NexaVLMInference using the nexa.gguf package on macOS Ventura.

Steps to Reproduce

1. Create a virtual environment with Python 3.12 (architecture: x86_64).
2. Install nexaai via `pip install nexaai`.
3. Download the required models.
4. Run the following script:

from nexa.gguf import NexaVLMInference

image_inference = NexaVLMInference(
    model_path="llava-v1.6-vicuna-7b:q4_0",
    local_path=None,
    stop_words=[],
    temperature=0.3,
    max_new_tokens=3000,
    top_k=3,
    top_p=0.2,
    profiling=False
)

print("NexaVLMInference initialized successfully.")
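Because a segmentation fault kills the interpreter outright, it can help to run the reproduction in a child process and inspect how it died. The following is a minimal sketch using only the standard library; `run_isolated` and `REPRO_SNIPPET` are hypothetical names introduced here for illustration, not part of the Nexa SDK. It relies on Python's `-X faulthandler` option, which prints the Python-level traceback to stderr when a fatal signal such as SIGSEGV arrives.

```python
import signal
import subprocess
import sys


def run_isolated(code: str, timeout: float = 600.0) -> dict:
    """Run `code` in a child interpreter with faulthandler enabled.

    Returns the exit code, the name of the terminating signal (if any),
    and the captured output, so a crash does not take down the caller.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    sig_name = None
    # On POSIX, a negative return code means the child was killed by a signal.
    if proc.returncode < 0:
        try:
            sig_name = signal.Signals(-proc.returncode).name
        except ValueError:
            sig_name = f"signal {-proc.returncode}"
    return {
        "returncode": proc.returncode,
        "signal": sig_name,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


# Hypothetical usage: paste the reproduction script from this report into
# a string and pass it in, for example:
#   result = run_isolated(REPRO_SNIPPET)
#   print(result["signal"], result["stderr"])
```

If the child reports `SIGSEGV`, the stderr text shows which Python call was active at the time of the crash, which complements the native lldb backtrace.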

Environment Details:

- macOS Ventura version: [e.g., 13.0]
- Python version: 3.12.0
- nexaai version: 0.0.9.0
- Architecture: x86_64
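For anyone trying to compare setups, the details above can be gathered programmatically. This is a small sketch using only the standard library; `collect_environment` is a hypothetical helper name, and the package name `nexaai` is taken from the install step in this report.

```python
import platform
from importlib import metadata


def collect_environment(package: str = "nexaai") -> dict:
    """Collect OS, Python, architecture, and package version details."""
    try:
        pkg_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        pkg_version = "not installed"
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        # mac_ver() returns an empty version string on non-macOS systems.
        "macos_version": platform.mac_ver()[0] or "n/a",
        "python": platform.python_version(),
        "architecture": platform.machine(),
        package: pkg_version,
    }


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

The `architecture` field is worth checking in particular, since an x86_64 Python on macOS may be running natively, under emulation, or inside a virtual machine.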


* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x48)
  * frame #0: 0x000000010885dcc4 libggml_llama.dylib`ggml_backend_buffer_is_host + 4
    frame #1: 0x0000000107f547e1 libllava.dylib`clip_model_load + 11969
    frame #2: 0x0000000105464972 libffi.8.dylib`ffi_call_unix64 + 82
    ...

OS

macOS Ventura

Python Version

3.12.0

Nexa SDK Version

0.0.9.0

GPU (if using one)

No response

Ysf101 commented 2 weeks ago

It seems the issue is related to memory handling; it happens when I give it a somewhat large prompt.

Ysf101 commented 2 weeks ago

Update: I switched to using the CPU, and now the models initialize successfully. However, when I give it a somewhat complex prompt, I get this error:

Process 836 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000000000000
error: memory read failed for 0x0
Target 0: (python) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x0000000000000000
    frame #1: 0x00000001088c011d libggml_llama.dylib`ggml_backend_blas_graph_compute(ggml_backend*, ggml_cgraph*) + 3453
    frame #2: 0x0000000108886d67 libggml_llama.dylib`ggml_backend_sched_graph_compute_async + 1447
    frame #3: 0x00000001086cff11 libllama.dylib`llama_decode + 2609
    frame #4: 0x0000000105289972 libffi.8.dylib`ffi_call_unix64 + 82
    frame #5: 0x000000010528916b libffi.8.dylib`ffi_call_int + 827
    frame #6: 0x0000000105288d6b libffi.8.dylib`ffi_call + 219
    frame #7: 0x000000010635848e _ctypes.cpython-312-darwin.so`_ctypes_callproc + 638
    frame #8: 0x0000000106352437 _ctypes.cpython-312-darwin.so`PyCFuncPtr_call + 279
    frame #9: 0x0000000100206ee8 python`_PyEval_EvalFrameDefault + 265704
    frame #10: 0x000000010007eb8e python`_PyObject_Call_Prepend + 334
    frame #11: 0x00000001001174f9 python`slot_tp_call + 105
    frame #12: 0x000000010020c216 python`_PyEval_EvalFrameDefault + 286998
    frame #13: 0x00000001001c369d python`PyEval_EvalCode + 253
    frame #14: 0x0000000100290c50 python`run_mod + 272
    frame #15: 0x0000000100290a13 python`pyrun_file + 147
    frame #16: 0x00000001002903fe python`_PyRun_SimpleFileObject + 302
    frame #17: 0x000000010028fd02 python`_PyRun_AnyFileObject + 66
    frame #18: 0x00000001002bdc3c python`pymain_run_file_obj + 204
    frame #19: 0x00000001002bd6f9 python`pymain_run_file + 89
    frame #20: 0x00000001002bcd40 python`Py_RunMain + 1376
    frame #21: 0x00000001002bdf79 python`pymain_main + 505
    frame #22: 0x0000000100001279 python`main + 57
    frame #23: 0x00007ff811cac310 dyld`start + 2432
(lldb) ^D

zhiyuan8 commented 2 weeks ago

Hi @Ysf101, in the LLaVA architecture one image takes 729+ tokens, so with a long prompt it is easy to hit an OOM error on a Mac laptop. We have addressed this issue and plan to propose a new multimodal model architecture to support it.

A similar issue also appears to have been reported in llama.cpp: https://github.com/ggerganov/llama.cpp/issues/4880
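The token arithmetic behind the comment above can be sketched roughly as follows. The figure of 729 tokens per image is the lower bound quoted above; the context window of 2048 and the prompt length of 50 are assumptions chosen purely for illustration, since the thread does not state what the SDK actually uses. `remaining_context` is a hypothetical helper, not an SDK function.

```python
IMAGE_TOKENS_MIN = 729  # lower bound per image, as quoted in the comment above


def remaining_context(n_ctx: int, prompt_tokens: int,
                      max_new_tokens: int, n_images: int = 1) -> int:
    """Return how many context slots are left (negative means over budget)."""
    used = n_images * IMAGE_TOKENS_MIN + prompt_tokens + max_new_tokens
    return n_ctx - used


if __name__ == "__main__":
    # max_new_tokens=3000 comes from the reproduction script in this report;
    # n_ctx=2048 and prompt_tokens=50 are illustrative assumptions.
    print(remaining_context(n_ctx=2048, prompt_tokens=50, max_new_tokens=3000))
```

Under these assumed numbers the budget is already negative before any long prompt is involved. This only illustrates how quickly image tokens add up; it does not confirm that context or memory exhaustion is the cause of the crash reported here.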

Ysf101 commented 2 weeks ago

Hi @zhiyuan8, thank you for the information. To add to my initial bug report: I tried running NexaAI with 5 GB of RAM, then with 12 GB of RAM, and also on my friend's laptop (we both run macOS in VMware and use NexaAI inside it). My friend's laptop produced the same segmentation fault I was getting. The input does not need to be long, as the example below shows:


(ysf) ysf@ysfs % nexa run Qwen2.5-0.5B-Instruct:q2_K
Model Qwen2.5-0.5B-Instruct:q2_K already exists at /Users/ysf/.cache/nexa/hub/official/Qwen2.5-0.5B-Instruct/q2_K.gguf
>>> hi
assistant: Hello! How can I help you today?

>>> hiiiii
assistant: Hello! How can I assist you today?

>>> hiiiiiiiiii
zsh: segmentation fault  nexa run Qwen2.5-0.5B-Instruct:q2_K

Meanwhile, a Windows VM with 4 GB of RAM running NexaAI in VMware runs all models perfectly, so the machine and everything else is the same except the OS.

Ysf101 commented 1 week ago

@zhiyuan8 an update: I upgraded to macOS Sequoia and the issue is completely resolved for text models, but it still occurs with vision models.


(ysf) ysf@ysfs-Mac % nexa run nanollava     
Model nanoLLaVA:model-fp16 already exists at /Users/ysf/.cache/nexa/hub/official/nanoLLaVA/model-fp16.gguf
Model nanoLLaVA:projector-fp16 already exists at /Users/ysf/.cache/nexa/hub/official/nanoLLaVA/projector-fp16.gguf
⠋ zsh: segmentation fault  nexa run nanollava
(ysf) ysf@ysfs-Mac  %
