ggerganov / llama.cpp

LLM inference in C/C++

Bug: All SYCL builds since b3987 or so are unstable #10323

Open 0xDEADFED5 opened 5 hours ago

0xDEADFED5 commented 5 hours ago

What happened?

b4081 no longer works with this command line: llama-server.exe -t 16 --threads-http 8 --mlock -ngl 99 -m C:\LLM\Qwen2.5-3B-Instruct_Q4_1.gguf --port 8888 --ctx-size 112000 -np 48 --sampling-seq mt --min-p 0.1 --temp 1.5 -dt .1 --batch-size 2000

Something changed, because the same command line works with earlier builds:

[SYCL] call ggml_check_sycl
llama_new_context_with_model: n_seq_max     = 48
llama_new_context_with_model: n_ctx         = 112000
llama_new_context_with_model: n_ctx_per_seq = 2333
llama_new_context_with_model: n_batch       = 2000
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (2333) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.5|    512|    1024|   32| 16704M|            1.3.31093|
llama_kv_cache_init:      SYCL0 KV buffer size =  3937.50 MiB
llama_new_context_with_model: KV self size  = 3937.50 MiB, K (f16): 1968.75 MiB, V (f16): 1968.75 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =    27.82 MiB
ggml_backend_sycl_buffer_type_alloc_buffer: can't malloc 3278637056 Bytes memory on deviceggml_gallocr_reserve_n: failed to allocate SYCL0 buffer of size 7573604352
ggml_backend_sycl_buffer_type_alloc_buffer: can't malloc 3278637056 Bytes memory on deviceggml_gallocr_reserve_n: failed to allocate SYCL0 buffer of size 7573604352
llama_new_context_with_model: failed to allocate compute buffers
common_init_from_params: failed to create context with model 'C:\LLM\Qwen2.5-3B-Instruct_Q4_1.gguf'
srv    load_model: failed to load model, 'C:\LLM\Qwen2.5-3B-Instruct_Q4_1.gguf'
main: exiting due to model loading error
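
For reference, the 3937.50 MiB KV-cache figure above is consistent with an f16 K/V cache over the full 112000-token context. A rough back-of-the-envelope check (a bash sketch; the 36 layers, 2 KV heads, and head dim of 128 are assumed from Qwen2.5-3B's published config) reproduces that number, and the failed compute-buffer reservation of 7573604352 bytes comes on top of it:

# assumed model shape for Qwen2.5-3B: 36 layers, 2 KV heads, head dim 128
n_ctx=112000; n_layer=36; n_kv_heads=2; head_dim=128; bytes_f16=2
# K and V, stored in f16, for every layer and every context position
kv_bytes=$(( 2 * n_kv_heads * head_dim * bytes_f16 * n_layer * n_ctx ))
echo "KV cache:       $kv_bytes bytes ($(( kv_bytes / 1048576 )) MiB)"      # 4128768000 bytes = 3937 MiB
echo "compute buffer: 7573604352 bytes ($(( 7573604352 / 1048576 )) MiB)"   # ~7222 MiB, from the log above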

There's also this ongoing issue I can try to help debug if anyone has ideas: https://github.com/ggerganov/llama.cpp/issues/10184

I run the SYCL build of llama-server 24x7, and since b3987 or so it can no longer run overnight without failing. If anyone has any suggestions, or if there is anything I can do to help, please let me know.

Name and Version

ZE_LOADER_DEBUG_TRACE:Using Loader Library Path:
ZE_LOADER_DEBUG_TRACE:Tracing Layer Library Path: ze_tracing_layer.dll
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
version: 4081 (1607a5e5)
built with MSVC 19.41.34123.0 for

What operating system are you seeing the problem on?

Windows

Hardware: Intel Arc A770 (16GB)

Relevant log output

No response

JohannesGaessler commented 5 hours ago

b3987 or so.

If you can, do a git bisect and identify the exact commit which introduced the problem. (I am not one of the devs working specifically on SYCL.)
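
A minimal sketch of such a bisect (bash-flavoured; the CMake invocation is a placeholder to adapt to your usual SYCL build, see the SYCL build docs for the full steps):

# mark the range: a tag known to fail and the last tag known to work
git bisect start
git bisect bad b4081
git bisect good <last-good-tag>
# git now checks out a candidate commit; rebuild it and re-run the failing command
cmake -B build -DGGML_SYCL=ON            # assumed flags; use your normal SYCL build
cmake --build build --config Release -j
# then report the result and repeat until git names the first bad commit
git bisect good    # or: git bisect bad
git bisect reset   # when finished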

0xDEADFED5 commented 3 hours ago

I suspect commit c5b0f4b. I'm going to do further testing, but it might take a day to confirm.
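
A quick way to confirm or rule out that single commit without a full bisect (same assumed build steps as above) is to build c5b0f4b and its immediate parent, then run the identical llama-server command line against both:

git checkout c5b0f4b   && cmake -B build-suspect -DGGML_SYCL=ON && cmake --build build-suspect --config Release -j
git checkout c5b0f4b~1 && cmake -B build-parent  -DGGML_SYCL=ON && cmake --build build-parent  --config Release -j
# if only the build from c5b0f4b misbehaves, that commit is the culprit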