Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

GGML_ASSERT: ggml-cuda.cu:9198: !"CUDA error" #403

Closed laooopooo closed 4 months ago

laooopooo commented 4 months ago

Windows 10, llamafile-0.8.1:

llamafile.exe -m Damysus-2.7B-Chat.Q8_0.gguf -ngl 9999
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA P104-100, compute capability 6.1, VMM: yes
llm_load_tensors: ggml ctx size = 0.42 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 130.47 MiB
llm_load_tensors: CUDA0 buffer size = 2684.12 MiB
.............................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 160.00 MiB, K (f16): 80.00 MiB, V (f16): 80.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.20 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 108.24 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 6.01 MiB
llama_new_context_with_model: graph nodes = 1161
llama_new_context_with_model: graph splits = 2
ggml-cuda.cu:1159: ERROR: CUDA kernel vec_dot_q8_0_q8_1_impl has no device code compatible with CUDA arch 600. ggml-cuda.cu was compiled for: 500,600,700,750,800,900
ggml-cuda.cu:1159: ERROR: CUDA kernel vec_dot_q8_0_q8_1_impl has no device code compatible with CUDA arch 600. ggml-cuda.cu was compiled for: 500,600,700,750,800,900
ggml-cuda.cu:1159: ERROR: CUDA kernel vec_dot_q8_0_q8_1_impl has no device code compatible with CUDA arch 600. ggml-cuda.cu was compiled for: 500,600,700,750,800,900
...
CUDA error: unspecified launch failure
  current device: 0, in function ggml_cuda_op_mul_mat at ggml-cuda.cu:10723
  ggml_cuda_cpy_tensor_2d(src0_dd_i, src0, i03, i02/i02_divisor, dev[id].row_low, dev[id].row_high, stream)
GGML_ASSERT: ggml-cuda.cu:9198: !"CUDA error"


But in the same environment, llamafile-0.8 works fine. Log:

llamafile-0.8.exe -m Damysus-2.7B-Chat.Q8_0.gguf -ngl 9999
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA P104-100, compute capability 6.1, VMM: yes
llm_load_tensors: ggml ctx size = 0.42 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 130.47 MiB
llm_load_tensors: CUDA0 buffer size = 2684.12 MiB
.............................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 160.00 MiB, K (f16): 80.00 MiB, V (f16): 80.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.20 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 108.24 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 6.01 MiB
llama_new_context_with_model: graph nodes = 1161
llama_new_context_with_model: graph splits = 2
{"function":"initialize","level":"INFO","line":485,"msg":"initializing slots","n_slots":1,"tid":"9434528","timestamp":1715074257}
{"function":"initialize","level":"INFO","line":494,"msg":"new slot","n_ctx_slot":512,"slot_id":0,"tid":"9434528","timestamp":1715074257}
{"function":"server_cli","level":"INFO","line":3080,"msg":"model loaded","tid":"9434528","timestamp":1715074257}

llama server listening at http://127.0.0.1:8080

opening browser tab... (pass --nobrowser to disable)
failed to open http://127.0.0.1:8080/ in a browser tab using /c/windows/explorer.exe: process exited with non-zero status
{"function":"server_cli","hostname":"127.0.0.1","level":"INFO","line":3203,"msg":"HTTP server listening","port":"8080","tid":"9434528","timestamp":1715074258}
{"function":"update_slots","level":"INFO","line":1639,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"9434528","timestamp":1715074258}
{"function":"log_server_request","level":"INFO","line":2784,"method":"GET","msg":"request","params":{},"path":"/","remote_addr":"","remote_port":-1,"status":200,"tid":"17594334534256","timestamp":1715074258}
{"function":"log_server_request","level":"INFO","line":2784,"method":"GET","msg":"request","params":{},"path":"/completion.js","remote_addr":"","remote_port":-1,"status":200,"tid":"17594334538528","timestamp":1715074258}
{"function":"log_server_request","level":"INFO","line":2784,"method":"GET","msg":"request","params":{},"path":"/index.js","remote_addr":"","remote_port":-1,"status":200,"tid":"17594334534256","timestamp":1715074258}
{"function":"log_server_request","level":"INFO","line":2784,"method":"GET","msg":"request","params":{},"path":"/json-schema-to-grammar.mjs","remote_addr":"","remote_port":-1,"status":200,"tid":"17594334539824","timestamp":1715074258}
{"function":"log_server_request","level":"INFO","line":2784,"method":"GET","msg":"request","params":{},"path":"/history-template.txt","remote_addr":"","remote_port":-1,"status":200,"tid":"17594334534256","timestamp":1715074258}
{"function":"log_server_request","level":"INFO","line":2784,"method":"GET","msg":"request","params":{},"path":"/prompt-template.txt","remote_addr":"","remote_port":-1,"status":200,"tid":"17594334539824","timestamp":1715074258}

Janghou commented 4 months ago

FYI, same error when trying llamafile (0.8.1) on an AMD 4800U / Linux Ubuntu 22.04 with ROCm:

./Phi-3-mini-4k-instruct.Q4_K_M.llamafile -ngl 9999

ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: invalid device function
  current device: 0, in function ggml_cuda_compute_forward at ggml-cuda.cu:11444
  err
GGML_ASSERT: ggml-cuda.cu:9198: !"CUDA error"

Janghou commented 4 months ago

Not exactly sure why, but after some experimenting and a reboot, it suddenly started working once I set this environment variable:

HSA_OVERRIDE_GFX_VERSION=9.0.0 ./Phi-3-mini-4k-instruct.Q6_K.llamafile -ngl 9999

That said, it's not really faster on the iGPU of an AMD 4800U, but CPU usage is much lower (only one thread), so that's the win here.

FYI, I installed ROCm 6.1.1: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html

It seems gfx90c is not officially supported, but with the override it just works.
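In case it helps anyone hitting the same ROCm error, this is roughly how to check which gfx target the runtime detects and make the override stick for a whole shell session (a sketch, assuming the ROCm tools such as rocminfo are installed and on PATH):

# list the gfx targets ROCm detects; the 4800U iGPU shows up as gfx90c
rocminfo | grep -i gfx
# treat the iGPU as gfx900 (9.0.0) for everything launched from this shell
export HSA_OVERRIDE_GFX_VERSION=9.0.0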

The maximum UMA frame buffer (GPU memory) size I can set in the BIOS is 4 GB, so it can run the Q6_K model.

laooopooo commented 4 months ago

I took a closer look at the llama.cpp upgrade log (Flash Attention). This llama.cpp upgrade enables tensor cores on the GPU, which can sometimes crash. In llama.cpp it is not enabled by default, but llamafile seems to enable it by default, so it hits this bug. In actual testing it also depends on the GGUF file (it may have something to do with the quantization): with FA enabled, some models crash and some don't. All in all, this upgrade is not very friendly. The problem still exists in the new llamafile-0.8.2, and it seems the developers have not noticed it yet.
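One way to narrow this down is to compare runs with flash attention off and on in a plain llama.cpp build; a rough sketch, assuming a recent llama.cpp main binary that exposes the -fa / --flash-attn toggle (the binary name and flags here are an assumption, not taken from this thread):

# flash attention off (llama.cpp default)
./main -m Damysus-2.7B-Chat.Q8_0.gguf -ngl 9999 -p "test"
# flash attention on; if only this run hits the GGML_ASSERT, the crash is
# tied to the FA path for this GPU / quantization combination
./main -m Damysus-2.7B-Chat.Q8_0.gguf -ngl 9999 -fa -p "test"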

jart commented 4 months ago

Thank you @laooopooo. Does it work for you if you add -DGGML_CUDA_FORCE_MMQ?

laooopooo commented 4 months ago

Thank you @laooopooo. Does it work for you if you add -DGGML_CUDA_FORCE_MMQ?

I referenced this: https://github.com/ggerganov/llama.cpp/issues/6529

Compiling with -arch=native works. Compiling with -arch=all produces a much bigger binary, but it also works. Compiling with -arch=all-major does not work. Adding or omitting -DGGML_CUDA_FORCE_MMQ makes no difference; it works either way. My guess is that my card's arch is not covered by all-major.

That's it, so I'll close this topic. Thank you. The build command I used:

nvcc --shared ^
     -arch=native ^
     --forward-unknown-to-host-compiler ^
     -Xcompiler="/nologo /EHsc /O2 /GR /MT" ^
     -DNDEBUG ^
     -DGGML_BUILD=1 ^
     -DGGML_SHARED=1 ^
     -DGGML_CUDA_MMV_Y=1 ^
     -DGGML_CUDA_FORCE_MMQ ^
     -DGGML_MULTIPLATFORM ^
     -DGGML_CUDA_DMMV_X=32 ^
     -DK_QUANTS_PER_ITERATION=2 ^
     -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 ^
     -DGGML_MINIMIZE_CODE_SIZE ^
     -DGGML_USE_TINYBLAS ^
     -o ggml-cuda.dll ^
     ggml-cuda.cu ^
     -lcuda
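
For reference (not something tested in this thread): recent NVIDIA drivers let nvidia-smi report the compute capability directly, which helps when picking an explicit target such as -arch=sm_61 for a compute 6.1 card like the P104-100, instead of -arch=native or -arch=all:

# print the device name and the compute capability the driver reports
# (the compute_cap query needs a reasonably recent nvidia-smi)
nvidia-smi --query-gpu=name,compute_cap --format=csv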