Closed: m828 closed this issue 3 months ago.
Same but with RX 7600 XT (gfx1102)
I'm also having the same issue when running the latest Ollama built from source (97c20ed) on my RX 6700 XT on Ubuntu Server 22.04. I thought setting the override and target as pointed out elsewhere would fix it, but it didn't for me.
Running as root:
HSA_OVERRIDE_GFX_VERSION=10.3.0 AMDGPU_TARGETS=gfx1030 ROCM_PATH=/opt/rocm ./ollama serve
Output:
2024/07/18 10:17:41 routes.go:1091: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION:10.3.0 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-07-18T10:17:41.593Z level=INFO source=images.go:767 msg="total blobs: 5"
time=2024-07-18T10:17:41.593Z level=INFO source=images.go:774 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
- using env: export GIN_MODE=release
- using code: gin.SetMode(gin.ReleaseMode)
[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST /api/embed --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST /v1/embeddings --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (6 handlers)
[GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (6 handlers)
[GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-07-18T10:17:41.593Z level=INFO source=routes.go:1138 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-07-18T10:17:41.593Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1222237880/runners
time=2024-07-18T10:17:41.768Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 rocm_v60103 cpu cpu_avx]"
time=2024-07-18T10:17:41.768Z level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
time=2024-07-18T10:17:41.772Z level=INFO source=amd_linux.go:333 msg="skipping rocm gfx compatibility check" HSA_OVERRIDE_GFX_VERSION=10.3.0
time=2024-07-18T10:17:41.772Z level=INFO source=types.go:105 msg="inference compute" id=0 library=rocm compute=gfx1031 driver=6.7 name=1002:73df total="12.0 GiB" available="12.0 GiB"
[GIN] 2024/07/18 - 10:18:06 | 200 | 64.39µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/07/18 - 10:18:06 | 200 | 18.258624ms | 127.0.0.1 | POST "/api/show"
time=2024-07-18T10:18:06.239Z level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa gpu=0 parallel=4 available=12843397120 required="6.2 GiB"
time=2024-07-18T10:18:06.240Z level=INFO source=memory.go:309 msg="offload to rocm" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[12.0 GiB]" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-07-18T10:18:06.240Z level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama1222237880/runners/rocm_v60103/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 37159"
time=2024-07-18T10:18:06.241Z level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-18T10:18:06.241Z level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-18T10:18:06.241Z level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3337 commit="a8db2a9c" tid="140617125921600" timestamp=1721297886
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="140617125921600" timestamp=1721297886 total_threads=12
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="37159" tid="140617125921600" timestamp=1721297886
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
time=2024-07-18T10:18:06.493Z level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6700 XT, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 4155.99 MiB
llm_load_tensors: CPU buffer size = 281.81 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 2.02 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 560.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 24.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: invalid device function
current device: 0, in function ggml_cuda_compute_forward at /home/ccidral/dev/oss/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:2283
err
GGML_ASSERT: /home/ccidral/dev/oss/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:100: !"CUDA error"
[New LWP 6078]
[New LWP 6079]
[New LWP 6080]
[New LWP 6081]
[New LWP 6082]
[New LWP 6083]
[New LWP 6084]
[New LWP 6085]
[New LWP 6086]
[New LWP 6087]
[New LWP 6088]
[New LWP 6089]
[New LWP 6090]
time=2024-07-18T10:18:09.200Z level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server not responding"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fe4a621842f in __GI___wait4 (pid=6095, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x00007fe4a621842f in __GI___wait4 (pid=6095, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x000000000192c607 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) ()
#2 0x0000000001930dca in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) ()
#3 0x00000000018f53e8 in ggml_backend_sched_graph_compute_async ()
#4 0x0000000001a7e9f9 in llama_decode ()
#5 0x0000000001b6e0fb in llama_init_from_gpt_params(gpt_params&) ()
#6 0x00000000017fca80 in llama_server_context::load_model(gpt_params const&) ()
#7 0x00000000017e9ab2 in main ()
[Inferior 1 (process 6077) detached]
time=2024-07-18T10:18:09.466Z level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
time=2024-07-18T10:18:09.717Z level=ERROR source=sched.go:443 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) CUDA error\""
[GIN] 2024/07/18 - 10:18:09 | 500 | 3.522559149s | 127.0.0.1 | POST "/api/chat"
time=2024-07-18T10:18:14.718Z level=WARN source=sched.go:634 msg="gpu VRAM usage didn't recover within timeout" seconds=5.000848772 model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-18T10:18:14.967Z level=WARN source=sched.go:634 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250591142 model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-18T10:18:15.217Z level=WARN source=sched.go:634 msg="gpu VRAM usage didn't recover within timeout" seconds=5.50039131 model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
rocminfo output:
ROCk module version 6.7.0 is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.13
Runtime Ext Version: 1.4
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 5 5600 6-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 5 5600 6-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3500
BDFID: 0
Internal Node ID: 0
Compute Unit: 12
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32780668(0x1f4317c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32780668(0x1f4317c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32780668(0x1f4317c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1031
Uuid: GPU-XX
Marketing Name: AMD Radeon RX 6700 XT
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 3072(0xc00) KB
L3: 98304(0x18000) KB
Chip ID: 29663(0x73df)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2725
BDFID: 11520
Internal Node ID: 1
Compute Unit: 40
SIMDs per CU: 2
Shader Engines: 2
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 118
SDMA engine uCode:: 80
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 12566528(0xbfc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 12566528(0xbfc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1031
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
@m828 I know you're not using Ollama but I hope this helps somehow. This was the missing piece for me: https://github.com/ollama/ollama/issues/3107#issuecomment-2133970898
- Adding gfx1031 to the list of supported GPUs in ollama\llm\generate\gen_windows.ps1
- Building ollama
Except instead of gen_windows.ps1, because I'm on Linux, I changed gen_linux.sh (a rough sketch of that change follows the command below). Once I built Ollama, I ran:
HSA_OVERRIDE_GFX_VERSION=10.3.0 ROCM_PATH=/opt/rocm ./ollama serve
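For reference, here is a sketch of the gen_linux.sh change, assuming the script carries a list of gfx targets that it passes to the ROCm build and that you build the way the development docs describe (go generate, then go build). The variable name and separator below are illustrative, not a copy of the file, so look for the existing gfxXXXX list in your checkout:

# In llm/generate/gen_linux.sh, extend the existing GPU target list (names illustrative):
#   AMDGPU_TARGETS="...;gfx1030;gfx1100;gfx1101;gfx1102"
# becomes
#   AMDGPU_TARGETS="...;gfx1030;gfx1031;gfx1100;gfx1101;gfx1102"
# Then rebuild from the repo root; go generate runs the generate scripts:
go generate ./...
go build .
# and start the server with the same command as above.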
Same but with RX 7600 XT (gfx1102)
It appears I accidentally copied gfx1030 from the provided build command when it should have been gfx1102 for my card. Changing it fixed the issue.
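In case it saves someone a retry, here is a sketch of the corrected build for the 7600 XT, assuming everything else from the command quoted in the issue description below stays the same. Whether an HSA override is still needed once the binary is built natively for gfx1102, and which value it should be (commonly 11.0.0 for RDNA3 cards, not 10.3.0), is an assumption on my part:

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -S . -B build -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1102 -DCMAKE_BUILD_TYPE=Release \
  && cmake --build build --config Release -- -j 16
export HIP_VISIBLE_DEVICES=0
# Only if an override is still required, match the card's family (assumed value):
# export HSA_OVERRIDE_GFX_VERSION=11.0.0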
This issue was closed because it has been inactive for 14 days since being marked as stale.
What happened?
ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: invalid device function
current device: 0, in function ggml_cuda_compute_forward at ggml/src/ggml-cuda.cu:2288
err
GGML_ASSERT: ggml/src/ggml-cuda.cu:101: !"CUDA error"
[New LWP 252]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007dc7bf87142f in __GI___wait4 (pid=255, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x00007dc7bf87142f in __GI___wait4 (pid=255, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x0000647041457f0b in ggml_print_backtrace ()
#2 0x000064704132bb47 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) ()
#3 0x00006470413300ea in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) ()
#4 0x00006470414a41d6 in ggml_backend_sched_graph_compute_async ()
#5 0x00006470414fdd7a in llama_decode ()
#6 0x00006470415ca265 in llama_init_from_gpt_params(gpt_params&) ()
#7 0x000064704131315e in main ()
[Inferior 1 (process 251) detached]
Name and Version
Command:
./llama-cli -m models/ggml-meta-llama-3-8b-Q4_K_M.gguf -p "You are a helpful assistant" -cnv -c 512 --n-gpu-layers 99

GPU: AMD Radeon RX 6700 XT

Build environment, following the online instructions:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -S . -B build -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release \
  && cmake --build build --config Release -- -j 16
export HSA_OVERRIDE_GFX_VERSION=10.3.0
export HIP_VISIBLE_DEVICES=0
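A quick cross-check that is not part of the original report, but can confirm the mismatch behind "invalid device function": compare the ISA the ROCm runtime reports with the gfx code objects actually compiled into the binary. The binary path and the strings trick below are assumptions based on the build command above:

# Native ISA reported by the runtime (unset HSA_OVERRIDE_GFX_VERSION first to see the card's real target):
rocminfo | grep -oE 'gfx[0-9a-f]+' | sort -u
# gfx targets embedded in the llama.cpp build (crude but dependency-free):
strings build/bin/llama-cli | grep -oE 'gfx[0-9a-f]+' | sort -u
# "invalid device function" typically means the runtime's target is missing from the second list.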
What operating system are you seeing the problem on?
No response
Relevant log output
No response