ggerganov / llama.cpp

LLM inference in C/C++
MIT License

CUDA error: unknown error when offloading to gfx1035 #4770

Closed: tdavie closed this issue 5 months ago

tdavie commented 8 months ago

I've encountered an error when offloading layers to the iGPU, on commit a919280, on a laptop with an AMD Ryzen 7 7735HS (Radeon 680M, gfx1035) running Fedora. I've seen a similar issue reported here, but I am using a different hardware configuration and see a distinct CUDA error.

rocm packages

```
$ dnf list --installed | grep "hip\|rocm"
hip-devel.noarch                     5.7.1-1.fc39              @updates
hipblas.x86_64                       5.7.1-1.fc40              @rawhide
hipblas-devel.x86_64                 5.7.1-1.fc40              @rawhide
hipcc.noarch                         5.7.1-1.fc39              @updates
hsakmt.x86_64                        1.0.6-34.rocm5.7.0.fc39   @updates
hsakmt-devel.x86_64                  1.0.6-34.rocm5.7.0.fc39   @updates
rocm-cmake.noarch                    5.7.0-1.fc39              @updates
rocm-comgr.x86_64                    17.0-3.fc39               @updates
rocm-comgr-devel.x86_64              17.0-3.fc39               @updates
rocm-device-libs.x86_64              17.1-1.fc39               @updates
rocm-hip.x86_64                      5.7.1-1.fc39              @updates
rocm-hip-devel.x86_64                5.7.1-1.fc39              @updates
rocm-rpm-macros-modules.x86_64       1.0-7.fc39                @updates
rocm-runtime.x86_64                  5.7.1-1.fc39              @updates
rocm-runtime-devel.x86_64            5.7.1-1.fc39              @updates
rocm-smi.x86_64                      5.7.1-1.fc39              @updates
rocminfo.x86_64                      5.7.0-1.fc39              @updates
```

rocminfo

```
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 7 7735HS with Radeon Graphics
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 7 7735HS with Radeon Graphics
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   4829
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            16
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    13982956(0xd55cec) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    13982956(0xd55cec) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    13982956(0xd55cec) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1035
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon Graphics
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      2048(0x800) KB
  Chip ID:                 5761(0x1681)
  ASIC Revision:           2(0x2)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2200
  BDFID:                   29184
  Internal Node ID:        1
  Compute Unit:            12
  SIMDs per CU:            2
  Shader Engines:          1
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 115
  SDMA engine uCode::      47
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    2097152(0x200000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    2097152(0x200000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1035
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***
```

compilation

```
$ make -j LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx1035 CC=/usr/bin/hipcc CXX=/usr/bin/clang++
I llama.cpp build info:
I UNAME_S:   Linux
I UNAME_P:   unknown
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi
I NVCCFLAGS:
I LDFLAGS:   -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
I CC:        HIP version: 5.7.31921-
I CXX:       clang version 17.0.6 (Fedora 17.0.6-1.fc39)

/usr/bin/hipcc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -c ggml.c -o ggml.o
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c llama.cpp -o llama.o
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c common/common.cpp -o common.o
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c common/sampling.cpp -o sampling.o
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c common/grammar-parser.cpp -o grammar-parser.o
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c common/console.cpp -o console.o
/usr/bin/hipcc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi --offload-arch=gfx1035 -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -x hip -c -o ggml-cuda.o ggml-cuda.cu
/usr/bin/hipcc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -c ggml-alloc.c -o ggml-alloc.o
/usr/bin/hipcc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -c ggml-backend.c -o ggml-backend.o
/usr/bin/hipcc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -c ggml-quants.c -o ggml-quants.o
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c common/train.cpp -o train.o
/usr/bin/hipcc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -c tests/test-c.c -o tests/test-c.o
Use of uninitialized value $HIPCFLAGS in concatenation (.) or string at /usr/bin//hipcc.pl line 602.
Use of uninitialized value $HIPCFLAGS in concatenation (.) or string at /usr/bin//hipcc.pl line 602.
Use of uninitialized value $HIPCFLAGS in concatenation (.) or string at /usr/bin//hipcc.pl line 602.
Use of uninitialized value $HIPCFLAGS in concatenation (.) or string at /usr/bin//hipcc.pl line 602.
Use of uninitialized value $HIPCFLAGS in concatenation (.) or string at /usr/bin//hipcc.pl line 602.
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -c common/build-info.cpp -o build-info.o
ggml.c:1203:5: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion]
 1203 |     GGML_F16_VEC_REDUCE(sumf, sum);
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:835:37: note: expanded from macro 'GGML_F16_VEC_REDUCE'
  835 | #define GGML_F16_VEC_REDUCE GGML_F32Cx8_REDUCE
      |                                     ^
ggml.c:825:33: note: expanded from macro 'GGML_F32Cx8_REDUCE'
  825 | #define GGML_F32Cx8_REDUCE GGML_F32x8_REDUCE
      |                                 ^
ggml.c:771:11: note: expanded from macro 'GGML_F32x8_REDUCE'
  771 |     res = _mm_cvtss_f32(_mm_hadd_ps(t1, t1)); \
      |         ~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:1251:9: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion]
 1251 |         GGML_F16_VEC_REDUCE(sumf[k], sum[k]);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:835:37: note: expanded from macro 'GGML_F16_VEC_REDUCE'
  835 | #define GGML_F16_VEC_REDUCE GGML_F32Cx8_REDUCE
      |                                     ^
ggml.c:825:33: note: expanded from macro 'GGML_F32Cx8_REDUCE'
  825 | #define GGML_F32Cx8_REDUCE GGML_F32x8_REDUCE
      |                                 ^
ggml.c:771:11: note: expanded from macro 'GGML_F32x8_REDUCE'
  771 |     res = _mm_cvtss_f32(_mm_hadd_ps(t1, t1)); \
      |         ~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml-cuda.cu:569:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
  569 | }
      | ^
ggml-cuda.cu:809:20: warning: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'const int' [-Wsign-compare]
  809 |     if (blockIdx.z < ne02) { // src0
      |         ~~~~~~~~~~ ^ ~~~~
ggml-cuda.cu:855:56: warning: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'const int' [-Wsign-compare]
  855 |     if (nidx < ne00 && blockIdx.y < ne01 && blockIdx.z < ne02) {
      |                                             ~~~~~~~~~~ ^ ~~~~
ggml-cuda.cu:855:35: warning: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'const int' [-Wsign-compare]
  855 |     if (nidx < ne00 && blockIdx.y < ne01 && blockIdx.z < ne02) {
      |                        ~~~~~~~~~~ ^ ~~~~
ggml-cuda.cu:9899:103: warning: function 'ggml_backend_cuda_graph_plan_free' could be declared with attribute 'noreturn' [-Wmissing-noreturn]
 9899 | static void ggml_backend_cuda_graph_plan_free(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
      |                                                                                                       ^
ggml-cuda.cu:9906:106: warning: function 'ggml_backend_cuda_graph_plan_compute' could be declared with attribute 'noreturn' [-Wmissing-noreturn]
 9906 | static void ggml_backend_cuda_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
      |                                                                                                          ^
ggml-cuda.cu:10048:15: warning: 'break' will never be executed [-Wunreachable-code-break]
10048 |             } break;
      |               ^~~~~
ggml-cuda.cu:10041:15: warning: 'break' will never be executed [-Wunreachable-code-break]
10041 |             } break;
      |               ^~~~~
ggml-cuda.cu:10017:15: warning: 'break' will never be executed [-Wunreachable-code-break]
10017 |             } break;
      |               ^~~~~
ggml-cuda.cu:10002:15: warning: 'break' will never be executed [-Wunreachable-code-break]
10002 |             } break;
      |               ^~~~~
ggml-cuda.cu:9985:13: warning: 'break' will never be executed [-Wunreachable-code-break]
 9985 |             break;
      |             ^~~~~
ggml-cuda.cu:4433:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
 4433 |     mul_mat_q4_K(
      |     ^
ggml-cuda.cu:4433:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
ggml-cuda.cu:4500:1: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
 4500 | mul_mat_q5_K(
      | ^
ggml-cuda.cu:4500:1: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
ggml-cuda.cu:4569:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
 4569 |     mul_mat_q6_K(
      |     ^
ggml-cuda.cu:4569:5: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
2 warnings generated.
17 warnings generated when compiling for gfx1035.
ggml-cuda.cu:569:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn]
  569 | }
      | ^
ggml-cuda.cu:809:20: warning: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'const int' [-Wsign-compare]
  809 |     if (blockIdx.z < ne02) { // src0
      |         ~~~~~~~~~~ ^ ~~~~
ggml-cuda.cu:855:56: warning: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'const int' [-Wsign-compare]
  855 |     if (nidx < ne00 && blockIdx.y < ne01 && blockIdx.z < ne02) {
      |                                             ~~~~~~~~~~ ^ ~~~~
ggml-cuda.cu:855:35: warning: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'const int' [-Wsign-compare]
  855 |     if (nidx < ne00 && blockIdx.y < ne01 && blockIdx.z < ne02) {
      |                        ~~~~~~~~~~ ^ ~~~~
ggml-cuda.cu:9899:103: warning: function 'ggml_backend_cuda_graph_plan_free' could be declared with attribute 'noreturn' [-Wmissing-noreturn]
 9899 | static void ggml_backend_cuda_graph_plan_free(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
      |                                                                                                       ^
ggml-cuda.cu:9906:106: warning: function 'ggml_backend_cuda_graph_plan_compute' could be declared with attribute 'noreturn' [-Wmissing-noreturn]
 9906 | static void ggml_backend_cuda_graph_plan_compute(ggml_backend_t backend, ggml_backend_graph_plan_t plan) {
      |                                                                                                          ^
ggml-cuda.cu:10048:15: warning: 'break' will never be executed [-Wunreachable-code-break]
10048 |             } break;
      |               ^~~~~
ggml-cuda.cu:10041:15: warning: 'break' will never be executed [-Wunreachable-code-break]
10041 |             } break;
      |               ^~~~~
ggml-cuda.cu:10017:15: warning: 'break' will never be executed [-Wunreachable-code-break]
10017 |             } break;
      |               ^~~~~
ggml-cuda.cu:10002:15: warning: 'break' will never be executed [-Wunreachable-code-break]
10002 |             } break;
      |               ^~~~~
ggml-cuda.cu:9985:13: warning: 'break' will never be executed [-Wunreachable-code-break]
 9985 |             break;
      |             ^~~~~
11 warnings generated when compiling for host.
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/main/main.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o console.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o main -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/quantize/quantize.cpp build-info.o ggml.o llama.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o quantize -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/quantize-stats/quantize-stats.cpp build-info.o ggml.o llama.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o quantize-stats -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/perplexity/perplexity.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o perplexity -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/embedding/embedding.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o embedding -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi pocs/vdot/vdot.cpp ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o vdot -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi pocs/vdot/q8dot.cpp ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o q8dot -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/train-text-from-scratch/train-text-from-scratch.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o train-text-from-scratch -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp ggml.o llama.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o convert-llama2c-to-ggml -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/simple/simple.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o simple -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/batched/batched.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o batched -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/batched-bench/batched-bench.cpp build-info.o ggml.o llama.o common.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o batched-bench -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/save-load-state/save-load-state.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o save-load-state -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -Iexamples/server examples/server/server.cpp examples/llava/clip.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o server -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas -Wno-cast-qual
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/gguf/gguf.cpp ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o gguf -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/llama-bench/llama-bench.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o llama-bench -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -static -fPIC -c examples/llava/llava.cpp -o libllava.a -Wno-cast-qual
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/llava/llava-cli.cpp examples/llava/clip.cpp examples/llava/llava.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o llava-cli -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas -Wno-cast-qual
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/baby-llama/baby-llama.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o baby-llama -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/beam-search/beam-search.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o beam-search -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/speculative/speculative.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o speculative -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/infill/infill.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o console.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o infill -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/tokenize/tokenize.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o tokenize -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/benchmark/benchmark-matmult.cpp build-info.o ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o benchmark-matmult -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/parallel/parallel.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o parallel -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/finetune/finetune.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o train.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o finetune -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/export-lora/export-lora.cpp ggml.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o export-lora -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/lookahead/lookahead.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o lookahead -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas
/usr/bin/clang++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_HIPBLAS -DGGML_USE_CUBLAS -DGGML_HIP_UMA -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi examples/lookup/lookup.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o ggml-cuda.o ggml-alloc.o ggml-backend.o ggml-quants.o -o lookup -L/usr/lib -Wl,-rpath=/usr/lib -lhipblas -lamdhip64 -lrocblas

====  Run ./main -h for help.  ====
```
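
For context on the build flags: `LLAMA_HIP_UMA=1` builds ggml so that device buffers come from HIP unified (managed) memory backed by system RAM rather than the iGPU's small dedicated VRAM carve-out, which is what makes offloading on an APU like the 680M practical. Below is a minimal sketch of that allocation pattern, written for illustration rather than taken from ggml-cuda.cu (the `hipMemAdvise` coarse-grain hint is an optional tweak that I believe helps on APUs, not necessarily something ggml does):

```cpp
// Sketch: UMA-style allocation of the kind LLAMA_HIP_UMA enables (illustrative,
// not copied from ggml). Managed memory is visible to both CPU and iGPU.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    void * buf  = nullptr;
    size_t size = 64u * 1024 * 1024; // 64 MiB, arbitrary

    // With GGML_HIP_UMA, allocations that would normally use hipMalloc
    // (dedicated VRAM) go through hipMallocManaged instead.
    hipError_t err = hipMallocManaged(&buf, size);
    if (err != hipSuccess) {
        fprintf(stderr, "hipMallocManaged: %s\n", hipGetErrorString(err));
        return 1;
    }

    // Optional: coarse-grained coherence tends to be faster on APUs.
    err = hipMemAdvise(buf, size, hipMemAdviseSetCoarseGrain, 0 /* device 0 */);
    if (err != hipSuccess) {
        fprintf(stderr, "hipMemAdvise: %s\n", hipGetErrorString(err));
    }

    (void) hipFree(buf);
    return 0;
}
```

(Compiles with plain `hipcc sketch.cpp -o sketch`.)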

Running with any number of layers offloaded results in the following error:

```
$ ./main -m ~/Downloads/starling-lm-7b-alpha.Q4_K_M.gguf -p "what is the meaning of life, the universe, and everything?" -ngl 1
Log start
main: build = 1766 (a919280)
main: built with HIP version: 5.7.31921- for x86_64-redhat-linux-gnu
main: seed  = 1704360510
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, compute capability 10.3, VMM: no
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /home/user/Downloads/starling-lm-7b-alpha.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = berkeley-nest_starling-lm-7b-alpha
llama_model_loader: - kv   2:                       llama.context_length u32              = 8192
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW) 
llm_load_print_meta: general.name     = berkeley-nest_starling-lm-7b-alpha
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|end_of_turn|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|end_of_turn|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: system memory used  = 4032.99 MiB
llm_load_tensors: VRAM used           =  132.50 MiB
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/33 layers to GPU
.................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 2.00 MB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 76.19 MiB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MiB
llama_new_context_with_model: total VRAM used: 207.50 MiB (model: 132.50 MiB, context: 75.00 MiB)
CUDA error: unknown error
  current device: 0, in function ggml_cuda_mul_mat_mat_batched_cublas at ggml-cuda.cu:8640
  hipblasGemmBatchedEx(g_cublas_handles[g_main_device], HIPBLAS_OP_T, HIPBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), HIPBLAS_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), HIPBLAS_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, HIPBLAS_GEMM_DEFAULT)
GGML_ASSERT: ggml-cuda.cu:226: !"CUDA error"
[New LWP 9043]
[New LWP 9045]
[New LWP 9046]
[New LWP 9047]
[New LWP 9048]
[New LWP 9049]
[New LWP 9050]
[New LWP 9051]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.fedoraproject.org/>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f469c71cd43 in wait4 () from /lib64/libc.so.6
#0  0x00007f469c71cd43 in wait4 () from /lib64/libc.so.6
#1  0x000000000054d505 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) ()
#2  0x0000000000555d48 in ggml_cuda_mul_mat(ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#3  0x000000000054f3a3 in ggml_cuda_compute_forward ()
#4  0x0000000000443cf1 in ggml_compute_forward ()
#5  0x0000000000433efc in ggml_graph_compute_thread ()
#6  0x00000000004337a8 in ggml_graph_compute ()
#7  0x0000000000577f5b in ggml_backend_cpu_graph_compute ()
#8  0x0000000000574a07 in ggml_backend_graph_compute ()
#9  0x00000000004e00c1 in llama_decode_internal(llama_context&, llama_batch) ()
#10 0x00000000004e0954 in llama_decode ()
#11 0x0000000000526e17 in llama_init_from_gpt_params(gpt_params&) ()
#12 0x0000000000415df3 in main ()
[Inferior 1 (process 9042) detached]
Aborted (core dumped)
```
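
If it helps with triage: the failing call above is the batched F16 GEMM path, so it should be reproducible outside llama.cpp. Here is a minimal sketch of the same `hipblasGemmBatchedEx` call shape (my own reduction; the sizes, zero-filled matrices, and all-F16 types are assumptions, not the exact values ggml passes):

```cpp
// Minimal reduction of the hipblasGemmBatchedEx call that fails above.
// A sketch for triage, not code from llama.cpp.
#include <hip/hip_runtime.h>
#include <hipblas/hipblas.h>
#include <cstdint>
#include <cstdio>
#include <vector>

#define HIP_CHECK(call)                                                  \
    do {                                                                 \
        hipError_t err_ = (call);                                        \
        if (err_ != hipSuccess) {                                        \
            fprintf(stderr, "HIP error %s at line %d\n",                 \
                    hipGetErrorString(err_), __LINE__);                  \
            return 1;                                                    \
        }                                                                \
    } while (0)

int main() {
    const int n     = 64; // square matrices, arbitrary size
    const int batch = 4;  // arbitrary batch count

    hipblasHandle_t handle;
    if (hipblasCreate(&handle) != HIPBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "hipblasCreate failed\n");
        return 1;
    }

    // One F16 buffer per batch entry for A, B and C (A and B zeroed).
    std::vector<void *> a(batch), b(batch), c(batch);
    for (int i = 0; i < batch; ++i) {
        HIP_CHECK(hipMalloc(&a[i], n * n * sizeof(uint16_t)));
        HIP_CHECK(hipMalloc(&b[i], n * n * sizeof(uint16_t)));
        HIP_CHECK(hipMalloc(&c[i], n * n * sizeof(uint16_t)));
        HIP_CHECK(hipMemset(a[i], 0, n * n * sizeof(uint16_t)));
        HIP_CHECK(hipMemset(b[i], 0, n * n * sizeof(uint16_t)));
    }

    // GemmBatchedEx takes *device* arrays of matrix pointers.
    void **da, **db, **dc;
    HIP_CHECK(hipMalloc(&da, batch * sizeof(void *)));
    HIP_CHECK(hipMalloc(&db, batch * sizeof(void *)));
    HIP_CHECK(hipMalloc(&dc, batch * sizeof(void *)));
    HIP_CHECK(hipMemcpy(da, a.data(), batch * sizeof(void *), hipMemcpyHostToDevice));
    HIP_CHECK(hipMemcpy(db, b.data(), batch * sizeof(void *), hipMemcpyHostToDevice));
    HIP_CHECK(hipMemcpy(dc, c.data(), batch * sizeof(void *), hipMemcpyHostToDevice));

    const uint16_t alpha = 0x3C00; // 1.0 as an IEEE half bit pattern
    const uint16_t beta  = 0x0000; // 0.0 as an IEEE half bit pattern

    // Same shape as the failing call: F16 in, F16 out, op A transposed.
    hipblasStatus_t st = hipblasGemmBatchedEx(
        handle, HIPBLAS_OP_T, HIPBLAS_OP_N,
        n, n, n,
        &alpha,
        (const void **) da, HIPBLAS_R_16F, n,
        (const void **) db, HIPBLAS_R_16F, n,
        &beta,
        (void **) dc, HIPBLAS_R_16F, n,
        batch,
        HIPBLAS_R_16F, HIPBLAS_GEMM_DEFAULT);

    HIP_CHECK(hipDeviceSynchronize());
    printf("hipblasGemmBatchedEx status: %d\n", (int) st);

    hipblasDestroy(handle);
    return (st == HIPBLAS_STATUS_SUCCESS) ? 0 : 1;
}
```

Built with something like `hipcc --offload-arch=gfx1035 repro.cpp -lhipblas -o repro`. If this also fails on gfx1035, the problem most likely sits in the ROCm/rocBLAS stack for this iGPU rather than in llama.cpp itself.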

Let me know if you need any further details. I appreciate all the contributors' great work on this program!
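
One more data point that may be relevant: gfx1035 is not on ROCm's official support list, and a commonly reported workaround for RDNA2 APUs (which I have not verified on this machine) is to build for the supported gfx1030 target and override the reported ISA version at runtime:

```
# Untested, commonly suggested workaround for gfx1035 APUs:
# build for gfx1030 and tell the HSA runtime to treat the iGPU as gfx1030.
make clean
make -j LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx1030 CC=/usr/bin/hipcc CXX=/usr/bin/clang++
HSA_OVERRIDE_GFX_VERSION=10.3.0 ./main -m ~/Downloads/starling-lm-7b-alpha.Q4_K_M.gguf -p "test" -ngl 1
```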

github-actions[bot] commented 5 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.