SJTU-IPADS / PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

ggml-cuda.cu:8949: invalid argument #191

Closed · NeverGpDzy closed this issue 5 months ago

NeverGpDzy commented 5 months ago

Problem

At runtime, the following error occurs: CUDA error 1 at /root/PowerInfer/ggml-cuda.cu:8949: invalid argument. All dependencies are satisfied. Please suggest how to troubleshoot this, thanks.
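For reference, CUDA error code 1 is cudaErrorInvalidValue, whose runtime message string is "invalid argument", and the log line matches the CUDA_CHECK-style reporting that ggml-cuda.cu wraps around CUDA runtime calls. The sketch below only illustrates that pattern and is not the exact PowerInfer macro; the deliberately invalid cudaMemcpy reproduces the same message.

// Minimal sketch (assumed, not the exact ggml-cuda.cu macro) of CUDA_CHECK-style
// error reporting that yields "CUDA error 1 at <file>:<line>: invalid argument".
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %d at %s:%d: %s\n",           \
                    (int) err_, __FILE__, __LINE__,                   \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

int main() {
    // Copying a nonzero byte count through a null device pointer returns
    // cudaErrorInvalidValue (code 1), reproducing the reported message.
    CUDA_CHECK(cudaMemcpy(nullptr, nullptr, 16, cudaMemcpyHostToDevice));
    return 0;
}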

Configuration

CPU: Intel(R) Xeon(R) Platinum 8474C

GPU: NVIDIA GeForce RTX 4090 D

CUDA:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Build log


(Powerinfer2) root@autodl-container-02b744a905-865b3ab4:~/PowerInfer# cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.1.105") 
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 12.1.105
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Using CUDA architectures: 52;61;70
GNU ld (GNU Binutils for Ubuntu) 2.38
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Configuring done
-- Generating done
-- Build files have been written to: /root/PowerInfer/build
[  1%] Building C object CMakeFiles/ggml.dir/ggml.c.o
/root/PowerInfer/ggml.c: In function ‘ggml_get_n_tasks’:
/root/PowerInfer/ggml.c:2006:24: warning: array subscript 71 is above array bounds of ‘const char *[70]’ [-Warray-bounds]
 2006 |     return GGML_OP_NAME[op];
      |            ~~~~~~~~~~~~^~~~
/root/PowerInfer/ggml.c:1586:21: note: while referencing ‘GGML_OP_NAME’
 1586 | static const char * GGML_OP_NAME[GGML_OP_COUNT] = {
      |                     ^~~~~~~~~~~~
In file included from /usr/include/stdio.h:894,
                 from /root/PowerInfer/ggml.c:21:
In function ‘printf’,
    inlined from ‘ggml_graph_print’ at /root/PowerInfer/ggml.c:18011:9:
/usr/include/x86_64-linux-gnu/bits/stdio2.h:112:10: warning: ‘%16s’ directive argument is null [-Wformat-overflow=]
  112 |   return __printf_chk (__USE_FORTIFY_LEVEL - 1, __fmt, __va_arg_pack ());
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[  2%] Building C object CMakeFiles/ggml.dir/ggml-alloc.c.o
[  3%] Building C object CMakeFiles/ggml.dir/ggml-backend.c.o
[  4%] Building C object CMakeFiles/ggml.dir/ggml-quants.c.o
/root/PowerInfer/ggml-quants.c: In function ‘ggml_axpy_q4_0_q8_0’:
/root/PowerInfer/ggml-quants.c:2457:54: warning: cast discards ‘const’ qualifier from pointer target type [-Wcast-qual]
 2457 |         __m256 by = _mm256_loadu_ps((const __m256 *)((char *)vy+i*128));
      |                                                      ^
/root/PowerInfer/ggml-quants.c:2457:37: warning: passing argument 1 of ‘_mm256_loadu_ps’ from incompatible pointer type [-Wincompatible-pointer-types]
 2457 |         __m256 by = _mm256_loadu_ps((const __m256 *)((char *)vy+i*128));
      |                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                     |
      |                                     const __m256 *
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/include/immintrin.h:43,
                 from /root/PowerInfer/ggml-impl.h:74,
                 from /root/PowerInfer/ggml-quants.h:3,
                 from /root/PowerInfer/ggml-quants.c:1:
/usr/lib/gcc/x86_64-linux-gnu/11/include/avxintrin.h:903:31: note: expected ‘const float *’ but argument is of type ‘const __m256 *’
  903 | _mm256_loadu_ps (float const *__P)
      |                  ~~~~~~~~~~~~~^~~
/root/PowerInfer/ggml-quants.c:2460:36: warning: cast discards ‘const’ qualifier from pointer target type [-Wcast-qual]
 2460 |         _mm256_storeu_ps((__m256*)((char*)vz + i*128), by);
      |                                    ^
/root/PowerInfer/ggml-quants.c:2460:26: warning: passing argument 1 of ‘_mm256_storeu_ps’ from incompatible pointer type [-Wincompatible-pointer-types]
 2460 |         _mm256_storeu_ps((__m256*)((char*)vz + i*128), by);
      |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                          |
      |                          __m256 *
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/include/immintrin.h:43,
                 from /root/PowerInfer/ggml-impl.h:74,
                 from /root/PowerInfer/ggml-quants.h:3,
                 from /root/PowerInfer/ggml-quants.c:1:
/usr/lib/gcc/x86_64-linux-gnu/11/include/avxintrin.h:909:26: note: expected ‘float *’ but argument is of type ‘__m256 *’
  909 | _mm256_storeu_ps (float *__P, __m256 __A)
      |                   ~~~~~~~^~~
/root/PowerInfer/ggml-quants.c:2467:47: warning: cast discards ‘const’ qualifier from pointer target type [-Wcast-qual]
 2467 |         by = _mm256_loadu_ps((const __m256 *)((char*)vy+i*128+32));
      |                                               ^
/root/PowerInfer/ggml-quants.c:2467:30: warning: passing argument 1 of ‘_mm256_loadu_ps’ from incompatible pointer type [-Wincompatible-pointer-types]
 2467 |         by = _mm256_loadu_ps((const __m256 *)((char*)vy+i*128+32));
      |                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                              |
      |                              const __m256 *
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/include/immintrin.h:43,
                 from /root/PowerInfer/ggml-impl.h:74,
                 from /root/PowerInfer/ggml-quants.h:3,
                 from /root/PowerInfer/ggml-quants.c:1:
/usr/lib/gcc/x86_64-linux-gnu/11/include/avxintrin.h:903:31: note: expected ‘const float *’ but argument is of type ‘const __m256 *’
  903 | _mm256_loadu_ps (float const *__P)
      |                  ~~~~~~~~~~~~~^~~
/root/PowerInfer/ggml-quants.c:2469:36: warning: cast discards ‘const’ qualifier from pointer target type [-Wcast-qual]
 2469 |         _mm256_storeu_ps((__m256*)((char*)vz + i*128+32), by);
      |                                    ^
/root/PowerInfer/ggml-quants.c:2469:26: warning: passing argument 1 of ‘_mm256_storeu_ps’ from incompatible pointer type [-Wincompatible-pointer-types]
 2469 |         _mm256_storeu_ps((__m256*)((char*)vz + i*128+32), by);
      |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                          |
      |                          __m256 *
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/include/immintrin.h:43,
                 from /root/PowerInfer/ggml-impl.h:74,
                 from /root/PowerInfer/ggml-quants.h:3,
                 from /root/PowerInfer/ggml-quants.c:1:
/usr/lib/gcc/x86_64-linux-gnu/11/include/avxintrin.h:909:26: note: expected ‘float *’ but argument is of type ‘__m256 *’
  909 | _mm256_storeu_ps (float *__P, __m256 __A)
      |                   ~~~~~~~^~~
/root/PowerInfer/ggml-quants.c:2479:47: warning: cast discards ‘const’ qualifier from pointer target type [-Wcast-qual]
 2479 |         by = _mm256_loadu_ps((const __m256 *)((char*)vy+i*128+64));
      |                                               ^
/root/PowerInfer/ggml-quants.c:2479:30: warning: passing argument 1 of ‘_mm256_loadu_ps’ from incompatible pointer type [-Wincompatible-pointer-types]
 2479 |         by = _mm256_loadu_ps((const __m256 *)((char*)vy+i*128+64));
      |                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                              |
      |                              const __m256 *
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/include/immintrin.h:43,
                 from /root/PowerInfer/ggml-impl.h:74,
                 from /root/PowerInfer/ggml-quants.h:3,
                 from /root/PowerInfer/ggml-quants.c:1:
/usr/lib/gcc/x86_64-linux-gnu/11/include/avxintrin.h:903:31: note: expected ‘const float *’ but argument is of type ‘const __m256 *’
  903 | _mm256_loadu_ps (float const *__P)
      |                  ~~~~~~~~~~~~~^~~
/root/PowerInfer/ggml-quants.c:2482:36: warning: cast discards ‘const’ qualifier from pointer target type [-Wcast-qual]
 2482 |         _mm256_storeu_ps((__m256*)((char*)vz + i*128+64), by);
      |                                    ^
/root/PowerInfer/ggml-quants.c:2482:26: warning: passing argument 1 of ‘_mm256_storeu_ps’ from incompatible pointer type [-Wincompatible-pointer-types]
 2482 |         _mm256_storeu_ps((__m256*)((char*)vz + i*128+64), by);
      |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                          |
      |                          __m256 *
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/include/immintrin.h:43,
                 from /root/PowerInfer/ggml-impl.h:74,
                 from /root/PowerInfer/ggml-quants.h:3,
                 from /root/PowerInfer/ggml-quants.c:1:
/usr/lib/gcc/x86_64-linux-gnu/11/include/avxintrin.h:909:26: note: expected ‘float *’ but argument is of type ‘__m256 *’
  909 | _mm256_storeu_ps (float *__P, __m256 __A)
      |                   ~~~~~~~^~~
/root/PowerInfer/ggml-quants.c:2489:47: warning: cast discards ‘const’ qualifier from pointer target type [-Wcast-qual]
 2489 |         by = _mm256_loadu_ps((const __m256 *)((char*)vy+i*128+96));
      |                                               ^
/root/PowerInfer/ggml-quants.c:2489:30: warning: passing argument 1 of ‘_mm256_loadu_ps’ from incompatible pointer type [-Wincompatible-pointer-types]
 2489 |         by = _mm256_loadu_ps((const __m256 *)((char*)vy+i*128+96));
      |                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                              |
      |                              const __m256 *
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/include/immintrin.h:43,
                 from /root/PowerInfer/ggml-impl.h:74,
                 from /root/PowerInfer/ggml-quants.h:3,
                 from /root/PowerInfer/ggml-quants.c:1:
/usr/lib/gcc/x86_64-linux-gnu/11/include/avxintrin.h:903:31: note: expected ‘const float *’ but argument is of type ‘const __m256 *’
  903 | _mm256_loadu_ps (float const *__P)
      |                  ~~~~~~~~~~~~~^~~
/root/PowerInfer/ggml-quants.c:2491:36: warning: cast discards ‘const’ qualifier from pointer target type [-Wcast-qual]
 2491 |         _mm256_storeu_ps((__m256*)((char*)vz + i*128+96), by);
      |                                    ^
/root/PowerInfer/ggml-quants.c:2491:26: warning: passing argument 1 of ‘_mm256_storeu_ps’ from incompatible pointer type [-Wincompatible-pointer-types]
 2491 |         _mm256_storeu_ps((__m256*)((char*)vz + i*128+96), by);
      |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                          |
      |                          __m256 *
In file included from /usr/lib/gcc/x86_64-linux-gnu/11/include/immintrin.h:43,
                 from /root/PowerInfer/ggml-impl.h:74,
                 from /root/PowerInfer/ggml-quants.h:3,
                 from /root/PowerInfer/ggml-quants.c:1:
/usr/lib/gcc/x86_64-linux-gnu/11/include/avxintrin.h:909:26: note: expected ‘float *’ but argument is of type ‘__m256 *’
  909 | _mm256_storeu_ps (float *__P, __m256 __A)
      |                   ~~~~~~~^~~
/root/PowerInfer/ggml-quants.c:2435:12: warning: unused variable ‘acc’ [-Wunused-variable]
 2435 |     __m256 acc = _mm256_setzero_ps();
      |            ^~~
[  5%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda.cu.o
/root/PowerInfer/ggml-cuda.cu(6717): warning #177-D: variable "ne0" was declared but never referenced
      const int64_t ne0 = src->ne[0];
                    ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/root/PowerInfer/ggml-cuda.cu(6979): warning #177-D: variable "nrows_dst" was declared but never referenced
      const int64_t nrows_dst = dst->backend == GGML_BACKEND_GPU && id == g_main_device ? ne0 : row_diff;
                    ^

/root/PowerInfer/ggml-cuda.cu(7204): warning #177-D: variable "ne10" was declared but never referenced
      const int64_t ne10 = src1->ne[1];
                    ^

/root/PowerInfer/ggml-cuda.cu(7226): warning #177-D: variable "src1_dfloat" was declared but never referenced
      const dfloat * src1_dfloat = (const dfloat *) src1_ddf_i;
                     ^

/root/PowerInfer/ggml-cuda.cu(7255): warning #177-D: variable "ne10" was declared but never referenced
      const int64_t ne10 = src1->ne[1];
                    ^

/root/PowerInfer/ggml-cuda.cu(7357): warning #177-D: variable "predict_idx" was declared but never referenced
      int predict_idx = idx;
          ^

/root/PowerInfer/ggml-cuda.cu(8430): warning #177-D: variable "ne01" was declared but never referenced
      const int64_t ne01 = src0->ne[1];
                    ^

/root/PowerInfer/ggml-cuda.cu(8773): warning #177-D: variable "all_on_device" was declared but never referenced
      bool all_on_device = (src0->backend == GGML_BACKEND_GPU || src0->backend == GGML_BACKEND_GPU_SPLIT) &&
           ^

/root/PowerInfer/ggml-cuda.cu(4421): warning #177-D: variable "bid" was declared but never referenced
      const int bid = blockIdx.y;
                ^

/root/PowerInfer/ggml-cuda.cu(4549): warning #177-D: variable "bid" was declared but never referenced
      const int bid = blockIdx.y;
                ^

/root/PowerInfer/ggml-cuda.cu(4484): warning #177-D: variable "d" was declared but never referenced
      short *d = (short *)((char *)vx + ncols * gpu_row * 2);
             ^

/root/PowerInfer/ggml-cuda.cu(4492): warning #177-D: variable "bid" was declared but never referenced
      const int bid = blockIdx.y;
                ^

/root/PowerInfer/ggml-cuda.cu(579): warning #177-D: function "sigmoid_f32" was declared but never referenced
                   void sigmoid_f32(const float * x, float * dst, const int k) {
                        ^

/root/PowerInfer/ggml-cuda.cu(5353): warning #177-D: function "dequantize_mul_mat_vec_q4_0_cuda_sparse" was declared but never referenced
  static void dequantize_mul_mat_vec_q4_0_cuda_sparse(const void * vx, const dfloat * y, float * dst, const int ncols, const int nrows, cudaStream_t stream, int *lst, float *idx) {
              ^

[... the same set of nvcc #177-D warnings is printed twice more; nvcc compiles the device code once per target CUDA architecture (52, 61, 70) and repeats the diagnostics each time ...]

/root/PowerInfer/ggml-cuda.cu: In function ‘void ggml_cuda_op_mul_mat_batch_sparse(const ggml_tensor*, const ggml_tensor*, ggml_tensor*, const char*, const float*, const char*, float*, int64_t, int64_t, int64_t, int64_t, CUstream_st* const&)’:
/root/PowerInfer/ggml-cuda.cu:6962:1: warning: unused parameter ‘src1_ddq_i’ [-Wunused-parameter]
 6961 |     const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, const char * src0_dd_i, const float * src1_ddf_i,
      |                                                                                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 6962 |     const char * src1_ddq_i, float * dst_dd_i, const int64_t row_low, const int64_t row_high, const int64_t src1_ncols,
      | ^   ~~~~~~
/root/PowerInfer/ggml-cuda.cu:6963:1: warning: unused parameter ‘src1_padded_row_size’ [-Wunused-parameter]
 6962 |     const char * src1_ddq_i, float * dst_dd_i, const int64_t row_low, const int64_t row_high, const int64_t src1_ncols,
      |                                                                                                       ~~~~~~~~~~~~~~~~~
 6963 |     const int64_t src1_padded_row_size, const cudaStream_t & stream) {
      | ^   ~~~~~~~~~~~~~~~~
/root/PowerInfer/ggml-cuda.cu: In function ‘void ggml_cuda_op_mul_mat_vec_sparse_dequantized(const ggml_tensor*, const ggml_tensor*, ggml_tensor*, const char*, const float*, const char*, float*, int64_t, int64_t, int64_t, int64_t, CUstream_st* const&)’:
/root/PowerInfer/ggml-cuda.cu:7251:1: warning: unused parameter ‘src1_ddq_i’ [-Wunused-parameter]
 7250 |     const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, const char * src0_dd_i, const float * src1_ddf_i,
      |                                                                                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 7251 |     const char * src1_ddq_i, float * dst_dd_i, const int64_t row_low, const int64_t row_high, const int64_t src1_ncols,
      | ^   ~~~~~~
/root/PowerInfer/ggml-cuda.cu: In function ‘void ggml_cuda_op_mul_mat_transpose_select_gemm(const ggml_tensor*, const ggml_tensor*, ggml_tensor*, const char*, const float*, const char*, float*, int64_t, int64_t, int64_t, int64_t, CUstream_st* const&)’:
/root/PowerInfer/ggml-cuda.cu:7452:91: warning: cast from type ‘const float*’ to type ‘float*’ casts away qualifiers [-Wcast-qual]
 7452 |     transpose_cont<<< numBlocks, blockSize, 0, stream>>>((float *)src0_ddf_i, transpose, ne00, ne01, 1, ne00, ne01,NULL);
      |                                                                                           ^~~~~~~~~~~~~~~~~~~
/root/PowerInfer/ggml-cuda.cu: At global scope:
/root/PowerInfer/ggml-cuda.cu:8771:6: warning: no previous declaration for ‘void ggml_cuda_axpy(const ggml_tensor*, const ggml_tensor*, ggml_tensor*)’ [-Wmissing-declarations]
 8771 | void ggml_cuda_axpy(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
      |      ^~~~~~~~~~~~~~
/root/PowerInfer/ggml-cuda.cu: In function ‘void ggml_cuda_op_mul_mat_transpose_gemm(const ggml_tensor*, const ggml_tensor*, ggml_tensor*, const char*, const float*, const char*, float*, int64_t, int64_t, int64_t, int64_t, CUstream_st* const&)’:
/root/PowerInfer/ggml-cuda.cu:7550:20: warning: ‘src0_ddq_as_f32’ may be used uninitialized in this function [-Wmaybe-uninitialized]
 7550 |         ggml_cuda_pool_free(src0_ddq_as_f32, src0_as);
      |         ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
/root/PowerInfer/ggml-cuda.cu:7501:8: note: ‘src0_ddq_as_f32’ was declared here
 7501 |     float * src0_ddq_as_f32;
      |        ^~~~~~~~~~~~~~~
[  5%] Built target ggml
[  6%] Linking CUDA static library libggml_static.a
[  6%] Built target ggml_static
[  7%] Building CXX object CMakeFiles/llama.dir/llama.cpp.o
/root/PowerInfer/llama.cpp:632:26: warning: no previous declaration for ‘tensor_offloading_levels get_offloading_level(llm_tensor)’ [-Wmissing-declarations]
  632 | tensor_offloading_levels get_offloading_level(llm_tensor tensor) {
      |                          ^~~~~~~~~~~~~~~~~~~~
/root/PowerInfer/llama.cpp: In function ‘int64_t sum_gpu_index(ggml_tensor*)’:
/root/PowerInfer/llama.cpp:2722:39: warning: missing initializer for member ‘ggml_init_params::mem_buffer’ [-Wmissing-field-initializers]
 2722 |     ggml_context * ctx_aux = ggml_init({
      |                              ~~~~~~~~~^~
 2723 |         /* mem_size */ 1 << 10,
      |         ~~~~~~~~~~~~~~~~~~~~~~~        
 2724 |     });
      |     ~~                                 
/root/PowerInfer/llama.cpp:2722:39: warning: missing initializer for member ‘ggml_init_params::no_alloc’ [-Wmissing-field-initializers]
/root/PowerInfer/llama.cpp: In lambda function:
/root/PowerInfer/llama.cpp:2805:47: warning: unused parameter ‘progress’ [-Wunused-parameter]
 2805 |         llama_progress_callback cb = [](float progress, void *ctx) {
      |                                         ~~~~~~^~~~~~~~
/root/PowerInfer/llama.cpp:2805:63: warning: unused parameter ‘ctx’ [-Wunused-parameter]
 2805 |         llama_progress_callback cb = [](float progress, void *ctx) {
      |                                                         ~~~~~~^~~
/root/PowerInfer/llama.cpp: In member function ‘size_t llama_augmentation_model_loader::slice_ffn_mat_to_gpu(llama_layer&)’:
/root/PowerInfer/llama.cpp:2909:23: warning: unused variable ‘gpu_idx’ [-Wunused-variable]
 2909 |         ggml_tensor * gpu_idx = layer.gpu_idx;
      |                       ^~~~~~~
/root/PowerInfer/llama.cpp: In function ‘void llm_load_sparse_model_tensors(llama_model_loader&, llama_model&, const llama_context_params*, int, long int, bool, bool, bool, llama_progress_callback, void*)’:
/root/PowerInfer/llama.cpp:3165:28: warning: variable ‘llama_backend_offload’ set but not used [-Wunused-but-set-variable]
 3165 |     enum ggml_backend_type llama_backend_offload = GGML_BACKEND_CPU;
      |                            ^~~~~~~~~~~~~~~~~~~~~
/root/PowerInfer/llama.cpp:3166:28: warning: variable ‘llama_backend_offload_split’ set but not used [-Wunused-but-set-variable]
 3166 |     enum ggml_backend_type llama_backend_offload_split = GGML_BACKEND_CPU;
      |                            ^~~~~~~~~~~~~~~~~~~~~~~~~~~
/root/PowerInfer/llama.cpp: In function ‘void llama_reserve_model_kv_cache(llama_model*, const llama_context_params*)’:
/root/PowerInfer/llama.cpp:3319:29: warning: comparison of integer expressions of different signedness: ‘int’ and ‘unsigned int’ [-Wsign-compare]
 3319 |     if (model->n_gpu_layers < hparams.n_layer + 1) {
      |         ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
/root/PowerInfer/llama.cpp: In function ‘std::pair<ggml_tensor*, ggml_tensor*> llm_build_kv_store(ggml_context*, const llama_hparams&, const llama_kv_cache&, ggml_cgraph*, ggml_tensor*, ggml_tensor*, int64_t, int32_t, int32_t, const llm_build_cb&, int64_t)’:
/root/PowerInfer/llama.cpp:4232:31: warning: unused parameter ‘graph’ [-Wunused-parameter]
 4232 |          struct ggml_cgraph * graph,
      |          ~~~~~~~~~~~~~~~~~~~~~^~~~~
/root/PowerInfer/llama.cpp: In lambda function:
/root/PowerInfer/llama.cpp:4677:88: warning: unused parameter ‘nl’ [-Wunused-parameter]
 4677 | const llm_build_cb no_offload_cb = [](struct ggml_tensor * cur, const char * name, int nl) {
      |                                                                                    ~~~~^~
/root/PowerInfer/llama.cpp: In function ‘int llama_decode_internal(llama_context&, llama_batch)’:
/root/PowerInfer/llama.cpp:6592:16: warning: unused variable ‘full_offload_supported’ [-Wunused-variable]
 6592 |     const bool full_offload_supported =
      |                ^~~~~~~~~~~~~~~~~~~~~~
/root/PowerInfer/llama.cpp: In function ‘llama_model_params llama_model_default_params()’:
/root/PowerInfer/llama.cpp:9400:5: warning: missing initializer for member ‘llama_model_params::reset_gpu_index’ [-Wmissing-field-initializers]
 9400 |     };
      |     ^
/root/PowerInfer/llama.cpp:9400:5: warning: missing initializer for member ‘llama_model_params::disable_gpu_index’ [-Wmissing-field-initializers]
[  8%] Linking CXX static library libllama.a
[  8%] Built target llama
[  9%] Generating build details from Git
-- Found Git: /usr/bin/git (found version "2.34.1") 
[ 10%] Building CXX object common/CMakeFiles/build_info.dir/build-info.cpp.o
[ 10%] Built target build_info
[ 12%] Building CXX object common/CMakeFiles/common.dir/common.cpp.o
[ 13%] Building CXX object common/CMakeFiles/common.dir/sampling.cpp.o
[ 14%] Building CXX object common/CMakeFiles/common.dir/console.cpp.o
[ 15%] Building CXX object common/CMakeFiles/common.dir/grammar-parser.cpp.o
[ 16%] Building CXX object common/CMakeFiles/common.dir/train.cpp.o
[ 17%] Linking CXX static library libcommon.a
[ 17%] Built target common
[ 18%] Building CXX object tests/CMakeFiles/test-quantize-fns.dir/test-quantize-fns.cpp.o
[ 19%] Linking CXX executable ../bin/test-quantize-fns
[ 19%] Built target test-quantize-fns
[ 20%] Building CXX object tests/CMakeFiles/test-quantize-perf.dir/test-quantize-perf.cpp.o
[ 21%] Linking CXX executable ../bin/test-quantize-perf
[ 21%] Built target test-quantize-perf
[ 23%] Building CXX object tests/CMakeFiles/test-sampling.dir/test-sampling.cpp.o
[ 24%] Linking CXX executable ../bin/test-sampling
[ 24%] Built target test-sampling
[ 25%] Building CXX object tests/CMakeFiles/test-tokenizer-0-llama.dir/test-tokenizer-0-llama.cpp.o
[ 26%] Linking CXX executable ../bin/test-tokenizer-0-llama
[ 26%] Built target test-tokenizer-0-llama
[ 27%] Building CXX object tests/CMakeFiles/test-tokenizer-0-falcon.dir/test-tokenizer-0-falcon.cpp.o
[ 28%] Linking CXX executable ../bin/test-tokenizer-0-falcon
[ 28%] Built target test-tokenizer-0-falcon
[ 29%] Building CXX object tests/CMakeFiles/test-tokenizer-1-llama.dir/test-tokenizer-1-llama.cpp.o
[ 30%] Linking CXX executable ../bin/test-tokenizer-1-llama
[ 30%] Built target test-tokenizer-1-llama
[ 31%] Building CXX object tests/CMakeFiles/test-tokenizer-1-bpe.dir/test-tokenizer-1-bpe.cpp.o
[ 32%] Linking CXX executable ../bin/test-tokenizer-1-bpe
[ 32%] Built target test-tokenizer-1-bpe
[ 34%] Building CXX object tests/CMakeFiles/test-grammar-parser.dir/test-grammar-parser.cpp.o
[ 35%] Linking CXX executable ../bin/test-grammar-parser
[ 35%] Built target test-grammar-parser
[ 36%] Building CXX object tests/CMakeFiles/test-llama-grammar.dir/test-llama-grammar.cpp.o
[... the llama.cpp warnings listed above are emitted a second time here, because test-llama-grammar.cpp includes llama.cpp directly ...]
[ 37%] Linking CXX executable ../bin/test-llama-grammar
[ 37%] Built target test-llama-grammar
[ 38%] Building CXX object tests/CMakeFiles/test-grad0.dir/test-grad0.cpp.o
[ 39%] Linking CXX executable ../bin/test-grad0
[ 39%] Built target test-grad0
[ 40%] Building CXX object tests/CMakeFiles/test-rope.dir/test-rope.cpp.o
[ 41%] Linking CXX executable ../bin/test-rope
[ 41%] Built target test-rope
[ 42%] Building C object tests/CMakeFiles/test-c.dir/test-c.c.o
[ 43%] Linking CXX executable ../bin/test-c
[ 43%] Built target test-c
[ 45%] Building CXX object examples/baby-llama/CMakeFiles/baby-llama.dir/baby-llama.cpp.o
[ 46%] Linking CXX executable ../../bin/baby-llama
[ 46%] Built target baby-llama
[ 47%] Building CXX object examples/batched/CMakeFiles/batched.dir/batched.cpp.o
[ 48%] Linking CXX executable ../../bin/batched
[ 48%] Built target batched
[ 49%] Building CXX object examples/batched-bench/CMakeFiles/batched-bench.dir/batched-bench.cpp.o
[ 50%] Linking CXX executable ../../bin/batched-bench
[ 50%] Built target batched-bench
[ 51%] Building CXX object examples/beam-search/CMakeFiles/beam-search.dir/beam-search.cpp.o
[ 52%] Linking CXX executable ../../bin/beam-search
[ 52%] Built target beam-search
[ 53%] Building CXX object examples/benchmark/CMakeFiles/benchmark.dir/benchmark-matmult.cpp.o
[ 54%] Linking CXX executable ../../bin/benchmark
[ 54%] Built target benchmark
[ 56%] Building CXX object examples/convert-llama2c-to-ggml/CMakeFiles/convert-llama2c-to-ggml.dir/convert-llama2c-to-ggml.cpp.o
[ 57%] Linking CXX executable ../../bin/convert-llama2c-to-ggml
[ 57%] Built target convert-llama2c-to-ggml
[ 58%] Building CXX object examples/embedding/CMakeFiles/embedding.dir/embedding.cpp.o
[ 59%] Linking CXX executable ../../bin/embedding
[ 59%] Built target embedding
[ 60%] Building CXX object examples/finetune/CMakeFiles/finetune.dir/finetune.cpp.o
[ 61%] Linking CXX executable ../../bin/finetune
[ 61%] Built target finetune
[ 62%] Building CXX object examples/infill/CMakeFiles/infill.dir/infill.cpp.o
[ 63%] Linking CXX executable ../../bin/infill
[ 63%] Built target infill
[ 64%] Building CXX object examples/llama-bench/CMakeFiles/llama-bench.dir/llama-bench.cpp.o
[ 65%] Linking CXX executable ../../bin/llama-bench
[ 65%] Built target llama-bench
[ 67%] Building CXX object examples/llava/CMakeFiles/llava.dir/llava.cpp.o
/root/PowerInfer/examples/llava/llava.cpp: In function ‘bool load_file_to_bytes(const char*, unsigned char**, long int*)’:
/root/PowerInfer/examples/llava/llava.cpp:130:10: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’ declared with attribute ‘warn_unused_result’ [-Wunused-result]
  130 |     fread(buffer, 1, fileSize, file); // Read the file into the buffer
      |     ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
[ 68%] Building CXX object examples/llava/CMakeFiles/llava.dir/clip.cpp.o
[ 68%] Built target llava
[ 69%] Linking CXX static library libllava_static.a
[ 69%] Built target llava_static
[ 70%] Building CXX object examples/llava/CMakeFiles/llava-cli.dir/llava-cli.cpp.o
[ 71%] Linking CXX executable ../../bin/llava-cli
[ 71%] Built target llava-cli
[ 72%] Building CXX object examples/main/CMakeFiles/main.dir/main.cpp.o
[ 73%] Linking CXX executable ../../bin/main
[ 73%] Built target main
[ 74%] Building CXX object examples/parallel/CMakeFiles/parallel.dir/parallel.cpp.o
[ 75%] Linking CXX executable ../../bin/parallel
[ 75%] Built target parallel
[ 76%] Building CXX object examples/perplexity/CMakeFiles/perplexity.dir/perplexity.cpp.o
[ 78%] Linking CXX executable ../../bin/perplexity
[ 78%] Built target perplexity
[ 79%] Building CXX object examples/quantize/CMakeFiles/quantize.dir/quantize.cpp.o
[ 80%] Linking CXX executable ../../bin/quantize
[ 80%] Built target quantize
[ 81%] Building CXX object examples/quantize-stats/CMakeFiles/quantize-stats.dir/quantize-stats.cpp.o
[ 82%] Linking CXX executable ../../bin/quantize-stats
[ 82%] Built target quantize-stats
[ 83%] Building CXX object examples/save-load-state/CMakeFiles/save-load-state.dir/save-load-state.cpp.o
[ 84%] Linking CXX executable ../../bin/save-load-state
[ 84%] Built target save-load-state
[ 85%] Building CXX object examples/simple/CMakeFiles/simple.dir/simple.cpp.o
[ 86%] Linking CXX executable ../../bin/simple
[ 86%] Built target simple
[ 87%] Building CXX object examples/speculative/CMakeFiles/speculative.dir/speculative.cpp.o
[ 89%] Linking CXX executable ../../bin/speculative
[ 89%] Built target speculative
[ 90%] Building CXX object examples/train-text-from-scratch/CMakeFiles/train-text-from-scratch.dir/train-text-from-scratch.cpp.o
[ 91%] Linking CXX executable ../../bin/train-text-from-scratch
[ 91%] Built target train-text-from-scratch
[ 92%] Building CXX object examples/server/CMakeFiles/server.dir/server.cpp.o
In copy constructor ‘task_result::task_result(const task_result&)’,
    inlined from ‘void __gnu_cxx::new_allocator<_Tp>::construct(_Up*, _Args&& ...) [with _Up = task_result; _Args = {const task_result&}; _Tp = task_result]’ at /usr/include/c++/11/ext/new_allocator.h:162:4,
    inlined from ‘static void std::allocator_traits<std::allocator<_Tp1> >::construct(std::allocator_traits<std::allocator<_Tp1> >::allocator_type&, _Up*, _Args&& ...) [with _Up = task_result; _Args = {const task_result&}; _Tp = task_result]’ at /usr/include/c++/11/bits/alloc_traits.h:516:17,
    inlined from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp = task_result; _Alloc = std::allocator<task_result>]’ at /usr/include/c++/11/bits/stl_vector.h:1192:30,
    inlined from ‘void llama_server_context::send_error(int, std::string)’ at /root/PowerInfer/examples/server/server.cpp:1097:32:
/root/PowerInfer/examples/server/server.cpp:154:8: warning: ‘res.task_result::stop’ may be used uninitialized [-Wmaybe-uninitialized]
  154 | struct task_result {
      |        ^~~~~~~~~~~
/root/PowerInfer/examples/server/server.cpp: In member function ‘void llama_server_context::send_error(int, std::string)’:
/root/PowerInfer/examples/server/server.cpp:1093:21: note: ‘res’ declared here
 1093 |         task_result res;
      |                     ^~~
In copy constructor ‘task_server::task_server(const task_server&)’,
    inlined from ‘void __gnu_cxx::new_allocator<_Tp>::construct(_Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/11/ext/new_allocator.h:162:4,
    inlined from ‘static void std::allocator_traits<std::allocator<_Tp1> >::construct(std::allocator_traits<std::allocator<_Tp1> >::allocator_type&, _Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/11/bits/alloc_traits.h:516:17,
    inlined from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp = task_server; _Alloc = std::allocator<task_server>]’ at /usr/include/c++/11/bits/stl_vector.h:1192:30,
    inlined from ‘int llama_server_context::request_completion(json, bool, bool)’ at /root/PowerInfer/examples/server/server.cpp:1259:30,
    inlined from ‘main(int, char**)::<lambda(const httplib::Request&, httplib::Response&)>’ at /root/PowerInfer/examples/server/server.cpp:2355:61:
/root/PowerInfer/examples/server/server.cpp:145:8: warning: ‘task.task_server::target_id’ may be used uninitialized [-Wmaybe-uninitialized]
  145 | struct task_server {
      |        ^~~~~~~~~~~
/root/PowerInfer/examples/server/server.cpp: In lambda function:
/root/PowerInfer/examples/server/server.cpp:1253:21: note: ‘task’ declared here
 1253 |         task_server task;
      |                     ^~~~
In copy constructor ‘task_server::task_server(const task_server&)’,
    inlined from ‘void __gnu_cxx::new_allocator<_Tp>::construct(_Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/11/ext/new_allocator.h:162:4,
    inlined from ‘static void std::allocator_traits<std::allocator<_Tp1> >::construct(std::allocator_traits<std::allocator<_Tp1> >::allocator_type&, _Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/11/bits/alloc_traits.h:516:17,
    inlined from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp = task_server; _Alloc = std::allocator<task_server>]’ at /usr/include/c++/11/bits/stl_vector.h:1192:30,
    inlined from ‘int llama_server_context::request_completion(json, bool, bool)’ at /root/PowerInfer/examples/server/server.cpp:1259:30,
    inlined from ‘main(int, char**)::<lambda(const httplib::Request&, httplib::Response&)>’ at /root/PowerInfer/examples/server/server.cpp:2410:61:
/root/PowerInfer/examples/server/server.cpp:145:8: warning: ‘task.task_server::target_id’ may be used uninitialized [-Wmaybe-uninitialized]
  145 | struct task_server {
      |        ^~~~~~~~~~~
/root/PowerInfer/examples/server/server.cpp: In lambda function:
/root/PowerInfer/examples/server/server.cpp:1253:21: note: ‘task’ declared here
 1253 |         task_server task;
      |                     ^~~~
In copy constructor ‘task_server::task_server(const task_server&)’,
    inlined from ‘void __gnu_cxx::new_allocator<_Tp>::construct(_Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/11/ext/new_allocator.h:162:4,
    inlined from ‘static void std::allocator_traits<std::allocator<_Tp1> >::construct(std::allocator_traits<std::allocator<_Tp1> >::allocator_type&, _Up*, _Args&& ...) [with _Up = task_server; _Args = {const task_server&}; _Tp = task_server]’ at /usr/include/c++/11/bits/alloc_traits.h:516:17,
    inlined from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp = task_server; _Alloc = std::allocator<task_server>]’ at /usr/include/c++/11/bits/stl_vector.h:1192:30,
    inlined from ‘int llama_server_context::request_completion(json, bool, bool)’ at /root/PowerInfer/examples/server/server.cpp:1259:30,
    inlined from ‘main(int, char**)::<lambda(const httplib::Request&, httplib::Response&)>’ at /root/PowerInfer/examples/server/server.cpp:2514:61:
/root/PowerInfer/examples/server/server.cpp:145:8: warning: ‘task.task_server::target_id’ may be used uninitialized [-Wmaybe-uninitialized]
  145 | struct task_server {
      |        ^~~~~~~~~~~
/root/PowerInfer/examples/server/server.cpp: In lambda function:
/root/PowerInfer/examples/server/server.cpp:1253:21: note: ‘task’ declared here
 1253 |         task_server task;
      |                     ^~~~
[ 93%] Linking CXX executable ../../bin/server
[ 93%] Built target server
[ 94%] Building CXX object examples/export-lora/CMakeFiles/export-lora.dir/export-lora.cpp.o
[ 95%] Linking CXX executable ../../bin/export-lora
[ 95%] Built target export-lora
[ 96%] Building CXX object pocs/vdot/CMakeFiles/vdot.dir/vdot.cpp.o
[ 97%] Linking CXX executable ../../bin/vdot
[ 97%] Built target vdot
[ 98%] Building CXX object pocs/vdot/CMakeFiles/q8dot.dir/q8dot.cpp.o
[100%] Linking CXX executable ../../bin/q8dot
[100%] Built target q8dot

Run log:

(Powerinfer2) root@autodl-container-02b744a905-865b3ab4:~/PowerInfer/build/bin# /root/PowerInfer/build/bin/main -m /root/autodl-tmp/llm/ReluLLaMA-13B-PowerInfer-GGUF/llama-13b-relu.powerinfer.gguf -n 128 -t 8 -p "In the depths of" --ignore-eos
Log start
main: build = 1579 (3f638f7)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1717592515
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090 D, compute capability 8.9
llama_model_loader: loaded meta data with 22 key-value pairs and 443 tensors from /root/autodl-tmp/llm/ReluLLaMA-13B-PowerInfer-GGUF/llama-13b-relu.powerinfer.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight f16      [  5120, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    2:          blk.0.ffn_down_t.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight f16      [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight f16      [  5120, 13824,     1,     1 ]
 ............
llama_model_loader: - tensor  440:                blk.38.fc2.weight f16      [  2048, 13824,     1,     1 ]
llama_model_loader: - tensor  441:                blk.39.fc1.weight f16      [  5120,  2048,     1,     1 ]
llama_model_loader: - tensor  442:                blk.39.fc2.weight f16      [  2048, 13824,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                       llama.rope.freq_base f32     
llama_model_loader: - kv  11:                          general.file_type u32     
llama_model_loader: - kv  12:                       tokenizer.ggml.model str     
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32     
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32     
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool    
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool    
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type  f16:  362 tensors
llama_model_load: PowerInfer model loaded. Sparse inference will be used.
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = mostly F16
llm_load_print_meta: model params     = 14.16 B
llm_load_print_meta: model size       = 26.38 GiB (16.00 BPW) 
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_print_meta: sparse_pred_threshold = 0.00
llm_load_sparse_model_tensors: ggml ctx size =    0.16 MB
llm_load_sparse_model_tensors: using CUDA for GPU acceleration
llm_load_sparse_model_tensors: offloaded layers from VRAM budget(24735645696 bytes): 41/40
llm_load_sparse_model_tensors: mem required  = 27009.74 MB
llm_load_sparse_model_tensors: VRAM used: 10497.08 MB
............................................................
CUDA error 1 at /root/PowerInfer/ggml-cuda.cu:8949: invalid argument
current device: 0
hodlen commented 5 months ago

I ran the same model on exactly the same hardware and software combination, on the latest version of the code, and could not reproduce this issue. You might use a tool such as md5sum to check whether the model file is corrupted, then clean out all build artifacts and recompile, to see whether the same problem still occurs.

Also, the failing code is at a cudaMemcpy executed while the model is being loaded onto the GPU, which is not a common point of failure. If the model file is fine, the problem may lie in the runtime environment or the GPU itself; problems like that are usually hard to diagnose.
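Concretely, the check and the clean rebuild could look like this (the model path is the one from the report; compare the md5 against a known-good download, since no official checksum is posted in this thread):

# Check the model file for corruption.
md5sum /root/autodl-tmp/llm/ReluLLaMA-13B-PowerInfer-GGUF/llama-13b-relu.powerinfer.gguf

# Remove all build artifacts and rebuild from scratch.
rm -rf build
cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release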

NeverGpDzy commented 5 months ago

I checked, and it was indeed a corrupted model file. Thank you very much for your reply 🙏