SJTU-IPADS / PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
MIT License

No CUDA toolset found #119

Open c469591 opened 8 months ago

c469591 commented 8 months ago

Question Details

Hello, I encountered an error while running CMake. My system is Windows 10 with Python 3.11 and an NVIDIA 3060, and I have installed CUDA correctly. The error output is below.

(llm) I:\llm>cmake -S . -B build -DLLAMA_CUBLAS=ON
-- Selecting Windows SDK version 10.0.22621.0 to target Windows 10.0.19045.
-- cuBLAS found
CMake Error at C:/Program Files/CMake/share/cmake-3.28/Modules/CMakeDetermineCompilerId.cmake:529 (message):
  No CUDA toolset found.
Call Stack (most recent call first):
  C:/Program Files/CMake/share/cmake-3.28/Modules/CMakeDetermineCompilerId.cmake:8 (CMAKE_DETERMINE_COMPILER_ID_BUILD)
  C:/Program Files/CMake/share/cmake-3.28/Modules/CMakeDetermineCompilerId.cmake:53 (__determine_compiler_id_test)
  C:/Program Files/CMake/share/cmake-3.28/Modules/CMakeDetermineCUDACompiler.cmake:135 (CMAKE_DETERMINE_COMPILER_ID)
  CMakeLists.txt:258 (enable_language)

-- Configuring incomplete, errors occurred!

Additional Context

Windows 10, Python 3.11, NVIDIA 3060. I cloned the repository today and installed the latest stable version of CMake today.

aoguai commented 8 months ago

I encountered the same issue as well.

Environment Information:

Error during PowerInfer Setup:

  1. Using CMake:

    • Cloned PowerInfer repository:

      git clone https://github.com/bobozi-cmd/PowerInfer
      cd PowerInfer
    • Installed dependencies:

      pip install -r requirements.txt
    • Ran CMake configuration:

      cmake -S . -B build -DLLAMA_CUBLAS=ON
    • Error Encountered:

      CMake Error at D:/yy/CMake/share/cmake-3.28/Modules/CMakeDetermineCompilerId.cmake:529 (message):
      No CUDA toolset found.
      Call Stack (most recent call first):
      D:/yy/CMake/share/cmake-3.28/Modules/CMakeDetermineCompilerId.cmake:8 (CMAKE_DETERMINE_COMPILER_ID_BUILD)
      D:/yy/CMake/share/cmake-3.28/Modules/CMakeDetermineCompilerId.cmake:53 (__determine_compiler_id_test)
      D:/yy/CMake/share/cmake-3.28/Modules/CMakeDetermineCUDACompiler.cmake:135 (CMAKE_DETERMINE_COMPILER_ID)
      CMakeLists.txt:258 (enable_language)
      
      -- Configuring incomplete, errors occurred!
      
  2. Using w64devkit:

    • Downloaded the latest Fortran version of w64devkit.

    • Executed w64devkit:

      w64devkit.exe
    • Navigated to PowerInfer folder:

      cd PowerInfer
    • Attempted to build using make.

    • Error Encountered:

      In file included from ggml.h:217,
                     from ggml-impl.h:3,
                     from ggml.c:4:
      atomic_windows.h: In function '__msvc_xchg_i8':
      atomic_windows.h:103:12: error: implicit declaration of function '_InterlockedExchange8'; did you mean '_InterlockedExchange'? [-Werror=implicit-function-declaration]
      103 |     return _InterlockedExchange8(addr, val);
          |            ^~~~~~~~~~~~~~~~~~~~~
          |            _InterlockedExchange
      atomic_windows.h: In function '__msvc_xchg_i16':
      atomic_windows.h:107:12: error: implicit declaration of function '_InterlockedExchange16'; did you mean '_InterlockedExchange'? [-Werror=implicit-function-declaration]
      107 |     return _InterlockedExchange16(addr, val);
          |            ^~~~~~~~~~~~~~~~~~~~~~
          |            _InterlockedExchange
      atomic_windows.h: In function '__msvc_xchg_i32':
      atomic_windows.h:111:33: warning: passing argument 1 of '_InterlockedExchange' from incompatible pointer type [-Wincompatible-pointer-types]
      111 |     return _InterlockedExchange(addr, val);
          |                                 ^~~~
          |                                 |
          |                                 volatile int *
      In file included from D:/yy/w64devkit/x86_64-w64-mingw32/include/winnt.h:27,
                     from D:/yy/w64devkit/x86_64-w64-mingw32/include/minwindef.h:163,
                     from D:/yy/w64devkit/x86_64-w64-mingw32/include/windef.h:9,
                     from D:/yy/w64devkit/x86_64-w64-mingw32/include/windows.h:69,
                     from atomic_windows.h:29:
      D:/yy/w64devkit/x86_64-w64-mingw32/include/psdk_inc/intrin-impl.h:1714:50: note: expected 'volatile long int *' but argument is of type 'volatile int *'
      1714 | __LONG32 _InterlockedExchange(__LONG32 volatile *Target, __LONG32 Value) {
          |                                                  ^
      atomic_windows.h: In function '__msvc_cmpxchg_i8':
      atomic_windows.h:186:12: error: implicit declaration of function '_InterlockedCompareExchange8'; did you mean '_InterlockedCompareExchange'? [-Werror=implicit-function-declaration]
      186 |     return _InterlockedCompareExchange8((__int8 volatile*)addr, newval, oldval);
          |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
          |            _InterlockedCompareExchange
      atomic_windows.h: In function '__msvc_cmpxchg_i32':
      atomic_windows.h:194:40: warning: passing argument 1 of '_InterlockedCompareExchange' from incompatible pointer type [-Wincompatible-pointer-types]
      194 |     return _InterlockedCompareExchange((__int32 volatile*)addr, newval, oldval);
          |                                        ^~~~~~~~~~~~~~~~~~~~~~~
          |                                        |
          |                                        volatile int *
      D:/yy/w64devkit/x86_64-w64-mingw32/include/psdk_inc/intrin-impl.h:1659:57: note: expected 'volatile long int *' but argument is of type 'volatile int *'
      1659 | __LONG32 _InterlockedCompareExchange(__LONG32 volatile *Destination, __LONG32 ExChange, __LONG32 Comperand) {
          |                                                         ^
      atomic_windows.h: In function '__msvc_xadd_i8':
      atomic_windows.h:279:12: error: implicit declaration of function '_InterlockedExchangeAdd8'; did you mean '_InterlockedExchangeAdd'? [-Werror=implicit-function-declaration]
      279 |     return _InterlockedExchangeAdd8(addr, val);
          |            ^~~~~~~~~~~~~~~~~~~~~~~~
          |            _InterlockedExchangeAdd
      atomic_windows.h: In function '__msvc_xadd_i16':
      atomic_windows.h:283:12: error: implicit declaration of function '_InterlockedExchangeAdd16'; did you mean '_InterlockedExchangeAdd'? [-Werror=implicit-function-declaration]
      283 |     return _InterlockedExchangeAdd16(addr, val);
          |            ^~~~~~~~~~~~~~~~~~~~~~~~~
          |            _InterlockedExchangeAdd
      atomic_windows.h: In function '__msvc_xadd_i32':
      atomic_windows.h:287:36: warning: passing argument 1 of '_InterlockedExchangeAdd' from incompatible pointer type [-Wincompatible-pointer-types]
      287 |     return _InterlockedExchangeAdd(addr, val);
          |                                    ^~~~
          |                                    |
          |                                    volatile int *
      D:/yy/w64devkit/x86_64-w64-mingw32/include/psdk_inc/intrin-impl.h:1648:53: note: expected 'volatile long int *' but argument is of type 'volatile int *'
      1648 | __LONG32 _InterlockedExchangeAdd(__LONG32 volatile *Addend, __LONG32 Value) {
          |                                                     ^
      In function 'ggml_op_name',
        inlined from 'ggml_get_n_tasks' at ggml.c:16954:17:
      ggml.c:2004:24: warning: array subscript 70 is above array bounds of 'const char *[69]' [-Warray-bounds=]
      2004 |     return GGML_OP_NAME[op];
          |            ~~~~~~~~~~~~^~~~
      ggml.c: In function 'ggml_get_n_tasks':
      ggml.c:1586:21: note: while referencing 'GGML_OP_NAME'
      1586 | static const char * GGML_OP_NAME[GGML_OP_COUNT] = {
          |                     ^~~~~~~~~~~~
      In function 'ggml_compute_forward_add_f32',
        inlined from 'ggml_compute_forward_add' at ggml.c:7262:17:
      ggml.c:6995:40: warning: 'ft' may be used uninitialized [-Wmaybe-uninitialized]
      6995 |                         dst_ptr[i] = ft[i] >= 0.0f ? src0_ptr[i] + src1_ptr[i] : 0;
          |                                        ^
      ggml.c: In function 'ggml_compute_forward_add':
      ggml.c:6960:12: note: 'ft' was declared here
      6960 |     float *ft;
          |            ^~
      cc1.exe: some warnings being treated as errors
      make: *** [Makefile:533: ggml.o] Error 1

aoguai commented 8 months ago

I believe I have found a solution to the issue:

You can refer to the following Stack Overflow post for more details on the CUDA compilation issue on Windows with CMake error "No CUDA Toolset": c++ - CUDA compile problems on Windows, CMake error: no CUDA toolset found - Stack Overflow

This problem usually occurs because the Visual Studio Integration component was not installed along with CUDA. Here's what I did:

  1. Navigate to the installation directory of your CUDA, for example: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\extras\visual_studio_integration\MSBuildExtensions

  2. Find these four files:

    • CUDA 11.7.props
    • CUDA 11.7.targets
    • CUDA 11.7.xml
    • Nvda.Build.CudaTasks.v11.7.dll
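  3. Copy them into the BuildCustomizations directory of your Visual Studio installation, replacing any existing copies. Depending on the edition, the path looks like one of these: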

    • C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\BuildCustomizations
    • C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations

Make sure to adjust the paths to your CUDA installation and Visual Studio directories, and remember to create backups.

After these steps, the issue should be resolved.
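
For anyone who prefers to script the copy, here is a minimal Python sketch of the three steps above. The function name `copy_cuda_msbuild_extensions` and the `.bak` backup naming are my own choices for illustration, not part of CUDA or Visual Studio. The glob patterns assume the files follow the naming shown in step 2 for your CUDA version.

```python
import shutil
from pathlib import Path

# Glob patterns for the four MSBuild integration files listed in step 2.
# The file names embed the CUDA version (e.g. "CUDA 11.7.props").
PATTERNS = (
    "CUDA *.props",
    "CUDA *.targets",
    "CUDA *.xml",
    "Nvda.Build.CudaTasks.*.dll",
)


def copy_cuda_msbuild_extensions(cuda_ext_dir, vs_build_customizations_dir):
    """Copy the CUDA MSBuild extension files into a Visual Studio
    BuildCustomizations directory (hypothetical helper).

    Any file that would be overwritten is first saved as "<name>.bak".
    Returns the list of file names that were copied.
    """
    src = Path(cuda_ext_dir)
    dst = Path(vs_build_customizations_dir)
    copied = []
    for pattern in PATTERNS:
        for file in sorted(src.glob(pattern)):
            target = dst / file.name
            if target.exists():
                # Keep a backup next to the existing file before replacing it.
                shutil.copy2(target, target.with_name(target.name + ".bak"))
            shutil.copy2(file, target)
            copied.append(file.name)
    return copied


# Example call (adjust both paths to your own installation):
# copy_cuda_msbuild_extensions(
#     r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\extras\visual_studio_integration\MSBuildExtensions",
#     r"C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\BuildCustomizations",
# )
```

Writing under Program Files will most likely require running the script from an elevated (administrator) prompt.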


Date and Time: 2024-01-19 21:41 (Edited)

Environment: Windows

Hardware Configuration:

Run Output:

llm_load_gpu_split: offloaded 0.00 MiB of FFN weights to GPU
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =  256.00 MB
llama_build_graph: non-view tensors processed: 548/836
llama_build_graph: ****************************************************************
llama_build_graph: not all non-view tensors have been processed with a callback
llama_build_graph: this can indicate an inefficiency in the graph implementation
llama_build_graph: build with LLAMA_OFFLOAD_DEBUG for more info
llama_build_graph: ref: https://github.com/ggerganov/llama.cpp/pull/3837
llama_build_graph: ****************************************************************
llama_new_context_with_model: compute buffer total size = 6.91 MB
llama_new_context_with_model: VRAM scratch buffer: 5.34 MB
llama_new_context_with_model: total VRAM used: 3269.75 MB (model: 3264.41 MB, context: 5.34 MB)

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 32, n_predict = 128, n_keep = 0

Once upon a time there lived three brothers: Hodja, Sinan and Ali. It is told that these three men were very wise and clever, but the only one who was wiser than them all was their father.
Their father was so wise that he could tell what people would do before they did it. This knowledge made him famous all over the world. People came to him from every corner of the earth asking for his advice and guidance. Every day, when these three brothers went to school, they were always very hungry because they had nothing to eat at home.
One night, their father gave each boy a walnut.
llama_print_timings:        load time =   14126.55 ms
llama_print_timings:      sample time =      35.82 ms /   128 runs   (    0.28 ms per token,  3573.42 tokens per second)
llama_print_timings: prompt eval time =   10247.01 ms /     5 tokens ( 2049.40 ms per token,     0.49 tokens per second)
llama_print_timings:        eval time =   88055.01 ms /   127 runs   (  693.35 ms per token,     1.44 tokens per second)
llama_print_timings:       total time =  100799.35 ms
Log end

It works great for me.

hodlen commented 8 months ago

Thanks @aoguai for your informative reply!

We also encountered this issue in dev and managed to fix it by removing all whitespace from every CUDA environment variable, for example by replacing C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\... with C:\ProgramFiles\NVIDIAGPUComputing Toolkit\CUDA\.... If your CUDA toolkit is properly installed and you still struggle with this issue, please give it a try!
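
To see which variables this would affect, a small Python sketch like the one below can list them. It assumes the relevant variables are the ones whose names start with "CUDA" (such as CUDA_PATH); the comment above does not name them, so treat that filter and the function name `cuda_vars_with_whitespace` as illustrative assumptions.

```python
import os


def cuda_vars_with_whitespace(env=None):
    """Return {name: value} for environment variables whose name starts
    with "CUDA" (an assumed filter) and whose value contains whitespace."""
    env = os.environ if env is None else env
    return {
        name: value
        for name, value in env.items()
        if name.upper().startswith("CUDA")
        and any(ch.isspace() for ch in value)
    }


# Example: print every affected variable in the current environment.
# for name, value in cuda_vars_with_whitespace().items():
#     print(f"{name} = {value}")
```

The sketch only reports the variables; it does not change them or move the CUDA installation.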

c469591 commented 8 months ago

Thank you everyone, my issue has been successfully resolved. Can this project be modified to use an interactive dialogue chat mode for inference? Although I can infer smoothly at the moment, each time I need to re-enter a complete inference command, and the output of the inference seems incomplete and even includes some other evaluation outputs. Are there any other projects that have already applied this one to create a chat tool that general users can directly use? Can this project be made to continue running instead of exiting immediately after the inference is complete? Thanks!

hodlen commented 8 months ago

There are various ways to chat with these models interactively, and the simplest one is to start a server (see examples/server). It provides a simple web UI to chat with and should match your needs. Please refer to #126.
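
Besides the web UI, a running server can also be queried from a script, which avoids retyping the full inference command each time. The sketch below is only an illustration under stated assumptions: it assumes the server listens on http://127.0.0.1:8080 and exposes a POST /completion endpoint that accepts `prompt` and `n_predict` and returns a JSON object with a `content` field, as the upstream llama.cpp server example does. Please check examples/server in your checkout before relying on any of these names.

```python
import json
import urllib.request


def build_completion_request(prompt, n_predict=128, base_url="http://127.0.0.1:8080"):
    """Build (but do not send) a POST request for the assumed /completion endpoint."""
    payload = {"prompt": prompt, "n_predict": n_predict}
    return urllib.request.Request(
        base_url + "/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text.

    Assumes the response body is JSON with a "content" field.
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["content"]


# Example (requires a server that is already running):
# print(complete("Once upon a time", n_predict=64))
```

Because the server process stays alive between requests, the model is loaded once and each call only pays for generation.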