Open c469591 opened 10 months ago
I encountered the same issue as well.
Using CMake:
1. Cloned the PowerInfer repository:
   git clone https://github.com/bobozi-cmd/PowerInfer
   cd PowerInfer
2. Installed dependencies:
   pip install -r requirements.txt
3. Ran the CMake configuration:
   cmake -S . -B build -DLLAMA_CUBLAS=ON
Error Encountered:
CMake Error at D:/yy/CMake/share/cmake-3.28/Modules/CMakeDetermineCompilerId.cmake:529 (message):
No CUDA toolset found.
Call Stack (most recent call first):
D:/yy/CMake/share/cmake-3.28/Modules/CMakeDetermineCompilerId.cmake:8 (CMAKE_DETERMINE_COMPILER_ID_BUILD)
D:/yy/CMake/share/cmake-3.28/Modules/CMakeDetermineCompilerId.cmake:53 (__determine_compiler_id_test)
D:/yy/CMake/share/cmake-3.28/Modules/CMakeDetermineCUDACompiler.cmake:135 (CMAKE_DETERMINE_COMPILER_ID)
CMakeLists.txt:258 (enable_language)
-- Configuring incomplete, errors occurred!
Using w64devkit:
1. Downloaded the latest Fortran version of w64devkit.
2. Executed w64devkit:
   w64devkit.exe
3. Navigated to the PowerInfer folder:
   cd PowerInfer
4. Attempted to build using make.
Error Encountered:
In file included from ggml.h:217,
from ggml-impl.h:3,
from ggml.c:4:
atomic_windows.h: In function '__msvc_xchg_i8':
atomic_windows.h:103:12: error: implicit declaration of function '_InterlockedExchange8'; did you mean '_InterlockedExchange'? [-Werror=implicit-function-declaration]
103 | return _InterlockedExchange8(addr, val);
| ^~~~~~~~~~~~~~~~~~~~~
| _InterlockedExchange
atomic_windows.h: In function '__msvc_xchg_i16':
atomic_windows.h:107:12: error: implicit declaration of function '_InterlockedExchange16'; did you mean '_InterlockedExchange'? [-Werror=implicit-function-declaration]
107 | return _InterlockedExchange16(addr, val);
| ^~~~~~~~~~~~~~~~~~~~~~
| _InterlockedExchange
atomic_windows.h: In function '__msvc_xchg_i32':
atomic_windows.h:111:33: warning: passing argument 1 of '_InterlockedExchange' from incompatible pointer type [-Wincompatible-pointer-types]
111 | return _InterlockedExchange(addr, val);
| ^~~~
| |
| volatile int *
In file included from D:/yy/w64devkit/x86_64-w64-mingw32/include/winnt.h:27,
from D:/yy/w64devkit/x86_64-w64-mingw32/include/minwindef.h:163,
from D:/yy/w64devkit/x86_64-w64-mingw32/include/windef.h:9,
from D:/yy/w64devkit/x86_64-w64-mingw32/include/windows.h:69,
from atomic_windows.h:29:
D:/yy/w64devkit/x86_64-w64-mingw32/include/psdk_inc/intrin-impl.h:1714:50: note: expected 'volatile long int *' but argument is of type 'volatile int *'
1714 | __LONG32 _InterlockedExchange(__LONG32 volatile *Target, __LONG32 Value) {
| ^
atomic_windows.h: In function '__msvc_cmpxchg_i8':
atomic_windows.h:186:12: error: implicit declaration of function '_InterlockedCompareExchange8'; did you mean '_InterlockedCompareExchange'? [-Werror=implicit-function-declaration]
186 | return _InterlockedCompareExchange8((__int8 volatile*)addr, newval, oldval);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
| _InterlockedCompareExchange
atomic_windows.h: In function '__msvc_cmpxchg_i32':
atomic_windows.h:194:40: warning: passing argument 1 of '_InterlockedCompareExchange' from incompatible pointer type [-Wincompatible-pointer-types]
194 | return _InterlockedCompareExchange((__int32 volatile*)addr, newval, oldval);
| ^~~~~~~~~~~~~~~~~~~~~~~
| |
| volatile int *
D:/yy/w64devkit/x86_64-w64-mingw32/include/psdk_inc/intrin-impl.h:1659:57: note: expected 'volatile long int *' but argument is of type 'volatile int *'
1659 | __LONG32 _InterlockedCompareExchange(__LONG32 volatile *Destination, __LONG32 ExChange, __LONG32 Comperand) {
| ^
atomic_windows.h: In function '__msvc_xadd_i8':
atomic_windows.h:279:12: error: implicit declaration of function '_InterlockedExchangeAdd8'; did you mean '_InterlockedExchangeAdd'? [-Werror=implicit-function-declaration]
279 | return _InterlockedExchangeAdd8(addr, val);
| ^~~~~~~~~~~~~~~~~~~~~~~~
| _InterlockedExchangeAdd
atomic_windows.h: In function '__msvc_xadd_i16':
atomic_windows.h:283:12: error: implicit declaration of function '_InterlockedExchangeAdd16'; did you mean '_InterlockedExchangeAdd'? [-Werror=implicit-function-declaration]
283 | return _InterlockedExchangeAdd16(addr, val);
| ^~~~~~~~~~~~~~~~~~~~~~~~~
| _InterlockedExchangeAdd
atomic_windows.h: In function '__msvc_xadd_i32':
atomic_windows.h:287:36: warning: passing argument 1 of '_InterlockedExchangeAdd' from incompatible pointer type [-Wincompatible-pointer-types]
287 | return _InterlockedExchangeAdd(addr, val);
| ^~~~
| |
| volatile int *
D:/yy/w64devkit/x86_64-w64-mingw32/include/psdk_inc/intrin-impl.h:1648:53: note: expected 'volatile long int *' but argument is of type 'volatile int *'
1648 | __LONG32 _InterlockedExchangeAdd(__LONG32 volatile *Addend, __LONG32 Value) {
| ^
In function 'ggml_op_name',
inlined from 'ggml_get_n_tasks' at ggml.c:16954:17:
ggml.c:2004:24: warning: array subscript 70 is above array bounds of 'const char *[69]' [-Warray-bounds=]
2004 | return GGML_OP_NAME[op];
| ~~~~~~~~~~~~^~~~
ggml.c: In function 'ggml_get_n_tasks':
ggml.c:1586:21: note: while referencing 'GGML_OP_NAME'
1586 | static const char * GGML_OP_NAME[GGML_OP_COUNT] = {
| ^~~~~~~~~~~~
In function 'ggml_compute_forward_add_f32',
inlined from 'ggml_compute_forward_add' at ggml.c:7262:17:
ggml.c:6995:40: warning: 'ft' may be used uninitialized [-Wmaybe-uninitialized]
6995 | dst_ptr[i] = ft[i] >= 0.0f ? src0_ptr[i] + src1_ptr[i] : 0;
| ^
ggml.c: In function 'ggml_compute_forward_add':
ggml.c:6960:12: note: 'ft' was declared here
6960 | float *ft;
| ^~
cc1.exe: some warnings being treated as errors
make: *** [Makefile:533: ggml.o] Error 1
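(For context: the implicit-declaration errors indicate that the MinGW-w64 headers shipped with w64devkit only declare the 32/64-bit _Interlocked* intrinsics, not the 8/16-bit variants that atomic_windows.h expects from MSVC, and their _InterlockedExchange takes volatile long * rather than volatile int *. A minimal sketch of a compiler guard that would sidestep this, assuming GCC's __atomic builtins are available on MinGW-w64; the helper names here are illustrative, not the actual identifiers in atomic_windows.h:)

    #include <stdint.h>

    #if defined(_MSC_VER)
    #include <intrin.h>
    /* MSVC declares the 8-bit interlocked intrinsic. */
    static inline int8_t xchg_i8(volatile int8_t *addr, int8_t val) {
        return _InterlockedExchange8((volatile char *)addr, val);
    }
    static inline int32_t xchg_i32(volatile int32_t *addr, int32_t val) {
        /* LONG is 32 bits on Windows; the cast avoids the pointer-type warning. */
        return _InterlockedExchange((volatile long *)addr, val);
    }
    #else
    /* MinGW-w64 GCC lacks the 8/16-bit intrinsics; use the GCC atomic builtins. */
    static inline int8_t xchg_i8(volatile int8_t *addr, int8_t val) {
        return __atomic_exchange_n(addr, val, __ATOMIC_SEQ_CST);
    }
    static inline int32_t xchg_i32(volatile int32_t *addr, int32_t val) {
        return __atomic_exchange_n(addr, val, __ATOMIC_SEQ_CST);
    }
    #endif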
I believe I have found a solution to the issue:
For more details on this "No CUDA toolset found" CMake error on Windows, you can refer to the Stack Overflow post "CUDA compile problems on Windows, CMake error: no CUDA toolset found".
This problem usually occurs because the Visual Studio Integration is missing when installing CUDA. Here's what I did:
Navigate to the installation directory of your CUDA, for example:
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\extras\visual_studio_integration\MSBuildExtensions
Find the four MSBuild integration files there (for CUDA 11.7 these are typically CUDA 11.7.props, CUDA 11.7.targets, CUDA 11.7.xml, and Nvda.Build.CudaTasks.v11.7.dll).
Copy and replace them in the corresponding BuildCustomizations folder of your Visual Studio installation, e.g.:
C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\BuildCustomizations
or, for the standalone Build Tools:
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations
Make sure to adjust the paths to your CUDA installation and Visual Studio directories, and remember to create backups.
After these steps, the issue should be resolved.
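As an alternative that avoids copying files, CMake's Visual Studio generators can also be pointed at the toolkit directly through the toolset option. A sketch, assuming CUDA 11.7 at its default install path (I have not verified this on the exact setup above):

    cmake -S . -B build -G "Visual Studio 17 2022" -T cuda="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7" -DLLAMA_CUBLAS=ON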
Date and Time: 2024-01-19 21:41 (Edited)
Environment: Windows
Model: ReluLLaMA-7B-PowerInfer-GGUF
Run Output:
llm_load_gpu_split: offloaded 0.00 MiB of FFN weights to GPU
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 256.00 MB
llama_build_graph: non-view tensors processed: 548/836
llama_build_graph: ****************************************************************
llama_build_graph: not all non-view tensors have been processed with a callback
llama_build_graph: this can indicate an inefficiency in the graph implementation
llama_build_graph: build with LLAMA_OFFLOAD_DEBUG for more info
llama_build_graph: ref: https://github.com/ggerganov/llama.cpp/pull/3837
llama_build_graph: ****************************************************************
llama_new_context_with_model: compute buffer total size = 6.91 MB
llama_new_context_with_model: VRAM scratch buffer: 5.34 MB
llama_new_context_with_model: total VRAM used: 3269.75 MB (model: 3264.41 MB, context: 5.34 MB)
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 32, n_predict = 128, n_keep = 0
Once upon a time there lived three brothers: Hodja, Sinan and Ali. It is told that these three men were very wise and clever, but the only one who was wiser than them all was their father.
Their father was so wise that he could tell what people would do before they did it. This knowledge made him famous all over the world. People came to him from every corner of the earth asking for his advice and guidance. Every day, when these three brothers went to school, they were always very hungry because they had nothing to eat at home.
One night, their father gave each boy a walnut.
llama_print_timings: load time = 14126.55 ms
llama_print_timings: sample time = 35.82 ms / 128 runs ( 0.28 ms per token, 3573.42 tokens per second)
llama_print_timings: prompt eval time = 10247.01 ms / 5 tokens ( 2049.40 ms per token, 0.49 tokens per second)
llama_print_timings: eval time = 88055.01 ms / 127 runs ( 693.35 ms per token, 1.44 tokens per second)
llama_print_timings: total time = 100799.35 ms
Log end
It works great for me.
Thanks @aoguai for your informative reply!
We also encountered this issue in dev and managed to fix it by removing all whitespace from every CUDA environment variable, e.g. replacing C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\... with C:\ProgramFiles\NVIDIAGPUComputingToolkit\CUDA\.... If your CUDA toolkit is properly installed and you still struggle with this issue, please give it a try!
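A hypothetical illustration of the idea, assuming the toolkit was reinstalled or copied to a space-free location (CUDA_PATH and CUDA_PATH_V11_7 are the variables the CUDA installer normally sets):

    set CUDA_PATH=C:\CUDA\v11.7
    set CUDA_PATH_V11_7=C:\CUDA\v11.7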
Thank you everyone, my issue has been successfully resolved. Could this project be modified to support an interactive chat mode for inference? Although inference now runs smoothly, I have to re-enter a complete inference command each time, and the output seems incomplete and even includes some other evaluation output. Are there any projects that have already built on this one to create a chat tool that general users can use directly? Could this project be made to keep running instead of exiting immediately after inference completes? Thanks!
There are various ways to chat with these models interactively; the simplest is to start a server (see examples/server). It provides a simple web UI to chat with and should meet your needs. Please kindly refer to #126.
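A minimal sketch of starting it, assuming a Windows build output under build\bin and a placeholder model path (adjust both to your setup):

    .\build\bin\server.exe -m path\to\ReluLLaMA-7B-PowerInfer-GGUF\model.powerinfer.gguf -c 512

By default the server listens on http://127.0.0.1:8080; open that in a browser for the web UI.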
Question Details
Hello, I encountered an error while using CMake. My system is Windows 10 with Python 3.11 and an NVIDIA 3060. Below is the error report. I have installed CUDA correctly.
Additional Context
Windows 10, Python 3.11, NVIDIA 3060. The repository was cloned today, and I installed the latest stable version of CMake today.