LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Error building with MAKE_CUBLAS #91

Closed: horenbergerb closed this issue 1 year ago

horenbergerb commented 1 year ago

Ubuntu 22.04.2 LTS with Nvidia 3060 ti GPU. CUDA version given by

/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Using llama.cpp, make clean && LLAMA_CUBLAS=1 make succeeds and seems to work; the prompt loads faster than it does with a build that omits LLAMA_CUBLAS.

However, for koboldcpp, I'm getting errors that seem to relate to the CUDA/cuBLAS declarations not being found:

ggml.c:7748:74: error: ‘cudaMemcpyHostToDevice’ undeclared (first use in this function)
 7748 |    CUDA_CHECK(cudaMemcpyAsync(d_X, x, sizeof(float) * x_ne, cudaMemcpyHostToDevice, cudaStream));
      |                                                             ^~~~~~~~~~~~~~~~~~~~~~
ggml.c:7748:74: note: each undeclared identifier is reported only once for each function it appears in
ggml.c:7748:98: error: ‘cudaStream’ undeclared (first use in this function)
 7748 |    CUDA_CHECK(cudaMemcpyAsync(d_X, x, sizeof(float) * x_ne, cudaMemcpyHostToDevice, cudaStream));
...
ggml.c:7753:33: error: ‘cublasH’ undeclared (first use in this function)
 7753 |                     cublasSgemm(cublasH, CUBLAS_OP_T, CUBLAS_OP_N,
      |                                 ^~~~~~~
ggml.c:7753:42: error: ‘CUBLAS_OP_T’ undeclared (first use in this function)
 7753 |                     cublasSgemm(cublasH, CUBLAS_OP_T, CUBLAS_OP_N,
      |                                          ^~~~~~~~~~~
ggml.c:7753:55: error: ‘CUBLAS_OP_N’ undeclared (first use in this function)
 7753 |                     cublasSgemm(cublasH, CUBLAS_OP_T, CUBLAS_OP_N,
      |                                                       ^~~~~~~~~~~
ggml.c:7760:74: error: ‘cudaMemcpyDeviceToHost’ undeclared (first use in this function)
 7760 |    CUDA_CHECK(cudaMemcpyAsync(d, d_D, sizeof(float) * d_ne, cudaMemcpyDeviceToHost, cudaStream));
      |                                                             ^~~~~~~~~~~~~~~~~~~~~~
...

Not sure why I'm hitting this with koboldcpp but not with llama.cpp. Any ideas what's up?
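
In case it helps diagnose: cudaMemcpyHostToDevice and friends come from the CUDA headers, while cublasH and cudaStream are handles that ggml.c itself declares inside its cuBLAS guard block, so the pattern suggests the cuBLAS call sites are being compiled while the matching include/declaration block is not. A rough sketch of the flags a working compile would need (the macro name and paths are my assumption, taken from llama.cpp's Makefile rather than koboldcpp's):

# sketch only: assumes CUDA 11.8 under /usr/local/cuda and llama.cpp's
# GGML_USE_CUBLAS guard macro; koboldcpp's Makefile may use different names
cc -O3 -DGGML_USE_CUBLAS -I/usr/local/cuda/include -c ggml.c -o ggml.o
cc ggml.o <other objects> -L/usr/local/cuda/lib64 -lcublas -lcudart -o main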

horenbergerb commented 1 year ago

I just did a comparison in llama.cpp of LLAMA_OPENBLAS vs LLAMA_CUBLAS:

W/ Cublas

llama_print_timings:        load time = 20868.04 ms
llama_print_timings:      sample time =    17.93 ms /    50 runs   (    0.36 ms per run)
llama_print_timings: prompt eval time = 33592.02 ms /   813 tokens (   41.32 ms per token)
llama_print_timings:        eval time = 18746.89 ms /    49 runs   (  382.59 ms per run)
llama_print_timings:       total time = 55405.74 ms

W/ Openblas

llama_print_timings:        load time = 39263.11 ms
llama_print_timings:      sample time =    18.26 ms /    50 runs   (    0.37 ms per run)
llama_print_timings: prompt eval time = 64053.96 ms /   813 tokens (   78.79 ms per token)
llama_print_timings:        eval time = 18722.13 ms /    49 runs   (  382.08 ms per run)
llama_print_timings:       total time = 85293.59 ms

With small prompts I couldn't see a difference, but with a larger prompt it seems fairly clear that cuBLAS is working (almost a 2x speedup on prompt eval time).
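
To reproduce the comparison, something along these lines works; the model path, prompt file, and -n value here are placeholders rather than my exact invocation:

# cuBLAS build and timed run
make clean && LLAMA_CUBLAS=1 make
./main -m models/13B/ggml-model-q4_0.bin -f long-prompt.txt -n 50

# OpenBLAS build and timed run
make clean && LLAMA_OPENBLAS=1 make
./main -m models/13B/ggml-model-q4_0.bin -f long-prompt.txt -n 50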

LostRuins commented 1 year ago

@horenbergerb how does it compare with CLBlast? The main issue with cuBLAS is the requirement for huge libraries just to get it working; I presume you had to install about 3 GB worth of CUDA gubbins?

horenbergerb commented 1 year ago

Here's the same test run from koboldcpp after make main LLAMA_CLBLAST=1:

W/ CLBlast

llama_print_timings:        load time = 20971.86 ms
llama_print_timings:      sample time =    18.19 ms /    50 runs   (    0.36 ms per run)
llama_print_timings: prompt eval time = 30499.88 ms /   813 tokens (   37.52 ms per token)
llama_print_timings:        eval time = 18636.46 ms /    49 runs   (  380.34 ms per run)

This is probably just evidence that I'm doing something wrong, but it seems that in my case I don't stand to gain much from cuBLAS. Regarding the CUDA libraries, I would believe it, but I wouldn't know; my rig is practically built around accommodating CUDA and NVIDIA at this point.
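
For anyone who does want to check, the toolkit's footprint can be measured with something like this, assuming the default install path:

# rough size of the installed CUDA toolkit (path is the usual default)
du -sh /usr/local/cuda/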

Alice-WT commented 1 year ago

@LostRuins In my case it would be appreciated, because I can't get OpenCL to work on my system. I was able to get the cuBLAS code in ggml working by changing the Makefile (see the sketch below) and got a ~15x speedup. It only uses the GPU for prompt processing, so I'm not sure whether I did something wrong that prevents it from running during output generation.
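
I haven't quoted my exact Makefile change, but the general shape of it, assuming llama.cpp-style variable names rather than whatever koboldcpp's Makefile actually uses, is:

# sketch: wire an LLAMA_CUBLAS switch through to the compiler and linker;
# variable names follow llama.cpp's Makefile and may not match koboldcpp's
ifdef LLAMA_CUBLAS
    CFLAGS  += -DGGML_USE_CUBLAS -I/usr/local/cuda/include
    LDFLAGS += -L/usr/local/cuda/lib64 -lcublas -lcudart
endif

On generation staying on the CPU: as far as I can tell, the cuBLAS path in ggml at this point only accelerates the large batched matrix multiplications of prompt processing, so that part is expected rather than a misconfiguration.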

LostRuins commented 1 year ago

CLBlast should now give speeds pretty comparable to cuBLAS, with all the recent improvements added in. If you're on Windows, it should work out of the box. On Linux, you will need to run make LLAMA_CLBLAST=1.

When running, use the flag --useclblast [device] [platform]; you might have to do some trial and error with the device/platform values to determine which combination selects the right GPU.
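
As a sketch of a typical launch (the model filename is a placeholder, and the right pair of indices depends on your machine; clinfo can list the OpenCL platforms and devices worth trying):

# enumerate OpenCL platforms/devices, then launch with a chosen pair
clinfo -l
python koboldcpp.py model.ggml --useclblast 0 0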