ggerganov / ggml

Tensor library for machine learning
MIT License

Inference starcoder (4bit/8bit) with GPU #417

Open · curname opened 1 year ago

curname commented 1 year ago

First of all, thank you for your work! I used ggml to quantize the starcoder model to 8-bit (and 4-bit), but I ran into difficulties when using the GPU for inference. If you could provide an example, I would be very grateful.
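
For reference, with the stock tools in this repo's examples/starcoder directory, the quantization step and a GPU-offloaded run look roughly like the commands below. The file paths are placeholders, and the trailing number passed to starcoder-quantize selects the ggml ftype (2 = q4_0, 7 = q8_0 in current headers); check the usage printout of your build, since the exact values can change between versions. The run command uses the starcoder-mmap example with -ngl, as in the log posted in the next comment:

# quantize an f16 ggml model (last argument selects the quantization type)
./bin/starcoder-quantize models/starcoder-ggml-f16.bin models/starcoder-ggml-q8_0.bin 7

# run with 20 layers offloaded to the GPU via cuBLAS
./bin/starcoder-mmap -m models/starcoder-ggml-q8_0.bin -ngl 20 -p "def fibonacci(n):"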

johnson442 commented 1 year ago
./bin/starcoder-mmap -m /models/WizardCoder-15B-1.0.ggmlv3.q5_1.bin -ngl 20 -p "def fibonacci(n):"
main: seed = 1690402839
starcoder_model_load: loading model from '/models/WizardCoder-15B-1.0.ggmlv3.q5_1.bin'
starcoder_model_load: n_vocab = 49153
starcoder_model_load: n_ctx   = 8192
starcoder_model_load: n_embd  = 6144
starcoder_model_load: n_head  = 48
starcoder_model_load: n_layer = 40
starcoder_model_load: ftype   = 2009
starcoder_model_load: qntvr   = 2
starcoder_model_load: ggml map size = 13596.73 MB
starcoder_model_load: ggml ctx size =   0.24 MB
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1070 Ti, compute capability 6.1
starcoder_model_load: kv_cache memory size =  7680.00 MB, n_mem = 327680
starcoder_model_load: model size  = 13596.24 MB
starcoder_model_load: [cublas] offloading 20 layers to GPU
starcoder_model_load: [cublas] total VRAM used: 6480 MB
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: prompt: 'def fibonacci(n):'
main: number of tokens in prompt = 6
main: token[0] =    589, def
main: token[1] =  28176,  fib
main: token[2] =  34682, onacci
main: token[3] =     26, (
main: token[4] =     96, n
main: token[5] =    711, ):

Calling starcoder_eval
def fibonacci(n):
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

n = int(input("Enter a positive integer: "))
if n < 0:
    print("Invalid input!")
else:
    print("The", n, "th Fibonacci number is:", fibonacci(n))<|endoftext|>

main: mem per token =   460268 bytes
main:     load time =  2807.24 ms
main:   sample time =    24.13 ms
main:  predict time = 21730.91 ms / 231.18 ms per token
main:    total time = 29279.35 ms
staviq commented 1 year ago

It's not working for me either.

#>cmake -DGGML_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/opt/cuda/bin/nvcc ..
-- The C compiler identification is GNU 12.2.1
-- The CXX compiler identification is GNU 12.2.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.39.1") 
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Linux detected
-- Found CUDAToolkit: /opt/cuda/include (found version "12.2.91") 
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 12.2.91
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /opt/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- GGML CUDA sources found, configuring CUDA architecture
-- x86 detected
-- Linux detected
-- Configuring done
-- Generating done
-- Build files have been written to: /storage/ggml/build

And starcoder doesn't even try using the GPU:

./starcoder -m /storage/models/WizardCoder-15B-1.0.ggmlv3.q8_0.bin -p "sqrt(4)" -ngl 1   
main: seed = 1690477492
starcoder_model_load: loading model from '/storage/models/WizardCoder-15B-1.0.ggmlv3.q8_0.bin'
starcoder_model_load: n_vocab = 49153
starcoder_model_load: n_ctx   = 8192
starcoder_model_load: n_embd  = 6144
starcoder_model_load: n_head  = 48
starcoder_model_load: n_layer = 40
starcoder_model_load: ftype   = 2007
starcoder_model_load: qntvr   = 2
starcoder_model_load: ggml ctx size = 34536.48 MB
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070 SUPER, compute capability 7.5
starcoder_model_load: memory size = 15360.00 MB, n_mem = 327680
starcoder_model_load: model size  = 19176.25 MB
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.

starcoder-mmap does this:

./starcoder-mmap -m /storage/models/WizardCoder-15B-1.0.ggmlv3.q8_0.bin -p "sqrt(4)" -ngl 1
main: seed = 1690477603
starcoder_model_load: loading model from '/storage/models/WizardCoder-15B-1.0.ggmlv3.q8_0.bin'
starcoder_model_load: n_vocab = 49153
starcoder_model_load: n_ctx   = 8192
starcoder_model_load: n_embd  = 6144
starcoder_model_load: n_head  = 48
starcoder_model_load: n_layer = 40
starcoder_model_load: ftype   = 2007
starcoder_model_load: qntvr   = 2
starcoder_model_load: ggml map size = 19176.73 MB
starcoder_model_load: ggml ctx size =   0.24 MB
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070 SUPER, compute capability 7.5
starcoder_model_load: kv_cache memory size =  7680.00 MB, n_mem = 327680
starcoder_model_load: model size  = 19176.25 MB
starcoder_model_load: [cublas] offloading 1 layers to GPU
starcoder_model_load: [cublas] total VRAM used: 459 MB
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: prompt: 'sqrt(4)'
main: number of tokens in prompt = 4
main: token[0] =   8663, sqrt
main: token[1] =     26, (
main: token[2] =     38, 4
main: token[3] =     27, )

Calling starcoder_eval
CUDA error 222 at /storage/ggml/src/ggml-cuda.cu:3509: the provided PTX was compiled with an unsupported toolchain.

EDIT: llama.cpp works just fine for me, though.
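
For what it's worth, CUDA error 222 is CUDA_ERROR_UNSUPPORTED_PTX_VERSION: the installed NVIDIA driver's JIT compiler is older than the CUDA 12.2 toolkit that produced the PTX. Updating the driver, or building native SASS for the card so no JIT step is needed, usually resolves it. A possible configure line for a compute 7.5 GPU, assuming CMake 3.18+ and that the project honors a user-supplied CMAKE_CUDA_ARCHITECTURES:

cmake -DGGML_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/opt/cuda/bin/nvcc -DCMAKE_CUDA_ARCHITECTURES=75 ..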

slaren commented 1 year ago

The CUDA backend requires some changes to the code to do full offloading; otherwise it is only used for the multiplication of large matrices (which generally only happens when evaluating large prompts). It will be easier to use once we implement a common interface for all the backends, but that is going to take a while.

For an example of how to use it, you can look at the llama.cpp source code. In the future, llama.cpp will also be extended to support other LLMs.
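
To make that pointer a bit more concrete, below is a rough sketch of the per-tensor offload pattern that llama.cpp and the starcoder-mmap example used at the time. It assumes the mid-2023 ggml-cuda API (ggml_cuda_transform_tensor() and the GGML_BACKEND_GPU flag); the helper function and variable names are illustrative, not the exact code:

#ifdef GGML_USE_CUBLAS
#include "ggml-cuda.h"

// Mark a set of weight tensors as GPU-resident and upload their data to VRAM.
// Once a tensor lives on the device, ggml's cuBLAS mat-mul reads it from VRAM
// instead of copying it from host memory on every evaluation.
static void offload_weights(struct ggml_tensor ** weights, int n_tensors) {
    for (int i = 0; i < n_tensors; ++i) {
        struct ggml_tensor * t = weights[i];
        t->backend = GGML_BACKEND_GPU;          // flag the tensor as living on the GPU
        ggml_cuda_transform_tensor(t->data, t); // copy the (possibly quantized) data to VRAM
    }
}
#endif

The "[cublas] offloading N layers to GPU" lines in the logs above come from the starcoder-mmap example doing essentially this for the first -ngl layers.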

staviq commented 1 year ago

> The CUDA backend requires some changes to the code to do full offloading; otherwise it is only used for the multiplication of large matrices (which generally only happens when evaluating large prompts). It will be easier to use once we implement a common interface for all the backends, but that is going to take a while.
>
> For an example of how to use it, you can look at the llama.cpp source code. In the future, llama.cpp will also be extended to support other LLMs.

Thank you for the explanation.