
Inference on nvidia gpu #230

Open · Louis-y-nlp opened this issue 1 year ago

Louis-y-nlp commented 1 year ago

Thanks for your great work. I'm running an MPT model on an NVIDIA V100 GPU. I think the compilation went well, but the GPU is not utilized during inference. Here is what I got:

cmake -D CMAKE_C_COMPILER=/usr/local/bin/gcc -D CMAKE_CXX_COMPILER=/usr/local/bin/g++ -DGGML_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc ..
-- The C compiler identification is GNU 8.5.0
-- The CXX compiler identification is GNU 8.5.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/local/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/local/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find Git (missing: GIT_EXECUTABLE) 
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Linux detected
-- Found CUDAToolkit: /usr/local/cuda/include (found version "11.0.221") 
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 11.0.221
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- GGML CUDA sources found, configuring CUDA architecture
-- x86 detected
-- Linux detected
-- Configuring done (3.0s)
-- Generating done (0.1s)
-- Build files have been written to: /home/work/data/codes/ggml/build

Then I built the mpt example:

make -j4 mpt
[ 11%] Building C object src/CMakeFiles/ggml.dir/ggml.c.o
[ 22%] Building CUDA object src/CMakeFiles/ggml.dir/ggml-cuda.cu.o
[ 33%] Building CXX object examples/CMakeFiles/common.dir/common.cpp.o
/home/work/data/codes/ggml/src/ggml.c: In function ‘ggml_compute_forward_win_part_f32’:
/home/work/data/codes/ggml/src/ggml.c:13064:19: warning: unused variable ‘ne3’ [-Wunused-variable]
     const int64_t ne3 = dst->ne[3];
                   ^~~
[ 44%] Linking CUDA static library libggml.a
[ 44%] Built target ggml
[ 55%] Building CXX object examples/CMakeFiles/common-ggml.dir/common-ggml.cpp.o
[ 66%] Linking CXX static library libcommon.a
[ 66%] Built target common
[ 77%] Linking CXX static library libcommon-ggml.a
[ 77%] Built target common-ggml
[ 88%] Building CXX object examples/mpt/CMakeFiles/mpt.dir/main.cpp.o
[100%] Linking CXX executable ../../bin/mpt
[100%] Built target mpt

When I run it, I get this output:

CUDA_VISIBLE_DEVICES=1 ./bin/mpt -m /home/work/mosaicml_mpt-7b-instruct/ggml-model-f16.bin -p "This is an example"
main: seed      = 1685967171
main: n_threads = 4
main: n_batch   = 8
main: n_ctx     = 512
main: n_predict = 200

mpt_model_load: loading model from '/home/work/mosaicml_mpt-7b-instruct/ggml-model-f16.bin' - please wait ...
mpt_model_load: d_model        = 4096
mpt_model_load: max_seq_len    = 2048
mpt_model_load: n_ctx          = 512
mpt_model_load: n_heads        = 32
mpt_model_load: n_layers       = 32
mpt_model_load: n_vocab        = 50432
mpt_model_load: alibi_bias_max = 8.000000
mpt_model_load: clip_qkv       = 0.000000
mpt_model_load: ftype          = 1
mpt_model_load: qntvr          = 0
mpt_model_load: ggml ctx size = 12939.11 MB
mpt_model_load: memory_size =   256.00 MB, n_mem = 16384
mpt_model_load: ........................ done
mpt_model_load: model size = 12683.02 MB / num tensors = 194
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.

main: temp           = 0.800
main: top_k          = 50432
main: top_p          = 1.000
main: repeat_last_n  = 64
main: repeat_penalty = 1.020

main: number of tokens in prompt = 4
main: token[0] =   1552
main: token[1] =    310
main: token[2] =    271
main: token[3] =   1650

This is an example of a three-year warrantyThis product is covered by a three-year warranty.<|endoftext|>

main: sampled tokens =       18
main:  mem per token =   339828 bytes
main:      load time = 11339.55 ms
main:    sample time =   204.11 ms / 11.34 ms per token
main:      eval time = 11117.53 ms / 529.41 ms per token
main:     total time = 29913.26 ms 

While it was running, I checked repeatedly and found that the GPU was not utilized at all. If I missed something, please let me know.

Louis-y-nlp commented 1 year ago

When I repeated the steps above, I found that the ./bin/mpt process occupied 429 MiB of GPU memory. However, GPU utilization stayed at 0% for the whole run, and the speed was the same as on the CPU, about 500 ms per token.

canyonrobins commented 1 year ago

I'm seeing the same behavior.

ggerganov commented 1 year ago

The tensors need to be offloaded to the GPU. You can look at llama.cpp for a demo of how to do it. In the future, we will try to make this more seamless.
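
For context: building with -DGGML_CUBLAS=ON only compiles the CUDA kernels. The weights still live in host memory, so every matmul runs on the CPU until each weight tensor is uploaded and tagged as GPU-resident. Below is a minimal sketch of that offload pattern as llama.cpp did it at the time; ggml_cuda_transform_tensor and GGML_BACKEND_GPU are the names from the ggml-cuda header of that era, so verify them against your checkout.

```c
#include "ggml.h"
#ifdef GGML_USE_CUBLAS
#include "ggml-cuda.h"
#endif

// Sketch, not the actual mpt example code: upload one weight tensor to
// VRAM and tag it so ggml routes its matmuls through the CUDA backend.
static void offload_tensor(struct ggml_tensor * t) {
#ifdef GGML_USE_CUBLAS
    t->backend = GGML_BACKEND_GPU;           // mark tensor as GPU-resident
    ggml_cuda_transform_tensor(t->data, t);  // copy the weights to VRAM
#else
    (void) t;                                // CPU-only build: nothing to do
#endif
}
```

Without this step the cuBLAS context is still created, which explains the small fixed VRAM footprint reported above, but no tensor ever lives on the device and utilization stays at 0%.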

rmc135 commented 11 months ago

I also had some frustration with GPU support and couldn't figure out why it didn't seem to do anything with the GPU, aside from consuming a small amount of VRAM on each run.

Looking closer at the source, it turns out that most model handlers do not actually have any CUDA support: it is built but not linked in, and passing -ngl on the command line is accepted but completely ignored.

Is there a roadmap for adding proper support? At the moment, the only handler that seems to provide CUDA support is starcoder.
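
For reference, wiring -ngl into one of the examples would roughly mean iterating over the per-layer weights after the model loads and offloading the last n_gpu_layers blocks, llama.cpp-style. Here is a hypothetical sketch building on the offload_tensor helper above; mpt_model and the layer field names are illustrative placeholders, not the actual identifiers in examples/mpt/main.cpp.

```c
// Hypothetical -ngl wiring for an MPT-style model. The struct and field
// names (mpt_model, wqkv, wo, mlp_up, mlp_down) are placeholders, not
// the real identifiers in examples/mpt/main.cpp.
static void offload_layers(struct mpt_model * model, int n_layer, int n_gpu_layers) {
    for (int il = 0; il < n_layer; ++il) {
        if (il < n_layer - n_gpu_layers) {
            continue; // keep the earliest layers on the CPU
        }
        offload_tensor(model->layers[il].wqkv);     // attention QKV projection
        offload_tensor(model->layers[il].wo);       // attention output projection
        offload_tensor(model->layers[il].mlp_up);   // feed-forward up projection
        offload_tensor(model->layers[il].mlp_down); // feed-forward down projection
    }
}
```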