Open · 51-matt opened this issue 11 months ago
Hi, I have an issue related to GPU acceleration. When I run Command 1 below, the GPU is used, but it is not used with Command 2.

Command1 : ./main -m /MYPATH/ggml-model-q4_0.bin --color -p "MYQUESTION" -n 256 -ngl 45 --in-prefix
Result1 : blas=1 (80 tokens/s)

However, when I use the LlamaCpp model with GPU acceleration enabled, it runs much slower.

Command2 :
    # imports assume LangChain's LlamaCpp wrapper
    from langchain.callbacks.manager import CallbackManager
    from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
    from langchain.llms import LlamaCpp

    callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
    llm = LlamaCpp(
        model_path="/MYPATH/ggml-model-q4_0.bin",
        n_gpu_layers=45,
        n_batch=512,
        max_length=1024,
        n_ctx=1024,
        callback_manager=callback_manager,
        verbose=True,
    )
    llm.predict("MYQUESTION")
Result2 : blas=0 (7.8 tokens/s)

I'm wondering what might cause this discrepancy. Can you help me identify and correct the issue?
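For reference, Command2 can also be reproduced without LangChain by calling llama-cpp-python directly; the sketch below is a minimal version of that (the model path and prompt are the same placeholders as above, and n_gpu_layers mirrors Command1's -ngl 45), which helps tell apart a problem in the wrapper from a problem in the underlying llama-cpp-python build:

```python
# Minimal direct use of llama-cpp-python, bypassing the LangChain wrapper.
# "/MYPATH/..." and "MYQUESTION" are the placeholders from the report above.
from llama_cpp import Llama

llm = Llama(
    model_path="/MYPATH/ggml-model-q4_0.bin",
    n_gpu_layers=45,   # same offload setting as Command1's -ngl 45
    n_ctx=1024,
    n_batch=512,
    verbose=True,      # prints the startup banner, including the BLAS flag
)

out = llm("MYQUESTION", max_tokens=256)
print(out["choices"][0]["text"])
```

If this direct call is also slow and the startup banner shows BLAS = 0, the problem is in the installed llama-cpp-python package rather than in LangChain.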
Check https://github.com/abetlen/llama-cpp-python/issues/695#issuecomment-1869176032. I had the same problem: llama-cpp-python doesn't use the GPU. You can check whether it is actually being used with nvtop.
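As a concrete version of that check, the sketch below (a rough helper of my own, not something taken from the linked comment) captures the C-level startup log that llama.cpp prints when verbose=True and looks for the BLAS flag:

```python
# Rough check of whether the installed llama-cpp-python build can use the GPU:
# capture the C-level stderr written during model load and inspect the banner.
import os
import sys
import tempfile

from llama_cpp import Llama

def load_and_capture_log(model_path: str, n_gpu_layers: int = 45) -> str:
    """Load the model with verbose=True and return llama.cpp's startup log."""
    log_file = tempfile.TemporaryFile(mode="w+")
    stderr_fd = sys.stderr.fileno()
    saved_fd = os.dup(stderr_fd)            # keep the original stderr
    os.dup2(log_file.fileno(), stderr_fd)   # route C-level stderr into the temp file
    try:
        Llama(model_path=model_path, n_gpu_layers=n_gpu_layers, verbose=True)
    finally:
        os.dup2(saved_fd, stderr_fd)        # restore stderr
        os.close(saved_fd)
    log_file.seek(0)
    return log_file.read()

# Placeholder path from the report above.
log = load_and_capture_log("/MYPATH/ggml-model-q4_0.bin")
# The banner contains "BLAS = 1" when the library was built against a BLAS/GPU
# backend and "BLAS = 0" for a plain CPU wheel.
print("BLAS/GPU build detected" if "BLAS = 1" in log else "CPU-only build (BLAS = 0)")
```

If it reports a CPU-only build, the usual remedy for that era of llama-cpp-python was to force a source reinstall with the CUDA backend enabled, e.g. CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --force-reinstall --no-cache-dir llama-cpp-python, and then confirm GPU usage with nvtop as suggested above.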