PromtEngineer / localGPT

Chat with your documents on your local device using GPT models. No data leaves your device, and it is 100% private.

How to speed up answer time? #537

Open · boral opened this issue 11 months ago

boral commented 11 months ago

It takes almost 10 minutes to get an answer. How can I speed it up? I am using the US Constitution file as a demo.

Jane1702 commented 11 months ago

Which LLM model did you use? For me, llama-7b takes 7 minutes and llama-13b takes 10 minutes, while llama-70b never produced an answer (I waited for more than 2 hours).

boral commented 11 months ago

I am using TheBloke/Llama-2-7b-Chat-GGUF. Isn't 7 minutes long for an answer? I am running on cuda, but it doesn't seem to have any effect.
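A quick way to check whether llama-cpp-python can see the GPU at all is the sketch below. It assumes a recent llama-cpp-python release (llama_supports_gpu_offload is not present in very old versions), and torch is used only to confirm that the CUDA driver itself is visible:

# Sketch: check that CUDA is visible and that llama-cpp-python was built
# with GPU offload support (assumes a recent llama-cpp-python release).
import torch
from llama_cpp import llama_cpp

print("CUDA visible to PyTorch:", torch.cuda.is_available())
print("llama-cpp-python supports GPU offload:", llama_cpp.llama_supports_gpu_offload())

If the second line prints False, the package was built without CUDA support and every layer runs on the CPU, which would explain why the cuda flag has no effect.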

nabeelbukhari commented 11 months ago

@boral I had the same experience on an RTX 3070 Ti. After digging into the code and trying some test scripts, I found out that llama-cpp-python was not using the GPU correctly.

I created the batch script below to reinstall it with a fresh CUDA-enabled build, and it works fine now (see the verification sketch after the script).

:: Remove the existing (CPU-only) build of llama-cpp-python
pip uninstall -y llama-cpp-python
:: Tell CMake to compile llama.cpp with cuBLAS (CUDA) support
set LLAMA_CUBLAS=1
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
:: Rebuild from source, bypassing any cached CPU-only wheel
pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose
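After the reinstall, loading a model with verbose logging is a quick way to confirm that layers are actually offloaded (a minimal sketch; the model path is a placeholder for your local GGUF file, and the exact log wording varies across llama.cpp versions):

# Sketch: confirm the rebuilt llama-cpp-python offloads layers to the GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers to the GPU
    verbose=True,     # startup log should report layers offloaded to CUDA
)
out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])

If the startup log shows no offloaded layers, the build still does not have CUDA enabled.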

anand-shubham commented 11 months ago

[screenshot of an error message, not legible in this transcript]

I got the error shown above. Did you encounter this error, or is there any fix for it?

nidhi-chipre commented 8 months ago

I am using the TheBloke/WizardLM-7B-uncensored-GPTQ model with a Tesla T4 GPU on a Linux VM. It responds within 10 to 30 seconds for small questions like "what is the name of the document?", but a question like "summarize the document" takes 2 to 3 minutes. How can I increase the speed?
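Summarization-style questions are slower mainly because they pull more retrieved context into the prompt and generate far more output tokens. One common mitigation (a sketch, not localGPT's exact code; the directory, embedding model, and k value follow the LangChain APIs that localGPT builds on and are assumptions here) is to retrieve fewer chunks per query:

# Sketch: retrieve fewer chunks so each query processes less context.
# Paths and model names are placeholders, not localGPT's exact config.
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import Chroma

embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large")
db = Chroma(persist_directory="DB", embedding_function=embeddings)

# k=2: only the two most relevant chunks go into the prompt, trading some
# answer completeness for speed.
retriever = db.as_retriever(search_kwargs={"k": 2})

Capping the maximum number of generated tokens on the LLM side helps in the same way for long-form answers.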