Leon-Sander / local_multimodal_ai_chat

GNU General Public License v3.0
101 stars · 66 forks

Response time too slow - GPU is not being used #12

Open kpratik41 opened 4 months ago

kpratik41 commented 4 months ago

Hello,

Is this code written to run only on the CPU? I don't think the GPU is being used, and the response time is very slow.

If it is written to run on the CPU for now, can you suggest the changes (device, gpu_layers) I would need to make to run it on the GPU?

kpratik41 commented 4 months ago

I saw your response to other questions and figured out that Mistral has 32 layers (set gpu_layers=32). Let me first investigate what other parameters need to be changed to make it run on the GPU. If I still have questions, I will get back to you. Thanks.
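For anyone following along, the offload setting is the `gpu_layers` argument that ctransformers accepts when loading a GGUF model. A minimal sketch (the model repo and file names below are illustrative placeholders, not necessarily what this project's config uses):

```python
def create_llm(gpu_layers=32):
    """Load a GGUF Mistral model via ctransformers, offloading
    `gpu_layers` transformer layers to the GPU (0 = CPU only)."""
    from ctransformers import AutoModelForCausalLM  # lazy import

    return AutoModelForCausalLM.from_pretrained(
        "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",          # illustrative repo id
        model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",  # illustrative file
        model_type="mistral",
        gpu_layers=gpu_layers,
    )
```

Setting `gpu_layers=0` keeps everything on the CPU; Mistral-7B has 32 transformer layers, so 32 offloads the whole model if it fits in VRAM.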

JTMarsh556 commented 3 months ago

I set the GPU layers to 32 (gpu_layers=32) and I get CUDA error 700: an illegal memory access was encountered. It never loads the prompt.

Update: Alright, I played with the layer count a bit. The illegal memory access comes and goes; I need to look into what exactly is going on there. I found that I couldn't support 32 layers: I was maxing out VRAM on the initial startup. The most I got was 24 layers, but it wasn't consistent, and I have had illegal memory access terminate my sessions at 12 layers.

Update 2: I couldn't get it stable with even 4 layers, so I set it back to zero. Is the memory ever getting released?
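The trial-and-error above (32, 24, 12, 4 layers against available VRAM) can be turned into a rough starting-point heuristic. This sketch is purely illustrative; the per-layer and overhead sizes are assumptions, not measured values for any particular quantization:

```python
def max_offload_layers(free_vram_mb, total_layers=32,
                       per_layer_mb=220, overhead_mb=1500):
    """Rough heuristic: reserve `overhead_mb` for the CUDA context and
    KV cache, then count how many ~`per_layer_mb` layers fit in the
    remaining free VRAM. All sizes here are assumptions."""
    usable = free_vram_mb - overhead_mb
    if usable <= 0:
        return 0
    return min(total_layers, usable // per_layer_mb)
```

The point is simply to start below the theoretical maximum and leave headroom, since other applications using the GPU shrink the VRAM actually available to the model.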

Shmoji commented 3 months ago

Setting gpu_layers=32, or anything other than 0, causes errors for me too. I cannot get the GPU to be used.

Leon-Sander commented 3 months ago

Is the memory ever getting released?

If it is running in a stable manner, the model stays in memory. You can click the clear-cache button to release the model; it also usually gets released if the code crashes. You can easily verify this by checking nvidia-smi on Linux or the resource monitor on Windows. How much VRAM do you have available?
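For reference, the nvidia-smi check can also be done from Python by shelling out to the tool. A small sketch that returns None on machines without an NVIDIA driver:

```python
import subprocess

def vram_used_mb():
    """Return VRAM currently in use (MiB) on the first GPU via
    nvidia-smi, or None if nvidia-smi is unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    # One line per GPU; report the first device.
    return int(out.stdout.strip().splitlines()[0])
```

Calling this before and after clicking the clear-cache button shows whether the model was actually released.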

setting gpu_layers=32 or anything other than 0 causes errors for me too. I cannot get the GPU to be used

Installing the CUDA build might help: pip install ctransformers[cuda]. Which errors are you encountering?

JTMarsh556 commented 3 months ago

Unfortunately I only have 16 GB, but I can get GPU offload on the same models in PrivateGPT and LM Studio, so while I would like to have more, I think there is a way to manage it. I just don't know enough yet to make it happen.

Leon-Sander commented 3 months ago

Well I have 8GB and it works on my end.

JTMarsh556 commented 3 months ago

No doubt in my mind it is something on my end. I just installed ctransformers[cuda] as you suggested, and I didn't have nvidia-cublas-cu12, so that may have played a part in it. I will follow up after I have had a chance to determine whether that fixed the errors I was seeing. Thank you for the suggestion.

JTMarsh556 commented 3 months ago

Seems to be working a lot better. I started with 16 layers and it has made it through what was tanking me before. I'll work my way up and report back if I run into any of the same issues. Thank you, Leon.

Leon-Sander commented 3 months ago

You're welcome. Which OS are you actually on?

JTMarsh556 commented 3 months ago

Windows at the moment, but I will eventually clone/rebuild this on a Debian system. I really appreciate you putting all of this together and taking the time to produce the YT video in a code-along structure. This is the densest yet most digestible project I have come across, and I have learned a lot in a very short amount of time.

Leon-Sander commented 3 months ago

I also have problems on Windows but have not been able to pin down exactly where the issue is.

I really appreciate you putting all of this together and taking the time to produce the YT video in a code-along structure. This is the densest yet most digestible project I have come across, and I have learned a lot in a very short amount of time.

Thank you, glad I could help.

Shmoji commented 3 months ago

After doing pip install ctransformers[cuda] it worked in Windows cmd for a tiny PDF. It still didn't work in WSL, though. It is also surprisingly slow for a tiny PDF and only utilizes around 30% of the GPU with gpu_layers set to 50. I have 16 GB VRAM and 128 GB RAM.

Also wanted to note: I'm currently trying to load a short book as a PDF, but it has been loading for over 30 minutes with no success. It's only 8 MB.

But the tiny PDF does work! Thanks!
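One possible explanation for the 8 MB book hanging, though this is speculation, is that the entire extracted text is handed to the embedding step at once. The usual workaround is to split the text into overlapping chunks before embedding. A sketch with illustrative defaults, not the project's actual settings:

```python
def split_text(text, chunk_size=1000, overlap=100):
    """Split extracted PDF text into overlapping chunks so each
    embedding call stays small. Sizes are illustrative defaults."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping overlap
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighboring chunks.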

DhruvDhabalia commented 3 months ago

I ran pip install ctransformers[cuda] and also set gpu_layers to 24. I am running it on my dedicated GPU, an RTX 4050, and I am still getting a slow response time.

Leon-Sander commented 3 months ago

@DhruvDhabalia are you on Windows? For some reason I also have a very slow response time on Windows; more investigation is necessary to find the problem area. No problem on Linux, though.

DhruvDhabalia commented 3 months ago

@DhruvDhabalia are you on Windows? For some reason I also have a very slow response time on Windows; more investigation is necessary to find the problem area. No problem on Linux, though.

Yes, I am on Windows. For some reason it seems like the layers and the dGPU have no effect on the model's speed; possibly something in the code is restricting everything on Windows. Update: I am running profiling tools to find the hiccup.

DhruvDhabalia commented 3 months ago

I ran the profiler. Most of the time is consumed by main() in app.py and run() in llm_chains.py. I don't know if this will help, but either way I am attaching a screenshot of the output: Screenshot 2024-03-27 174009
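For anyone else profiling this, the standard-library cProfile/pstats pair is enough to see where functions like main() and run() spend their time. A small self-contained helper:

```python
import cProfile
import io
import pstats

def profile_call(fn, *args, **kwargs):
    """Run `fn` under cProfile and return (result, report), where the
    report is the stats table sorted by cumulative time."""
    prof = cProfile.Profile()
    result = prof.runcall(fn, *args, **kwargs)
    buf = io.StringIO()
    pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(10)
    return result, buf.getvalue()

# Usage sketch: wrap the slow call, e.g. profile_call(main)
# result, report = profile_call(sum, range(1000))
```

Sorting by cumulative time surfaces the call chain responsible for the wall-clock delay, which is what matters for a "response time too slow" symptom.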

ItsIgnis commented 1 week ago

After doing pip install ctransformers[cuda] it worked in Windows cmd for a tiny PDF. It still didn't work in WSL, though. It is also surprisingly slow for a tiny PDF and only utilizes around 30% of the GPU with gpu_layers set to 50. I have 16 GB VRAM and 128 GB RAM.

Also wanted to note: I'm currently trying to load a short book as a PDF, but it has been loading for over 30 minutes with no success. It's only 8 MB.

But the tiny PDF does work! Thanks!

I'm also facing a similar issue. Did this problem ever get resolved? Loading PDF books still does not utilize the full GPU.