Louis-y-nlp opened this issue 1 year ago
When I repeated the steps above, I found that the "./bin/mpt" process occupied 429 MiB of GPU memory. However, during runtime GPU utilization remained at 0%, and the speed was the same as on the CPU, about 500 ms per token.
I'm seeing the same behavior
The tensors need to be offloaded to the GPU.
You can look at llama.cpp for a demo of how to do it.
In the future, we will try to make this more seamless.
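For anyone who wants to experiment in the meantime, here is a minimal sketch of what offloading a tensor to the GPU looks like with ggml's backend API. It is not the mpt example's actual code: `ggml_backend_cuda_init`, `ggml_backend_alloc_ctx_tensors` and friends are taken from recent ggml revisions, and the exact names may differ in older ones.

```cpp
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cuda.h"

#include <cstdio>
#include <vector>

int main() {
    // Metadata-only context (no_alloc = true): tensor structs live here,
    // but the tensor data will be allocated on the CUDA backend below.
    struct ggml_init_params ip = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 16,
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(ip);

    // A single weight matrix as a stand-in for a model tensor.
    struct ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 1024, 1024);

    // Initialize the CUDA backend on device 0 and allocate the context's
    // tensors in GPU memory.
    ggml_backend_t cuda = ggml_backend_cuda_init(0);
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, cuda);

    // Copy the weights from host memory into the GPU buffer.
    std::vector<float> host(1024 * 1024, 0.0f);
    ggml_backend_tensor_set(w, host.data(), 0, ggml_nbytes(w));

    printf("tensor allocated in buffer: %s\n", ggml_backend_buffer_name(buf));

    ggml_backend_buffer_free(buf);
    ggml_backend_free(cuda);
    ggml_free(ctx);
    return 0;
}
```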
I also had some frustration with GPU support and couldn't figure out why it didn't seem to do anything with the GPU, aside from consuming a small amount of VRAM each run.
Looking closer at the source, it turns out that most model handlers do not actually have any CUDA support: the CUDA code is built but never linked in, and -ngl on the command line is accepted but completely ignored.
Is there a roadmap for adding proper support? At the moment the only handler that seems to provide CUDA support is starcoder.
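For what it's worth, this is roughly the shape a fix would have to take. A hypothetical sketch (the struct, field, and size names are invented for illustration, not taken from the mpt example) of routing the first n_gpu_layers layers to the CUDA backend and leaving the rest on the CPU backend:

```cpp
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cuda.h"

#include <vector>

// Invented stand-in for a transformer layer's weights; the real mpt example
// uses different structs and many more tensors per layer.
struct layer_weights {
    struct ggml_tensor * wqkv;
    struct ggml_tensor * wo;
};

int main() {
    const int n_layer      = 8;    // hypothetical model depth
    const int n_gpu_layers = 4;    // what -ngl is supposed to control
    const int n_embd       = 1024; // hypothetical embedding size

    ggml_backend_t cpu  = ggml_backend_cpu_init();
    ggml_backend_t cuda = ggml_backend_cuda_init(0);

    // Two metadata-only contexts: tensors created in ctx_cuda will be
    // allocated on the GPU, tensors created in ctx_cpu stay in host memory.
    struct ggml_init_params ip = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 4 * n_layer,
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx_cpu  = ggml_init(ip);
    struct ggml_context * ctx_cuda = ggml_init(ip);

    std::vector<layer_weights> layers(n_layer);
    for (int i = 0; i < n_layer; ++i) {
        // This is the decision -ngl should drive: route the first
        // n_gpu_layers layers to the GPU context, the rest to the CPU one.
        struct ggml_context * ctx = (i < n_gpu_layers) ? ctx_cuda : ctx_cpu;
        layers[i].wqkv = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, n_embd, 3*n_embd);
        layers[i].wo   = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, n_embd, n_embd);
    }

    // Allocate the actual buffers on the backend each context was routed to.
    ggml_backend_buffer_t buf_gpu = ggml_backend_alloc_ctx_tensors(ctx_cuda, cuda);
    ggml_backend_buffer_t buf_cpu = ggml_backend_alloc_ctx_tensors(ctx_cpu,  cpu);

    // ... load weights with ggml_backend_tensor_set(), build the graph, and
    // evaluate it with a scheduler that knows about both backends ...

    ggml_backend_buffer_free(buf_gpu);
    ggml_backend_buffer_free(buf_cpu);
    ggml_backend_free(cuda);
    ggml_backend_free(cpu);
    ggml_free(ctx_cuda);
    ggml_free(ctx_cpu);
    return 0;
}
```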
Thanks for your great work. I'm running an MPT model on an NVIDIA V100 GPU. I think the compilation process went well, but the GPU is not utilized during inference. Here is what I got:
then
When I run it, I get this output:
During runtime, I checked repeatedly and found that the GPU was not utilized at all. If I accidentally missed something, please let me know.
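In case it helps with debugging, one thing worth checking first is whether the binary actually has CUDA compiled in and can see the V100 at all. A minimal check, assuming a recent ggml where ggml-cuda.h exposes ggml_backend_cuda_get_device_count() and ggml_backend_cuda_get_device_description() (older revisions may not have them):

```cpp
#include "ggml-cuda.h"

#include <cstdio>

int main() {
    // If this prints 0 devices (or the program fails to link against the
    // CUDA-enabled ggml build), the examples cannot offload anything,
    // regardless of which flags are passed.
    int n = ggml_backend_cuda_get_device_count();
    printf("CUDA devices visible to ggml: %d\n", n);

    for (int i = 0; i < n; ++i) {
        char desc[256];
        ggml_backend_cuda_get_device_description(i, desc, sizeof(desc));
        printf("  device %d: %s\n", i, desc);
    }
    return 0;
}
```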