SeargeDP / ComfyUI_Searge_LLM

Custom nodes for ComfyUI that utilize a language model to generate text-to-image prompts
MIT License

Slow - ~5 min per generation #6

Open jnpatrick99 opened 2 weeks ago

jnpatrick99 commented 2 weeks ago

Any ideas why it could be this slow? For example, I'm using KoboldCPP with the same Mistral model and it answers almost immediately, in real time, much like ChatGPT (I have an RTX 4090), and it also starts in about 15 seconds. But with Searge_LLM each generation takes 5 minutes, even on consecutive calls when the model is already loaded. There are no errors and nothing in the output; I'm on Windows. Thanks!

SeargeDP commented 2 weeks ago

Make sure llama-cpp is compiled with CUDA support if you are installing it manually; a web search will turn up instructions for doing that. By default, the requirements.txt file specifies a pre-compiled build of llama-cpp with CUDA support.

I ran into the same problem during testing when I installed llama-cpp manually and didn't realize that by default it compiles for CPU only, not for CUDA.
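
If you want to confirm which build you ended up with, here is a minimal sanity check from Python. It assumes the package is importable as llama_cpp and that your llama-cpp-python version exposes llama_supports_gpu_offload(), which recent releases do:

```python
# Minimal sanity check (assumptions: package importable as llama_cpp, and
# llama_supports_gpu_offload() available in your llama-cpp-python version).
# If this prints False, the wheel was built for CPU only and generation
# will be very slow.
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())
```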

jnpatrick99 commented 2 weeks ago

> Make sure llama-cpp is compiled with CUDA support if you are installing it manually; a web search will turn up instructions for doing that. By default, the requirements.txt file specifies a pre-compiled build of llama-cpp with CUDA support.
>
> I ran into the same problem during testing when I installed llama-cpp manually and didn't realize that by default it compiles for CPU only, not for CUDA.

I tried installing your precompiled version from requirements.txt and now the node doesn't start:

raise RuntimeError(f"Failed to load shared library '{_lib_path}': {e}")
RuntimeError: Failed to load shared library '...\Lib\site-packages\llama_cpp_cuda\lib\llama.dll': [WinError 127] The specified procedure could not be found

jnpatrick99 commented 2 weeks ago

Actually the problem somehow fixed itself after 2-3 restarts of ComfyUI.

It is better now, thanks for the advice! The speed improved from 5 minutes to about 2 minutes per answer (by the way, I'm using Mistral Nemo 12B with a context window of 8192 on an RTX 4090 with 24GB VRAM). But KoboldCPP is still much faster: 15-20 seconds at most, even for long answers.

Also, I frequently get a CUDA out-of-memory error; the node just freezes and then Python crashes. Is there a way to clean up memory, and also to specify the context window (as far as I can see it's hardcoded to 2048)? Thanks again!
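
For reference, here is a minimal sketch of how llama-cpp-python exposes the context window and GPU offload when the model is created, and how the model can be released between runs. The model path is a placeholder, the parameter values are examples only, and the node itself may manage its Llama instance differently:

```python
# Minimal sketch (not the node's actual code): context window, GPU offload,
# and freeing the model between runs with llama-cpp-python.
import gc
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/Mistral-Nemo-12B.gguf",  # placeholder path
    n_ctx=8192,        # context window; the node reportedly hardcodes 2048
    n_gpu_layers=-1,   # offload all layers to the GPU
)

out = llm.create_completion("Describe a misty mountain lake at dawn.", max_tokens=256)
print(out["choices"][0]["text"])

# Release VRAM before the next run to reduce the chance of CUDA out-of-memory errors.
del llm
gc.collect()
```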

SeargeDP commented 1 week ago

After reading what you're trying to set up, I'm not sure this is the right node for your use case. You don't really gain much from a larger model like a 12B, or from a longer context window, when it comes to creating or improving prompts.

And 2 minutes on a 4090 sounds like something is still not working correctly; the node takes 3-4 seconds to run on my RTX 4080 with 16GB.

It seems you need a more flexible LLM integration. Take a look at https://github.com/Big-Idea-Technology/ComfyUI_LLM_Node, the node I used as a base to create this one; it's probably closer to your use case.