Not sure if anyone else is having this issue; I've tried two different models. It is still able to execute functions, though, which is nice. If anyone has ideas I'm open :)
Should probably specify: using text-generation-webui remotely on an RTX 3060 (12GB VRAM) under Manjaro Linux.
Please use the GBNF grammar for this model: https://github.com/acon96/home-llm/blob/develop/custom_components/llama_conversation/output.gbnf
It will force the model to produce the correct output, even when using quantized versions of the model. You can set it in the Parameters tab in text-generation-webui.
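For anyone loading the model through llama-cpp-python directly instead of the webui, here is a minimal sketch of applying that grammar (the model path and prompt are illustrative assumptions, not part of this project):

```python
# Minimal sketch, assuming llama-cpp-python; model path and prompt are illustrative.
from llama_cpp import Llama, LlamaGrammar

llm = Llama(model_path="home-3b.q4_k_m.gguf")  # hypothetical quantized model file

# Load the GBNF grammar that constrains the output format.
grammar = LlamaGrammar.from_file("output.gbnf")

# Passing the grammar at generation time restricts sampling to tokens
# that keep the output valid under the grammar's rules.
result = llm("turn on the kitchen lights", grammar=grammar, max_tokens=128)
print(result["choices"][0]["text"])
```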
> Should probably specify: using text-generation-webui remotely on an RTX 3060 (12GB VRAM) under Manjaro Linux.
I have that exact setup, and I only get that problem if I try a different, larger model or if I change the prompt. So I use characters in TextGen Webui and don't touch the default prompt or model.
> Please use the GBNF grammar for this model: https://github.com/acon96/home-llm/blob/develop/custom_components/llama_conversation/output.gbnf
> It will force the model to produce the correct output, even when using quantized versions of the model. You can set it in the Parameters tab in text-generation-webui.
The issue is slightly better; it likes to ramble, though. Not sure if that's something in my config causing it.
Did you figure this out? I have the same issue.
Unfortunately not @xnaron
Did you try un-exposing all devices then re-exposing a few to test?
This may have been an issue with llama-cpp-python not handling sampling parameters properly. The version I just pushed a build for (v0.2.42) should have this fixed, and the develop branch of text-generation-webui should have the fixed version now.
> Did you try un-exposing all devices then re-exposing a few to test?
Yes, I did that after posting and it worked. I removed all the exposed devices and have been adding them back a few at a time. I'm not sure if having hundreds of exposed devices was causing some type of context-length issue.
> Did you try un-exposing all devices then re-exposing a few to test?
>
> Yes, I did that after posting and it worked. I removed all the exposed devices and have been adding them back a few at a time. I'm not sure if having hundreds of exposed devices was causing some type of context-length issue.
That makes sense. I believe most backends silently truncate the prompt once it gets too long for the model. Once that happens, the model no longer has the instructions from the start of the system prompt, so it just outputs gibberish.

There were a few features added recently to restrict the number of chat turns to avoid going over the context length, but in this case it's the system prompt that is too long, so that won't help.

I think there is a way to tokenize the prompt in advance to see if it will be too big and warn the user. That, or enforcing a max number of exposed devices that the model can support.
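Something like this, roughly (a minimal sketch, assuming llama-cpp-python; the model path, context size, and reserve value are illustrative assumptions, not the integration's actual code):

```python
# Minimal sketch, assuming llama-cpp-python; model path, context size,
# and reserve are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(model_path="home-3b.q4_k_m.gguf", n_ctx=2048)

def prompt_fits(prompt: str, n_ctx: int = 2048, reserve: int = 256) -> bool:
    """Warn if the prompt leaves fewer than `reserve` tokens for the reply."""
    n_tokens = len(llm.tokenize(prompt.encode("utf-8")))
    if n_tokens > n_ctx - reserve:
        print(f"Warning: prompt is {n_tokens} tokens; only "
              f"{n_ctx - reserve} fit before the backend truncates it")
        return False
    return True
```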
Any way to get models that support greater numbers of exposed devices? This would be useful for whole-home setups. In my case it's just my bedroom, and it struggles with that.
> Any way to get models that support greater numbers of exposed devices? This would be useful for whole-home setups. In my case it's just my bedroom, and it struggles with that.
You can try to use the Self-Extend feature in llama.cpp. That should theoretically let you scale the model up to 8k tokens. That being said, it will consume a TON of RAM (or VRAM) to calculate the larger context sizes because of the O(n^2) memory requirement for attention. My napkin math says it would take around 55GB of memory to do the 8k context size for the 3B model. I haven't messed with it since the higher VRAM consumption is way too high for my GPU.
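As far as I know, Self-Extend is exposed through llama.cpp's CLI flags (`--grp-attn-n` / `--grp-attn-w`). If you are on llama-cpp-python, a related (but different) approach is linear RoPE scaling; a minimal sketch, with the model path and values as illustrative assumptions:

```python
# Minimal sketch, assuming llama-cpp-python; this uses linear RoPE scaling,
# not Self-Extend. Model path and values are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="home-3b.q4_k_m.gguf",  # hypothetical quantized model file
    n_ctx=8192,            # target context window
    rope_freq_scale=0.25,  # native 2048 / target 8192 = 0.25
)
```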
I'm going to go ahead and close this now that v0.2.12 has support for warning the user when the prompt exceeds the configured context length.