Not sure if anyone else is having this issue; I've tried two different models. It is still able to execute functions, though, which is nice. If anyone has ideas I'm open :)
Should probably specify: using text-generation-webui remotely on an RTX 3060 (12GB VRAM) under Manjaro Linux.
Please use the GBNF grammar for this model: https://github.com/acon96/home-llm/blob/develop/custom_components/llama_conversation/output.gbnf
It will force the model to produce the correct output, even when using quantized versions of the model. You can set it in the Parameters tab in text-generation-webui.
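For anyone loading the model through llama-cpp-python directly instead of the webui, here is a minimal sketch of applying that grammar (the model path and prompt are illustrative assumptions, not part of this project):

```python
# Minimal sketch, assuming llama-cpp-python; model path and prompt are illustrative.
from llama_cpp import Llama, LlamaGrammar

llm = Llama(model_path="home-3b.q4_k_m.gguf")  # hypothetical quantized model file

# Load the GBNF grammar that constrains the output format.
grammar = LlamaGrammar.from_file("output.gbnf")

# Passing the grammar at generation time restricts sampling to tokens
# that keep the output valid under the grammar's rules.
result = llm("turn on the kitchen lights", grammar=grammar, max_tokens=128)
print(result["choices"][0]["text"])
```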
> Should probably specify: using text-generation-webui remotely on an RTX 3060 (12GB VRAM) under Manjaro Linux.
I have that exact setup, and I only get that problem if I try a different, larger model or if I change the prompt. So I use characters in TextGen Webui and don't touch the default prompt or model.
> Please use the GBNF grammar for this model: https://github.com/acon96/home-llm/blob/develop/custom_components/llama_conversation/output.gbnf
> It will force the model to produce the correct output, even when using quantized versions of the model. You can set it in the Parameters tab in text-generation-webui.
The issue is slightly better; it likes to ramble, though. Not sure if that's something in my config causing it.
Did you figure this out? I have the same issue.
Unfortunately not @xnaron
Did you try un-exposing all devices then re-exposing a few to test?
This may have been an issue with llama-cpp-python not handling sampling parameters properly. The version I just pushed a build for (v0.2.42) should have this fixed, and the develop branch of text-generation-webui should have the fixed version now.
> Did you try un-exposing all devices then re-exposing a few to test?
Yes, I did that after posting and it worked. I removed all the exposed devices and have been adding them back a few at a time. I'm not sure if having hundreds of exposed devices was causing some type of context-length issue.
> Did you try un-exposing all devices then re-exposing a few to test?
>
> Yes, I did that after posting and it worked. I removed all the exposed devices and have been adding them back a few at a time. I'm not sure if having hundreds of exposed devices was causing some type of context-length issue.
That makes sense. I believe most backends silently truncate the prompt once it gets too long for the model. Once that happens, the model no longer has the instructions from the start of the system prompt, so it just outputs gibberish.

There were a few features added recently to restrict the number of chat turns to avoid going over the context length, but in this case it's the system prompt that is too long, so that won't help.

I think there is a way to tokenize the prompt in advance to see if it will be too big and warn the user. That, or enforcing a max number of exposed devices that the model can support.
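Something like this, roughly (a minimal sketch, assuming llama-cpp-python; the model path, context size, and reserve value are illustrative assumptions, not the integration's actual code):

```python
# Minimal sketch, assuming llama-cpp-python; model path, context size,
# and reserve are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(model_path="home-3b.q4_k_m.gguf", n_ctx=2048)

def prompt_fits(prompt: str, n_ctx: int = 2048, reserve: int = 256) -> bool:
    """Warn if the prompt leaves fewer than `reserve` tokens for the reply."""
    n_tokens = len(llm.tokenize(prompt.encode("utf-8")))
    if n_tokens > n_ctx - reserve:
        print(f"Warning: prompt is {n_tokens} tokens; only "
              f"{n_ctx - reserve} fit before the backend truncates it")
        return False
    return True
```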
Any way to get models that support greater numbers of exposed devices? This would be useful for whole-home setups. In my case it's just my bedroom, and it struggles with that.
> Any way to get models that support greater numbers of exposed devices? This would be useful for whole-home setups. In my case it's just my bedroom, and it struggles with that.
You can try to use the Self-Extend feature in llama.cpp. That should theoretically let you scale the model up to 8k tokens. That being said, it will consume a TON of RAM (or VRAM) to calculate the larger context sizes because of the O(n^2) memory requirement for attention. My napkin math says it would take around 55GB of memory to do the 8k context size for the 3B model. I haven't messed with it since the higher VRAM consumption is way too high for my GPU.
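As far as I know, Self-Extend is exposed through llama.cpp's CLI flags (`--grp-attn-n` / `--grp-attn-w`). If you are on llama-cpp-python, a related (but different) approach is linear RoPE scaling; a minimal sketch, with the model path and values as illustrative assumptions:

```python
# Minimal sketch, assuming llama-cpp-python; this uses linear RoPE scaling,
# not Self-Extend. Model path and values are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="home-3b.q4_k_m.gguf",  # hypothetical quantized model file
    n_ctx=8192,            # target context window
    rope_freq_scale=0.25,  # native 2048 / target 8192 = 0.25
)
```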
I'm going to go ahead and close this now that v0.2.12 has support for warning the user when the prompt exceeds the configured context length.