Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, and more.
https://anythingllm.com
MIT License

[BUG]: Huggingface response truncated #2452


gcleaves commented 3 weeks ago

How are you running AnythingLLM?

Docker (remote machine)

What happened?

I've embedded the chat widget in a web page. When asking a question via the chat embed, the response is truncated. When asking the same question within the AnythingLLM workspace, the response is not truncated.

I got 374 characters in the embed widget.

Are there known steps to reproduce?

No response

timothycarambat commented 3 weeks ago

We will need more information on this.

gcleaves commented 3 weeks ago

My reference to the embedded chat widget was a red herring. (I've changed the title of the issue.) Even when I use the standard interface to dockerized AnythingLLM, the response is truncated. My desktop install running on my Mac M1, on the other hand, does NOT truncate the data.

In both instances I have uploaded 8 Word documents that contain the answer to the question whose response gets truncated.

I don't understand much about LLMs (shocker). How do the document embeddings created by AnythingLLM get to the model? Are they included somehow in the prompt? Might my document embeddings not be reaching the Huggingface model in full? Although it seems more like a response token limit is hampering my Docker instance.

gcleaves commented 3 weeks ago

This looks to be a Huggingface problem:

https://discuss.huggingface.co/t/text-generation-response-truncation/53155
https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3/discussions/23

There's reference to a max_new_tokens parameter which may or may not solve the problem.
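For context, the Hugging Face Inference API's text-generation task accepts `max_new_tokens` inside a `parameters` object. Below is a minimal sketch of a direct call with that parameter set explicitly; the model name, token variable, and chosen limit are placeholders for illustration, not what AnythingLLM actually sends:

```typescript
// Sketch: calling the HF Inference API with an explicit max_new_tokens so the
// generation is not cut off at the provider's default output length.
// MODEL and HF_TOKEN are placeholders, not AnythingLLM's real configuration.
const HF_TOKEN = process.env.HF_TOKEN;
const MODEL = "mistralai/Mistral-7B-Instruct-v0.3";

async function generate(prompt: string): Promise<string> {
  const res = await fetch(`https://api-inference.huggingface.co/models/${MODEL}`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${HF_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      inputs: prompt,
      parameters: {
        max_new_tokens: 1024,    // raise this if replies are still cut off
        return_full_text: false, // only return the newly generated text
      },
    }),
  });
  const data = await res.json();
  return data[0]?.generated_text ?? "";
}
```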

timothycarambat commented 3 weeks ago

Ah, that makes more sense. Setting max_tokens or max_new_tokens should resolve the truncation; however, I can't currently find any documentation saying we can pass something like -1 to that property to allow as many output tokens as needed.

We may have to add an input for this so that the property can be controlled easily. That said, it is very odd that the desktop does not truncate.

Does the truncation happen on an empty workspace, on the very first message? There is usually a max_tokens parameter that caps input + output tokens combined, which can cause truncation at seemingly random points in a conversation: since we inject the system prompt + context + history, an apparently small query can be many more tokens on the backend, resulting in truncation.
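To illustrate the combined input + output budget described above (all numbers here are made up; they are not AnythingLLM's real figures):

```typescript
// Rough illustration: the visible query is only a small part of what actually
// reaches the model, so a shared input+output cap leaves little room to reply.
const MAX_TOKENS = 2048;             // assumed combined input + output cap

const systemPromptTokens = 250;      // injected system prompt
const retrievedContextTokens = 1200; // chunks pulled from the embedded documents
const chatHistoryTokens = 400;       // prior turns re-sent with every request
const userQueryTokens = 30;          // the "small" question the user typed

const inputTokens =
  systemPromptTokens + retrievedContextTokens + chatHistoryTokens + userQueryTokens;

// Whatever is left is all the model can emit before being cut off.
const outputBudget = MAX_TOKENS - inputTokens; // 2048 - 1880 = 168 tokens
console.log(`Room left for the reply: ${outputBudget} tokens`);
```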

gcleaves commented 3 weeks ago

Could the reason truncation doesn't occur on desktop be because the desktop uses AnythingLLM/Ollama as the LLM provider, while Docker relies on HF?

timothycarambat commented 3 weeks ago

> Could the reason truncation doesn't occur on desktop be because the desktop uses AnythingLLM/Ollama as the LLM provider, while Docker relies on HF?

The desktop simply ships with Ollama as the default, but you can use whatever provider you want as usual. The HF provider is the same on desktop and Docker, so if you used the same HF credentials on the desktop you should see the same behavior. It's the same code, but the problem is certainly specific to the HF connector for the LLM, and the issue should be present in both Desktop and Docker.