[BUG]: Huggingface response truncated

gcleaves commented 3 weeks ago

How are you running AnythingLLM?

Docker (remote machine)

What happened?

I've embedded the chat widget in a web page. When asking a question via the chat embed, the response is truncated. When asking the same question within the AnywhereLLM workspace, the response is not truncated.

I got 374 characters in the embed widget.

Are there known steps to reproduce?

No response

timothycarambat commented 3 weeks ago

We will need more information on this.

Do you see any client disconnections from the chat embed widget from the network inspector
Does the message saved in the chat embed history show the truncated amount as well - or is it fully formed?
What provider and model are you using?
Do you have associated logs from the Docker container as well (if errors persist)

gcleaves commented 3 weeks ago

My reference to the embedded chat widget was a red herring. (I've change the title of the issue.) Even when I use the standard interface to dockerized AnythingLLM the response is truncated. My desktop install running on my Mac M1, on the other hand, does NOT truncate the data.

Docker instance
- LLM provider is Hugginface with dedicated inference endpoint running mistral-7b-instruct-v0-3-byo.
  - GPU · Nvidia A10G · 1x GPU · 24 GB
  - Quantization: none
  - Max Input Length (per Query): 2048
  - Max Number of Tokens (per Query): 4096 | Doubled to 8192 but no change in truncation.
  - Max Batch Prefill Tokens: 8192
- Default vector DB
  - Max Context Snippets: 4
- Default embedder
Local instance
- LLM provider is the included AnythingLLM provider running Mistral7B.
- Default vector DB
  - Max Context Snippets: 4
- Default embedder

In both instances I have uploaded 8 word documents which contain the answer to my truncated response.

I don't understand much about LLMs (shocker). How do the document embeddings created by AnythingLLM get to the model? Are they included somehow in the prompt? Might my document embeddings not be reaching the Huggingface model in full? Although it seems more like a token response limit that is hampering my Docker instance.

gcleaves commented 3 weeks ago

This looks to be a Huggingface problem:

https://discuss.huggingface.co/t/text-generation-response-truncation/53155 https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3/discussions/23

There's reference to a max_new_tokens parameter which may or may not solve the problem.

timothycarambat commented 3 weeks ago

Ah, that makes more sense. max_tokens or max_new_tokens should resolve the truncation, there is little or no documentation I can find currently that lets us pass like -1 to that property to allow as many tokens out as needed.

We may have to make an input for this so that we can easily have this property controlled. That being said it is very weird the desktop does not truncate.

Does truncation happen on an empty workspace on the first message? There is usually a max_tokens parameter that is the maximum of input+output tokens and that could cause truncation at random points in a conversation since we inject system prompt + context + history - so a seemingly small query can be many more tokens on the backend - thus resulting in truncation.

gcleaves commented 3 weeks ago

Could the reason truncation doesn't occur on desktop be because the desktop uses AnythingLLM/Ollama as the LLM provider, while Docker relies on HF?

timothycarambat commented 3 weeks ago

Could the reason truncation doesn't occur on desktop be because the desktop uses AnythingLLM/Ollama as the LLM provider, while Docker relies on HF?

The desktop simply uses Ollama as the default. But you can use whatever you want as usual. The HF providers are the same on desktop/docker so if you used the same HF credentials on Desktop you should still get the same behavior. Its the same code, but its certainly specific to the HF connector for the LLM and the issue should be present in both Desktop and Docker

Mintplex-Labs / anything-llm