Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, and more.
https://anythingllm.com
MIT License

[BUG]: Error: 1 documents failed to add. #1291

Closed · 3SMMZRjWgS closed this issue 6 months ago

3SMMZRjWgS commented 6 months ago

How are you running AnythingLLM?

AnythingLLM desktop app

What happened?

When I try to embed the attached PDF (a public document, ConEd - Climate Change Vulnerability Study - 20230901.pdf), I receive the following error:

```
Error: 1 documents failed to add.
Invalid argument error: Values length 26624 is less than the length (1024) multiplied by the value size (1024) for FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 1024)
```

Similar errors show up on other large (>150 pages) PDFs I've tested. One apparent workaround is to manually split a PDF into smaller (~50-page) PDFs. Is this error caused by my embedding model settings in the AnythingLLM app, or by something else?
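For what it's worth, the numbers in the error decode cleanly: 26624 = 26 × 1024, i.e. the batch delivered only 26 complete 1024-dimensional vectors' worth of floats, fewer than the FixedSizeList column expected for its row count. My guess (an assumption on my part, not confirmed from the code) is that some chunks came back from the embedder empty or short. A minimal Python sketch of the kind of pre-insert check that would surface the failing chunks (all names here are hypothetical):

```python
EXPECTED_DIM = 1024  # mxbai-embed-large outputs 1024-dimensional vectors

def check_batch(embeddings: list[list[float]]) -> None:
    """Fail loudly if any chunk's embedding is missing or mis-sized."""
    bad = {i: len(vec) for i, vec in enumerate(embeddings) if len(vec) != EXPECTED_DIM}
    if bad:
        raise ValueError(f"chunks with wrong embedding sizes (index: length): {bad}")
    # The invariant the Arrow error enforces: total floats == rows * dimension.
    assert sum(len(vec) for vec in embeddings) == EXPECTED_DIM * len(embeddings)
```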

Below is a screenshot of the embedding model settings in AnythingLLM:

[Screenshot: embedding model settings in AnythingLLM, 2024-05-05]

I'm running the AnythingLLM Windows Desktop App 1.5.3 on Windows 11 Pro, with Ollama 0.1.33 as the backend for LLMs and embedding models.

Are there known steps to reproduce?

Simply download the attached PDF, load it into AnythingLLM, and try to embed it using mxbai-embed-large provided by Ollama.

RahSwe commented 6 months ago

Does the server have enough RAM?

help4bis commented 6 months ago

Tested the upload on my server; it works fine. Try increasing your token context window.

Chat Model installed: gfg/solar-10b-instruct-v1.0
Token Context Window: 256000

The server has no GPU and 49 GB of memory, so nothing particularly big.
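If it helps to test outside the app: the context window field presumably maps to Ollama's num_ctx option (my assumption; I haven't checked AnythingLLM's source). A quick sketch against the raw Ollama embeddings API, assuming a local server on the default port:

```python
import requests

# Default local Ollama endpoint; adjust host/port if your setup differs.
resp = requests.post("http://localhost:11434/api/embeddings", json={
    "model": "mxbai-embed-large",
    "prompt": "a long chunk of PDF text ...",
    "options": {"num_ctx": 2048},  # larger window, analogous to raising the app setting
})
resp.raise_for_status()
print(len(resp.json()["embedding"]))  # 1024 dimensions regardless of input length
```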

3SMMZRjWgS commented 6 months ago

> Tested the upload on my server; it works fine. Try increasing your token context window.
>
> Chat Model installed: gfg/solar-10b-instruct-v1.0
> Token Context Window: 256000
>
> The server has no GPU and 49 GB of memory, so nothing particularly big.

Thank you for the tip on enlarging the context window. I can confirm that a larger context window appears to have resolved the error. What I don't understand, though: if mxbai-embed-large has a recommended context window of 512, can we lose information during embedding by forcing a much larger window of 256k?

timothycarambat commented 6 months ago

> Thank you for the tip on enlarging the context window. I can confirm that a larger context window appears to have resolved the error. What I don't understand, though: if mxbai-embed-large has a recommended context window of 512, can we lose information during embedding by forcing a much larger window of 256k?

Yes, you should not try to embed chunks longer than the recommended token context window. Some models will allow it, but the vector search afterwards will not perform as you expect: you will get worse results, or information will be lost. We cannot get the context length from the model or the provider, since none of them expose it, so it is important that this field is correct; we can't fill it in for you.
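To make the failure mode concrete, here is a minimal sketch (not from the AnythingLLM codebase; it assumes a local Ollama serving mxbai-embed-large on the default port) comparing the embedding of a long text against the embedding of just its opening. When the model silently truncates at its window, the two vectors come out nearly identical, meaning everything past the window contributed nothing to retrieval:

```python
import math
import requests

OLLAMA = "http://localhost:11434/api/embeddings"  # default local endpoint

def embed(text: str, num_ctx: int = 512) -> list[float]:
    # num_ctx mirrors the app's "Token Context Window"; 512 is mxbai-embed-large's recommendation
    r = requests.post(OLLAMA, json={
        "model": "mxbai-embed-large",
        "prompt": text,
        "options": {"num_ctx": num_ctx},
    })
    r.raise_for_status()
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

opening = "Climate change increases flood risk for coastal substations. " * 8
appendix = "Unrelated appendix text about procurement schedules. " * 400
full, prefix = embed(opening + appendix), embed(opening)
print(f"cosine(full, prefix) = {cosine(full, prefix):.4f}")  # near 1.0 => the appendix was cut off
```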