[BUG]: Saving and embedding document takes a while and then fails when using the default anythingllm embedder

Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, and more.

https://anythingllm.com

MIT License


[BUG]: Saving and embedding document takes a while and then fails when using the default anythingllm embedder #838

Closed timwillhack closed 6 months ago

timwillhack commented 8 months ago

How are you running AnythingLLM?

AnythingLLM desktop app

What happened?

I'm a new AnythingLLM user. So far I have been able to embed with the OpenAI embedder, but I wanted to try the default model instead.

The document is a simple .txt file of about 700 KB.

Earlier today it didn't work because huggingface was down for maintenance. After it came back I tried it again.

After I click the button to save and embed, it spins for about a minute, says "updating workspace", and then shows this error: Error: 1 documents failed to add.

Invalid argument error: Values length 1278336 is less than the length (1536) multiplied by the value size (1536) for FixedSizeList(Field { name: "item", data_type: Float32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 1536)

Are there known steps to reproduce?

No response

timothycarambat commented 8 months ago

What vector database are you connected to? At first glance, this doesn't look like a LanceDB error.

timwillhack commented 8 months ago

LanceDB vector db. Tried again today and I'm still getting the same error.

timothycarambat commented 8 months ago

If you delete the workspace and try to re-embed, what do you get? The 1536 dimension expectation is odd, because that is the dimension of OpenAI's text-embedding-ada-002. The built-in embedder's dimension is 384. There is no real indication of where 1278336 is coming from, though.
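One possible reading of the numbers in the error message, offered as a sketch rather than a confirmed diagnosis: if the values buffer holds a whole number of 384-dimensional vectors but not a whole number of 1536-dimensional ones, that would be consistent with native-embedder output being written into a table created for 1536-dimensional vectors. The function name `vectors_per_dimension` is hypothetical; the three constants come from the thread.

```python
# Numbers taken from the thread; the interpretation is an assumption.
VALUES_LENGTH = 1278336      # "Values length" reported in the error
NATIVE_EMBEDDER_DIM = 384    # built-in embedder dimension cited above
OPENAI_DIM = 1536            # dimension the table schema expects


def vectors_per_dimension(total_values, candidate_dims):
    """Map each candidate dimension to a whole vector count, or None.

    A dimension maps to None when total_values is not an exact
    multiple of it, i.e. the buffer cannot be a list of such vectors.
    """
    result = {}
    for dim in candidate_dims:
        count, remainder = divmod(total_values, dim)
        result[dim] = count if remainder == 0 else None
    return result


report = vectors_per_dimension(VALUES_LENGTH, [NATIVE_EMBEDDER_DIM, OPENAI_DIM])
for dim, count in report.items():
    if count is None:
        print(f"{VALUES_LENGTH} is not a multiple of {dim}")
    else:
        print(f"{VALUES_LENGTH} = {count} vectors x {dim} dimensions")
```

The value divides evenly by 384 and not by 1536, which fits a mismatch between the embedder in use and the dimension stored for the existing workspace. It does not prove that this is the cause.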

mark3apps commented 8 months ago

Hey, I noticed a similar issue with a 4,000-page PDF I uploaded. On my end, though, the error seems to be just a UI timeout. I was watching the CPU, and it kept processing after the error appeared in the web UI. After a while it finished, and the document showed up in the list of documents available to add as context. So to me this looks more like a timeout that is set too low (if your issue is the same one I was experiencing).

yeasy commented 7 months ago

Saving and embedding a doc is very slow on macOS with the default setup.

I noticed that the anything-llm container's CPU usage is at 100%, and the container cannot use more than 1 CPU.

timothycarambat commented 7 months ago

In general, embedding 20K text chunks on a CPU is out of scope for the default embedder. At that point you should migrate to a dedicated model runner that can offload to a GPU (Ollama, LocalAI, etc.) or use cloud-hosted models (OpenAI).

The native embedder is the default because it requires zero setup. It is not the end-all-be-all and should not be expected to process hundreds of thousands of text chunks in parallel. That is why cloud embedding services exist: at that scale and volume, running embeddings is non-trivial.

For most people, most document sets, and most use cases, the native model works without a hiccup. If you know your pipeline will involve more than a hundred unique docs, AnythingLLM can integrate with those providers so it still works for you.

As a side note, Docker does support multi-cpu. Is the container only using one CPU even with --cpus xx as an arg? I think by default it is 1
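To check whether a CPU limit is actually applied inside the container, one option is to read the cgroup CPU quota. This is a minimal sketch that assumes a Linux container using cgroup v2, where the quota is exposed in /sys/fs/cgroup/cpu.max as "<quota> <period>" or "max <period>". The helper names `parse_cpu_max` and `container_cpu_limit` are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Assumption: cgroup v2 layout; cgroup v1 exposes the quota in different files.
CGROUP_V2_CPU_MAX = Path("/sys/fs/cgroup/cpu.max")


def parse_cpu_max(text: str) -> Optional[float]:
    """Convert the contents of cpu.max into a CPU count.

    Returns None when the quota is "max", meaning no limit is set.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def container_cpu_limit(path: Path = CGROUP_V2_CPU_MAX) -> Optional[float]:
    """Read the CPU limit for the current cgroup.

    Returns None both when no limit is set and when the file cannot
    be read or parsed (for example, outside a cgroup v2 environment).
    """
    try:
        return parse_cpu_max(path.read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    limit = container_cpu_limit()
    print("CPU limit:", "none detected" if limit is None else limit)
```

Run inside the container, this prints the effective quota in CPUs, so a container started with a 2-CPU quota would be expected to report 2.0.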

timwillhack commented 7 months ago

I'm still unable to embed any documents in the Windows app with the default embedder. I was hoping this would clear up enough to at least see how well it works for a couple of smallish documents.

timothycarambat commented 6 months ago

Closing as stale.

gaungxl commented 6 months ago

> I'm still unable to embed any documents using the Windows app using the default embedder. Was hoping this would clear up enough to at least see how well this works for a couple smallish documents.

Hi. I ran into the same problem and accidentally found a fix for the AnythingLLM Windows app on Windows Server 2022:

1. Close AnythingLLM.
2. Delete the ".env" file in C:\Users\Administrator\AppData\Roaming\anythingllm-desktop\storage.
3. Reopen the app, then set the LLM to llama3, the embedder to the AnythingLLM Embedder, and the vector database to LanceDB.

After that, it works properly. However, if the embedder is set to nomic-embed-text in Ollama, it does not work, and I can't explain why. As an amateur, I'm glad to answer a question on GitHub for the first time instead of always copying other people's solutions.

timothycarambat commented 6 months ago

All your solution did was switch the embedder provider back to the native provider. If you get this error with Ollama, either your connection information is incorrect or you set an invalid context length.
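A quick way to rule out the connection side is to request one embedding from Ollama directly and look at the vector length. This is a rough sketch, assuming Ollama's /api/embeddings endpoint, which accepts a JSON body with "model" and "prompt" and returns an "embedding" array. The base URL, the model name in the usage comment, and the helper names `embedding_dimension` and `fetch_embedding_response` are assumptions for illustration.

```python
import json
import urllib.request


def embedding_dimension(response_body: str) -> int:
    """Return the length of the "embedding" array in a JSON response.

    Raises ValueError when the response has no non-empty embedding.
    """
    payload = json.loads(response_body)
    embedding = payload.get("embedding")
    if not isinstance(embedding, list) or not embedding:
        raise ValueError("response does not contain a non-empty 'embedding' list")
    return len(embedding)


def fetch_embedding_response(base_url: str, model: str, text: str,
                             timeout: float = 30.0) -> str:
    """POST a single prompt to the embeddings endpoint and return the raw body."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/embeddings",
        data=json.dumps({"model": model, "prompt": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example usage (requires a running Ollama instance; URL and model are assumptions):
# body = fetch_embedding_response("http://localhost:11434", "nomic-embed-text", "hello")
# print(embedding_dimension(body))
```

If the request fails, the connection information is the likely problem. If it succeeds, the printed dimension can be compared with the one the workspace was originally embedded with.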