Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, and more.
https://anythingllm.com
MIT License

Unable to upload PDF files: Table was not found #2521

Open Timi7007 opened 1 month ago

Timi7007 commented 1 month ago

How are you running AnythingLLM?

Docker (local)

What happened?

I'm afraid I'm doing something wrong, as I can't get documents added using the "Save and embed" dialog. Logs show the following:

[backend] info: Adding new vectorized document into namespace buffalo-bills
[backend] info: [RecursiveSplitter] Will split with {"chunkSize":8192,"chunkOverlap":20}
[backend] info: Chunks created from document: 15
[backend] info: [OllamaEmbedder] Embedding 15 chunks of text with nomic-embed-text:latest.
[backend] info: Inserting vectorized chunks into LanceDB collection.
[backend] error: addDocumentToNamespace Table 'buffalo-bills' was not found
[backend] error: Failed to vectorize Buffalo Bills - Wikipedia.pdf

The "Table 'buffalo-bills' was not found" error gets forwarded to the frontend.

Are there known steps to reproduce?

The vector DB is set to LanceDB (the default) and the embedding provider is Ollama. I've tried different embedding models with the same result.

Timi7007 commented 1 month ago

I've just tried again using the native "AnythingLLM Embedder" with the following non-functional result:

[backend] info: Adding new vectorized document into namespace buffalo-bills
[backend] info: [NativeEmbedder] Initialized
[backend] info: [RecursiveSplitter] Will split with {"chunkSize":1000,"chunkOverlap":20}
[backend] info: Chunks created from document: 76
[backend] info: [NativeEmbedder] Embedded Chunk 1 of 4
[backend] info: [NativeEmbedder] Embedded Chunk 2 of 4
[backend] info: [NativeEmbedder] Embedded Chunk 3 of 4
[backend] info: [NativeEmbedder] Embedded Chunk 4 of 4
[backend] info: Inserting vectorized chunks into LanceDB collection.
[backend] error: addDocumentToNamespace lance error: LanceError(IO): Generic LocalFileSystem error: Unable to copy file from /app/server/storage/lancedb/buffalo-bills.lance/_versions/.tmp_1.manifest_4be321d5-b2ec-4add-9c83-2c258c1669b6 to /app/server/storage/lancedb/buffalo-bills.lance/_versions/1.manifest: Function not implemented (os error 38), /home/build_user/.cargo/registry/src/index.crates.io-6f17d22bba15001f/lance-table-0.12.1/src/io/commit.rs:692:54
[backend] error: Failed to vectorize 2021 Buffalo Bills season - Wikipedia.pdf
[backend] info: Adding new vectorized document into namespace buffalo-bills
[backend] info: [NativeEmbedder] Initialized
[backend] info: [RecursiveSplitter] Will split with {"chunkSize":1000,"chunkOverlap":20}
[backend] info: Chunks created from document: 95
[backend] info: [NativeEmbedder] Embedded Chunk 1 of 4
[backend] info: [NativeEmbedder] Embedded Chunk 2 of 4
[backend] info: [NativeEmbedder] Embedded Chunk 3 of 4
[backend] info: [NativeEmbedder] Embedded Chunk 4 of 4
[backend] info: Inserting vectorized chunks into LanceDB collection.
[backend] error: addDocumentToNamespace Table 'buffalo-bills' was not found
[backend] error: Failed to vectorize 2022 Buffalo Bills season - Wikipedia.pdf

These even seem like separate errors. Please advise.

timothycarambat commented 1 month ago

What does your PDF look like? Clearly there is some external or embedded reference to a table that cannot be parsed out of the document.

Timi7007 commented 3 weeks ago

Tried again with a plain .txt, single line, one sentence, no special characters. Still the same issue:

[collector] info: -- Working test.txt --
[collector] info: [SUCCESS]: test.txt converted & ready for embedding.
[backend] info: [CollectorApi] Document test.txt uploaded processed and successfully. It is now available in documents.
[backend] info: [Event Logged] - document_uploaded
[backend] info: Adding new vectorized document into namespace testworkspace
[backend] info: [NativeEmbedder] Initialized
[backend] info: [RecursiveSplitter] Will split with {"chunkSize":1000,"chunkOverlap":20}
[backend] info: Chunks created from document: 1
[backend] info: [NativeEmbedder] Embedded Chunk 1 of 1
[backend] info: Inserting vectorized chunks into LanceDB collection.
[backend] error: addDocumentToNamespace lance error: LanceError(IO): Generic LocalFileSystem error: Unable to copy file from /app/server/storage/lancedb/testworkspace.lance/_versions/.tmp_1.manifest_404b8afe-8daf-4083-9a62-785ca4d619a9 to /app/server/storage/lancedb/testworkspace.lance/_versions/1.manifest: Function not implemented (os error 38), /home/build_user/.cargo/registry/src/index.crates.io-6f17d22bba15001f/lance-table-0.12.1/src/io/commit.rs:692:54
[backend] error: Failed to vectorize test.txt
[backend] info: [Event Logged] - workspace_documents_added
[backend] info: Adding new vectorized document into namespace testworkspace
[backend] info: [NativeEmbedder] Initialized
[backend] info: [RecursiveSplitter] Will split with {"chunkSize":1000,"chunkOverlap":20}
[backend] info: Chunks created from document: 1
[backend] info: [NativeEmbedder] Embedded Chunk 1 of 1
[backend] info: Inserting vectorized chunks into LanceDB collection.
[backend] error: addDocumentToNamespace Table 'testworkspace' was not found
[backend] error: Failed to vectorize test.txt
[backend] info: [TELEMETRY SENT] {"event":"documents_embedded_in_workspace","properties":{"LLMSelection":"ollama","Embedder":"native","VectorDbSelection":"lancedb","TTSSelection":"native","runtime":"docker"}}
[backend] info: [Event Logged] - workspace_documents_added

timothycarambat commented 3 weeks ago

This is your issue; it's from the LanceDB integration for storing the vectors:

[backend] error: addDocumentToNamespace lance error: LanceError(IO): Generic LocalFileSystem error: Unable to copy file from /app/server/storage/lancedb/testworkspace.lance/_versions/.tmp_1.manifest_404b8afe-8daf-4083-9a62-785ca4d619a9 to /app/server/storage/lancedb/testworkspace.lance/_versions/1.manifest: Function not implemented (os error 38), /home/build_user/.cargo/registry/src/index.crates.io-6f17d22bba15001f/lance-table-0.12.1/src/io/commit.rs:692:54

The issue says you are running in Docker. What OS is the host running, and is this the official image or a custom build?

So the root cause is that IO error, which causes upserts to fail because tables cannot be written to Lance files.
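
For anyone triaging similar logs, here is a minimal sketch of pulling the OS error number out of a backend log line. The function name `extract_os_error` and the small errno table are illustrative assumptions, not part of AnythingLLM or LanceDB. The table lists Linux values only, since the container runs Linux, where 38 is ENOSYS ("Function not implemented"). The sample line is an abbreviated copy of the error above.

```python
import re
from typing import Optional, Tuple

# Linux errno values only (illustrative subset, not exhaustive).
LINUX_ERRNO_NAMES = {
    13: "EACCES",
    28: "ENOSPC",
    38: "ENOSYS",  # "Function not implemented"
}

# Matches the "(os error N)" suffix that appears in the lance error line.
OS_ERROR_RE = re.compile(r"\(os error (\d+)\)")


def extract_os_error(log_line: str) -> Optional[Tuple[int, str]]:
    """Return (errno, name) if the line carries an OS error code, else None."""
    match = OS_ERROR_RE.search(log_line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, LINUX_ERRNO_NAMES.get(code, "unknown")


# Abbreviated version of the log line quoted in this thread.
sample = (
    "[backend] error: addDocumentToNamespace lance error: LanceError(IO): "
    "Generic LocalFileSystem error: Unable to copy file: "
    "Function not implemented (os error 38)"
)
result = extract_os_error(sample)
```

A line such as "Table 'testworkspace' was not found" carries no OS error code, so the function returns None for it. That makes it easier to separate the filesystem failure from the follow-on message.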

timothycarambat commented 3 weeks ago

We have both an x86 and an ARM image available. Typically, trying to run an image built for an incompatible architecture on the host via Docker causes issues like this. Also, when the Docker storage is mounted to a network drive, this can cause IO operation failures.
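
To check whether the mounted storage location is the problem, a rough probe like the one below can be run inside the container against the storage directory. It is a sketch under assumptions: the function name `probe_storage` is hypothetical, and I am assuming, not asserting, that the failing manifest copy relies on one of these primitives (plain copy, hard link, or rename). It only reports which operations the filesystem rejects.

```python
import os
import shutil
import tempfile
from typing import Dict, Optional


def probe_storage(directory: str) -> Dict[str, Optional[str]]:
    """Try basic file operations in `directory`.

    Returns a mapping of operation name to None (success) or an error string.
    """
    results: Dict[str, Optional[str]] = {}
    # Work inside a throwaway subdirectory so nothing real is touched.
    workdir = tempfile.mkdtemp(prefix="probe_", dir=directory)
    try:
        src = os.path.join(workdir, "src.tmp")
        with open(src, "w") as handle:
            handle.write("probe")

        # Rename runs last because it moves the source file away.
        operations = {
            "copy": lambda: shutil.copyfile(src, os.path.join(workdir, "copy.tmp")),
            "hard_link": lambda: os.link(src, os.path.join(workdir, "link.tmp")),
            "rename": lambda: os.rename(src, os.path.join(workdir, "renamed.tmp")),
        }
        for name, operation in operations.items():
            try:
                operation()
                results[name] = None
            except OSError as exc:
                results[name] = f"errno {exc.errno}: {exc.strerror}"
    finally:
        shutil.rmtree(workdir, ignore_errors=True)
    return results
```

Calling `probe_storage("/app/server/storage")` inside the container (the path comes from the logs above) and seeing an `errno 38` entry would point at the mount rather than at the documents being embedded.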