Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, and more.
https://anythingllm.com
MIT License
22.29k stars 2.26k forks

[BUG]: Can not save & embed files in workspace with name containing "+" #1877

Closed paul-sonnenschein closed 1 month ago

paul-sonnenschein commented 1 month ago

How are you running AnythingLLM?

Docker (remote machine)

What happened?

AnythingLLM is set up using Docker Compose, with a LocalAI LLM backend, the LanceDB vector database, and the built-in embedder.

Creating a new workspace whose name contains the character "+" works without any error. However, attempting to "Save & Embed" an uploaded file results in an error message instead of a successful embedding. Embedding into a workspace whose name does not contain "+" reports success.

Expected behavior:

Either workspace creation fails with a suitable error message indicating the unsupported character, or a supported table name is selected automatically.
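Either behavior could be sketched roughly as follows; the helper names are hypothetical (not from the AnythingLLM codebase), and the regex simply mirrors the rule quoted in LanceDB's error message:

```javascript
// The rule from LanceDB's error message: table names may only contain
// alphanumeric characters, underscores, hyphens, and periods.
const VALID_TABLE_NAME = /^[A-Za-z0-9_.-]+$/;

// Option 1: reject the workspace name up front with a clear error.
function assertValidTableName(name) {
  if (!VALID_TABLE_NAME.test(name)) {
    throw new Error(
      `Workspace name "${name}" contains characters LanceDB does not support in table names.`
    );
  }
}

// Option 2: derive a supported table name automatically by replacing
// each unsupported character with a hyphen.
function toSupportedTableName(name) {
  return name.replace(/[^A-Za-z0-9_.-]/g, "-");
}

console.log(toSupportedTableName("Test+Test")); // "Test-Test"
```

Option 2 would silently accept names like "Test+Test" by mapping them to "Test-Test"; option 1 surfaces the constraint at creation time instead of at embed time.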

Error message + Log output:

Error message:

Invalid table name ("test+test"): Table names can only contain alphanumeric characters, underscores, hyphens, and periods

Log excerpt:

[backend] info: Adding new vectorized document into namespace
[backend] info: Cached vectorized results of custom-documents/Not-emtpy.txt-f6ee5465-6e12-44b3-9309-466ab7b39d5e.json found! Using cached data to save on embed costs.
[backend] error: addDocumentToNamespace
[backend] error: Failed to vectorize
[backend] info: [Event Logged] - workspace_documents_added

Are there known steps to reproduce?

  1. Setup AnythingLLM
  2. Create a workspace with a name containing "+", e.g. "Test+Test".
  3. Upload a non-empty file.
  4. Select "Move to workspace".
  5. Press "Save and Embed".

CLBarajas commented 1 month ago

I think this can also affect transcripts pulled from the YouTube data connector, if the video title includes a "+" character.

vincenthaney commented 1 month ago

Hey! I am still experiencing this issue, following exactly the same steps as above.

I tried with .TXT and .PDF files.

I get a warning from LanceDB: invalid ENV settings.

Here is the env file:

```
# Auto-dump ENV from system call on 17:27:25 GMT+0000 (Coordinated Universal Time)
LLM_PROVIDER='ollama'
EMBEDDING_MODEL_PREF='nomic-embed-text:latest'
OLLAMA_BASE_PATH='http://host.docker.internal:11434'
OLLAMA_MODEL_PREF='llama3'
OLLAMA_MODEL_TOKEN_LIMIT='4096'
EMBEDDING_BASE_PATH='http://host.docker.internal:11434'
EMBEDDING_MODEL_MAX_CHUNK_LENGTH='8192'
STORAGE_DIR='/app/server/storage'
SERVER_PORT='3001'
SIG_KEY=
SIG_SALT=
```

Please note that Ollama is working properly.

Here's the log from the server:

```
[backend] info: [EncryptionManager] Loaded existing key & salt for encrypting arbitrary data.
[collector] info: -- Working New Text Document (2).txt --
[collector] info: [SUCCESS]: New Text Document (2).txt converted & ready for embedding.
[backend] info: [CollectorApi] Document New Text Document (2).txt uploaded processed and successfully. It is now available in documents.
[backend] info: [TELEMETRY SENT]
[backend] info: [Event Logged] - document_uploaded
[backend] info: Adding new vectorized document into namespace
[backend] info: [NativeEmbedder] Initialized
[backend] info: [RecursiveSplitter] Will split with
[backend] info: Chunks created from document:
[backend] info: [NativeEmbedder] Embedded Chunk 1 of 4
[backend] info: [NativeEmbedder] Embedded Chunk 2 of 4
[backend] info: [NativeEmbedder] Embedded Chunk 3 of 4
[backend] info: [NativeEmbedder] Embedded Chunk 4 of 4
[backend] info: Inserting vectorized chunks into LanceDB collection.
[backend] error: addDocumentToNamespace
[backend] error: Failed to vectorize
[backend] info: [TELEMETRY SENT]
[backend] info: [Event Logged] - workspace_documents_added
```

Please let me know if you find a fix for this! Thanks