Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with full RAG and AI Agent capabilities.
https://useanything.com
MIT License

[BUG]: Confluence connector only saves partial document body in JSON #1501

Closed: jazelly closed this 1 day ago

jazelly commented 1 month ago

How are you running AnythingLLM?

Local development

What happened?

After the Confluence connector scrapes Confluence documents, the document bodies are not fully saved in the JSON files under storage.

After embedding them, the results are not as useful as expected. For example, we have a Confluence doc containing some code snippets and would like to ask questions that retrieve them. However, the code snippets are lost during scraping, which causes the LLM to respond with only basic info.

I am not sure if this is a limitation of the Atlassian API, but users would surely expect more than just the basic info of the Confluence documents.
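
One way to check whether the Atlassian API itself is the limitation is to request the page body in storage format directly. A minimal sketch using the `requests` library; the site URL, page ID, and credentials below are placeholders:

```python
# Minimal sketch: fetch a Confluence page body in "storage" format to confirm
# that code-block macros are present in the raw API response.
# BASE_URL, PAGE_ID, email, and API token are placeholders for your own site.
import requests

BASE_URL = "https://your-site.atlassian.net/wiki"
PAGE_ID = "123456"

resp = requests.get(
    f"{BASE_URL}/rest/api/content/{PAGE_ID}",
    params={"expand": "body.storage"},
    auth=("you@example.com", "API_TOKEN"),
)
resp.raise_for_status()

storage_html = resp.json()["body"]["storage"]["value"]
# Code snippets appear as <ac:structured-macro ac:name="code"> blocks here,
# so if they show up in storage_html the loss happens later, during parsing.
print(storage_html)
```

If the code macros are present in the storage-format response, the content is being dropped on our side rather than by the API.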

Are there known steps to reproduce?

No response

timothycarambat commented 1 month ago

@jazelly is the pageContent of the associated document empty?

jazelly commented 1 month ago

@timothycarambat the pageContent is not empty. It has content, but it does not include the code snippet content, e.g.

VIEW ALL\nsql\nASSIGN TO AN ACCOUNT\nThe account must already exist.\nsql\n

Notice the sql in the pageContent, which is supposed to be followed by a SQL command. The LLM makes up answers when we ask a question related to it, since the prompt contains no reference to the real command.

jainpradeep commented 1 month ago

Facing this issue with a local deployment as well. LLM responses are poor.

timothycarambat commented 1 month ago

> Facing this issue with a local deployment as well. LLM responses are poor.

This has nothing to do with the deployment method or RAG structure; the RAG results are bad because the scraper is returning poor information from the documents. As @jazelly mentions, it seems that some non-text blocks are not returned or parsed by the LangChain parser, which is where the problem lies.
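
For reference, this is roughly how LangChain's Confluence loader is driven in Python (the collector here uses the JS port, so this is only an analogy). The `keep_markdown_format` option of the Python loader may preserve code blocks better than the default plain-text extraction; treat the exact parameters as assumptions to verify against the installed version:

```python
# Rough illustration of driving LangChain's Confluence loader (Python variant).
# URL, credentials, and space key are placeholders; keep_markdown_format is an
# option of the Python loader that may keep code blocks better than the default
# plain-text extraction -- verify against the version you have installed.
from langchain_community.document_loaders import ConfluenceLoader

loader = ConfluenceLoader(
    url="https://your-site.atlassian.net/wiki",
    username="you@example.com",
    api_key="API_TOKEN",
    space_key="DOCS",
    keep_markdown_format=True,  # render via markdownify instead of stripping tags
    limit=50,
)

docs = loader.load()
for doc in docs:
    # Inspect page_content to see whether code blocks survived parsing.
    print(doc.metadata.get("title"), len(doc.page_content))
```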

jazelly commented 1 month ago

This might be an issue better raised with the LangChain community.

For us, the current workaround is simply to write our own scraper to download these documents and upload them to anything-llm via its API.
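
A rough sketch of that workaround, assuming the Confluence REST content endpoint and an anything-llm raw-text upload endpoint; the anything-llm endpoint path, header, and payload shape here are assumptions and should be checked against your instance's API docs:

```python
# Sketch of the workaround: scrape Confluence ourselves and upload the text to
# anything-llm. The anything-llm endpoint path and payload shape are assumptions
# based on the developer API docs -- confirm them against your instance.
import requests
from bs4 import BeautifulSoup

CONFLUENCE = "https://your-site.atlassian.net/wiki"
ANYTHINGLLM = "http://localhost:3001/api"
ANYTHINGLLM_KEY = "API_KEY"

def fetch_page_text(page_id: str) -> tuple[str, str]:
    """Return (title, text) for a Confluence page from its storage-format body."""
    resp = requests.get(
        f"{CONFLUENCE}/rest/api/content/{page_id}",
        params={"expand": "body.storage"},
        auth=("you@example.com", "CONFLUENCE_TOKEN"),
    )
    resp.raise_for_status()
    data = resp.json()
    soup = BeautifulSoup(data["body"]["storage"]["value"], "html.parser")
    # Crude flattening; swap in whatever HTML-to-text step best preserves the
    # <ac:structured-macro ac:name="code"> bodies for your pages.
    return data["title"], soup.get_text("\n")

def upload_raw_text(title: str, text: str) -> None:
    """Push plain text into anything-llm via its (assumed) raw-text endpoint."""
    resp = requests.post(
        f"{ANYTHINGLLM}/v1/document/raw-text",
        headers={"Authorization": f"Bearer {ANYTHINGLLM_KEY}"},
        json={"textContent": text, "metadata": {"title": title}},
    )
    resp.raise_for_status()

title, text = fetch_page_text("123456")
upload_raw_text(title, text)
```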