QuivrHQ / quivr

Open-source RAG framework for building GenAI second brains 🧠. Build a productivity assistant (RAG) ⚡️🤖 and chat with your docs (PDF, CSV, ...) and apps using LangChain, GPT-3.5 / 4 Turbo, Anthropic, VertexAI, Ollama, Groq, and other LLMs, private by design and shareable with your users! An efficient retrieval-augmented generation framework.
https://quivr.com

[Bug]: Knowledge base addition does not work. #2411

Closed: aomeng1219 closed this issue 2 months ago

aomeng1219 commented 5 months ago

What happened?

The process runs and the PDF files upload without errors, but when I ask about the contents of these files, the reply is always "unknown". It seems that nothing has been added to the knowledge base.

Relevant log output

No response

Twitter / LinkedIn details

No response

dosubot[bot] commented 5 months ago

Hey there @aomeng1219! 👋 I'm Dosu, your friendly neighborhood bot here to lend a hand with bugs, answer your queries, and help you dive into contributing while we wait for a human maintainer to swing by. I'll be taking a look at your issue about PDF files not adding to the knowledge base in quivr and get back to you with a full answer shortly. Hang tight! 🚀

aomeng1219 commented 5 months ago

Did PDF fail to produce text vector during import?

dosubot[bot] commented 5 months ago

Based on the information you've provided and the similar issues I found, it seems like the problem might be related to the state of the DocumentSerializable object when the first question is asked. If the document content is being read normally as per the system log, then the issue might be with how the content is being stored or retrieved.

One possible cause could be a race condition where the question is asked before the document content is fully processed and stored. This could explain why the second question can get a normal response, as by that time the processing might have completed.
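If such a race condition is the cause, one generic workaround is to poll an ingestion-status check before asking the first question. The sketch below is illustrative only: `is_processed` is a hypothetical callback standing in for whatever status check the application exposes, stubbed here so the example runs standalone.

```python
import time


def wait_until_processed(is_processed, timeout_s=30.0, poll_s=0.5):
    """Poll a (hypothetical) is_processed() callback until ingestion
    completes, or give up after timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_processed():
            return True
        time.sleep(poll_s)
    return False


# Usage sketch: a stub status check that "completes" on the third poll.
calls = {"n": 0}

def fake_status():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until_processed(fake_status, timeout_s=5.0, poll_s=0.01))  # True
```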

To debug this, you could add some logging in the upload_file_storage function and the from_json method of the DocumentSerializable class to check the state of the document content at different stages. Here's an example of how you could do this:

# Suggested imports (module paths may differ in your checkout of Quivr):
import json

from pydantic import Field
from langchain_core.documents import Document
from supabase import Client


def upload_file_storage(file, file_identifier: str):
    # get_supabase_client comes from Quivr's own settings module
    supabase_client: Client = get_supabase_client()
    response = supabase_client.storage.from_("quivr").upload(file_identifier, file)
    print(f"Upload response: {response}")  # Add logging
    return response


class DocumentSerializable(Document):
    page_content: str
    metadata: dict = Field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps({"page_content": self.page_content, "metadata": self.metadata})

    @classmethod
    def from_json(cls, json_str: str):
        data = json.loads(json_str)
        print(f"Data from JSON: {data}")  # Add logging
        return cls(page_content=data["page_content"], metadata=data["metadata"])

This should give you more information about what's happening when the document content is being processed.

If the issue persists, it would be helpful to know the exact version of Quivr you're using, as well as any error messages you're seeing in the console or logs. This will help narrow down the possible causes of the issue.

Regarding your question about the PDF files failing to produce a text vector during the import process, yes, this could indeed be the cause. The UnstructuredPDFLoader class from langchain_community.document_loaders is used to load and possibly parse the PDF files. If this loader fails to extract text from the PDFs correctly, it could prevent the generation of a text vector, which is essential for adding the content to the knowledge base.
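A quick way to confirm whether extraction is the culprit is to load the PDF and inspect the combined length of the extracted text. The helper below assumes the LangChain loader interface (a list of documents, each with a `page_content` attribute); the `FakeDoc` stub stands in for `UnstructuredPDFLoader(...).load()` so the sketch runs without the `unstructured` package installed.

```python
def extracted_text_is_empty(docs, min_chars=1):
    """Return True if the loader produced no usable text, e.g. for a
    scanned, image-only PDF. `docs` is any iterable of objects with a
    .page_content string attribute."""
    total = sum(len(d.page_content.strip()) for d in docs)
    return total < min_chars


# Stub standing in for UnstructuredPDFLoader(...).load():
class FakeDoc:
    def __init__(self, page_content):
        self.page_content = page_content


print(extracted_text_is_empty([FakeDoc("   ")]))          # True: no text extracted
print(extracted_text_is_empty([FakeDoc("Hello world")]))  # False: text was extracted
```

If this check returns True for your PDFs, the files likely need OCR before ingestion.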

To resolve this issue, you could:

  1. Verify the PDF Content: Ensure that the PDFs contain text that can be extracted. Some PDFs might have their content in images or in a format that the UnstructuredPDFLoader cannot parse.

  2. Update Dependencies: Make sure that all dependencies, especially those related to langchain_community.document_loaders, are up to date. An outdated dependency might lack the necessary functionality or fixes to parse certain PDFs.

  3. Debug the Loader: Add logging or debugging statements within the UnstructuredPDFLoader or the process_pdf function to check if the PDFs are being loaded correctly and if the text extraction process is completing successfully.

  4. Customize or Extend the Loader: If the UnstructuredPDFLoader is not suitable for your PDFs, consider customizing this loader or creating a new loader class that can handle your specific PDF format more effectively.

  5. Error Handling: Implement error handling in the process_pdf function to catch and log any exceptions during the PDF processing. This can help identify if the failure is due to an error in the text extraction process.
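Point 5 can be sketched as a thin wrapper; here `process_pdf` and its argument are hypothetical stand-ins for the real processing function, so that a failure is logged rather than the upload silently appearing to succeed.

```python
import logging

logger = logging.getLogger("pdf_ingest")


def safe_process_pdf(process_pdf, file_path):
    """Run a (hypothetical) process_pdf callable, logging any failure
    with a full traceback instead of swallowing it."""
    try:
        return process_pdf(file_path)
    except Exception:
        logger.exception("PDF processing failed for %s", file_path)
        return None


# Usage sketch with a deliberately failing processor:
def broken_processor(path):
    raise ValueError("could not extract text")

print(safe_process_pdf(broken_processor, "report.pdf"))  # None, with a logged traceback
```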

I hope this helps! If you have any other questions or need further clarification, please let me know.

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.

aomeng1219 commented 5 months ago

I found this error: "Error loading punkt: <urlopen error [Errno 99] Cannot"
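That message indicates NLTK (used by the unstructured PDF pipeline) failed to download its "punkt" tokenizer data, most likely because the environment has no outbound network access (Errno 99). The check below, which uses NLTK only if it is installed, confirms whether the data is already present locally; if not, punkt can be pre-downloaded on a machine with network access (e.g. `python -m nltk.downloader punkt`) and baked into the image.

```python
def punkt_available():
    """Return True if the NLTK 'punkt' tokenizer data is installed
    locally; False if NLTK is missing or the data was never downloaded."""
    try:
        import nltk
    except ImportError:
        return False
    try:
        nltk.data.find("tokenizers/punkt")
        return True
    except LookupError:
        return False


print(punkt_available())
```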

github-actions[bot] commented 2 months ago

Thanks for your contributions, we'll be closing this issue as it has gone stale. Feel free to reopen if you'd like to continue the discussion.