Closed aomeng1219 closed 2 months ago
Hey there @aomeng1219! 👋 I'm Dosu, your friendly neighborhood bot here to lend a hand with bugs, answer your queries, and help you dive into contributing while we wait for a human maintainer to swing by. I'll be taking a look at your issue about PDF files not adding to the knowledge base in quivr and get back to you with a full answer shortly. Hang tight! 🚀
Did the PDF fail to produce text vectors during import?
Based on the information you've provided and the similar issues I found, it seems like the problem might be related to the state of the DocumentSerializable object when the first question is asked. If the document content is being read normally as per the system log, then the issue might be with how the content is being stored or retrieved.
One possible cause could be a race condition where the question is asked before the document content is fully processed and stored. This could explain why the second question can get a normal response, as by that time the processing might have completed.
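If a race like this is suspected, one way to test the theory is to wait until processing reports completion before asking the first question. The following is only a minimal sketch: `wait_until` and the `ready` callable are hypothetical names, not part of Quivr, and you would need to supply your own check (for example, a query that confirms the document's chunks exist in the vector store).

```python
import time
from typing import Callable


def wait_until(ready: Callable[[], bool], timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll `ready` until it returns True or `timeout` seconds elapse.

    `ready` is a hypothetical, caller-supplied check, e.g. "are the
    document's chunks present in the vector store yet?".
    """
    deadline = time.monotonic() + timeout
    while True:
        if ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

If the first question only succeeds after `wait_until` returns True, that points toward a timing problem rather than a parsing problem.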
To debug this, you could add some logging in the upload_file_storage function and the from_json method of the DocumentSerializable class to check the state of the document content at different stages. Here's an example of how you could do this:
import json

# Document, Field, Client and get_supabase_client are assumed to be
# imported already by the Quivr module this code lives in.


def upload_file_storage(file, file_identifier: str):
    supabase_client: Client = get_supabase_client()
    response = supabase_client.storage.from_("quivr").upload(file_identifier, file)
    print(f"Upload response: {response}")  # Log the storage response
    return response


class DocumentSerializable(Document):
    page_content: str
    metadata: dict = Field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps({"page_content": self.page_content, "metadata": self.metadata})

    @classmethod
    def from_json(cls, json_str: str):
        data = json.loads(json_str)
        print(f"Data from JSON: {data}")  # Log the deserialized content
        return cls(page_content=data["page_content"], metadata=data["metadata"])
This should give you more information about what's happening when the document content is being processed.
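To try the same logging idea outside a running Quivr instance, a self-contained round-trip check can help. This is a sketch under assumptions: `DocSketch` is a hypothetical stand-in dataclass that mirrors the two fields of DocumentSerializable so the example runs without LangChain, and it uses the standard `logging` module instead of `print`.

```python
import json
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("doc_debug")


@dataclass
class DocSketch:
    """Hypothetical stand-in for DocumentSerializable (no LangChain needed)."""

    page_content: str
    metadata: dict = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps({"page_content": self.page_content, "metadata": self.metadata})

    @classmethod
    def from_json(cls, json_str: str) -> "DocSketch":
        data = json.loads(json_str)
        # Log the size rather than the full text to keep logs readable.
        logger.info(
            "Deserialized %d characters, metadata keys=%s",
            len(data["page_content"]),
            sorted(data["metadata"]),
        )
        if not data["page_content"].strip():
            logger.warning("page_content is empty after deserialization")
        return cls(page_content=data["page_content"], metadata=data["metadata"])


def round_trip(doc: DocSketch) -> DocSketch:
    """Serialize and deserialize a document to check nothing is lost."""
    return DocSketch.from_json(doc.to_json())
```

An empty-content warning here would suggest the text was already missing before serialization, which shifts suspicion to the PDF loader.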
If the issue persists, it would be helpful to know the exact version of Quivr you're using, as well as any error messages you're seeing in the console or logs. This will help narrow down the possible causes of the issue.
Regarding your question about the PDF files failing to produce a text vector during the import process, yes, this could indeed be the cause. The UnstructuredPDFLoader class from langchain_community.document_loaders is used to load and possibly parse the PDF files. If this loader fails to extract text from the PDFs correctly, it could prevent the generation of a text vector, which is essential for adding the content to the knowledge base.
To resolve this issue, you could:
Verify the PDF Content: Ensure that the PDFs contain text that can be extracted. Some PDFs might have their content in images or in a format that the UnstructuredPDFLoader cannot parse.
Update Dependencies: Make sure that all dependencies, especially those related to langchain_community.document_loaders, are up to date. An outdated dependency might lack the necessary functionality or fixes to parse certain PDFs.
Debug the Loader: Add logging or debugging statements within the UnstructuredPDFLoader or the process_pdf function to check if the PDFs are being loaded correctly and if the text extraction process is completing successfully.
Customize or Extend the Loader: If the UnstructuredPDFLoader is not suitable for your PDFs, consider customizing this loader or creating a new loader class that can handle your specific PDF format more effectively.
Error Handling: Implement error handling in the process_pdf function to catch and log any exceptions during the PDF processing. This can help identify if the failure is due to an error in the text extraction process.
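The content check and error-handling suggestions above could be combined roughly as follows. This is a minimal sketch, not Quivr's actual process_pdf: `extract_pdf_text` is a hypothetical helper, and `loader_factory` is assumed to be any class or callable that takes a file path and returns an object with a `load()` method yielding documents that have a `page_content` attribute (as LangChain loaders do).

```python
import logging
from typing import Callable, List

logger = logging.getLogger("pdf_debug")


def extract_pdf_text(path: str, loader_factory: Callable[[str], object]) -> List[str]:
    """Return the non-empty page texts for `path`, logging failures and empty results."""
    try:
        docs = loader_factory(path).load()
    except Exception:
        # Log the full traceback instead of letting the failure pass silently.
        logger.exception("Text extraction failed for %s", path)
        return []
    texts = [d.page_content for d in docs if d.page_content and d.page_content.strip()]
    if not texts:
        logger.warning("No extractable text in %s (scanned or image-only PDF?)", path)
    return texts
```

With LangChain installed, a call might look like `extract_pdf_text("report.pdf", UnstructuredPDFLoader)` after `from langchain_community.document_loaders import UnstructuredPDFLoader`. An empty result with no exception logged would suggest the PDF has no text layer.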
I hope this helps! If you have any other questions or need further clarification, please let me know.
I found the following error: "Error loading punkt: <urlopen error [Errno 99] Cannot"
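An "Error loading punkt" message is what NLTK prints when it cannot download its punkt tokenizer data, which would fit an environment that cannot reach the network. A minimal sketch for checking this from inside the same environment is below. `ensure_nltk_resource` is a hypothetical helper; it relies on `nltk.data.find` (which raises `LookupError` when a resource is missing) and `nltk.download` (which returns a boolean). The resource name is a parameter because it can differ between NLTK versions.

```python
from typing import Callable, Optional


def ensure_nltk_resource(
    resource: str = "tokenizers/punkt",
    package: str = "punkt",
    find: Optional[Callable[[str], object]] = None,
    download: Optional[Callable[[str], bool]] = None,
) -> bool:
    """Return True if the NLTK resource is available locally or could be downloaded."""
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be tested with stand-ins

        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource)  # raises LookupError when the data is not installed
        return True
    except LookupError:
        pass
    # Falls back to a download attempt; this needs network access.
    return bool(download(package))
```

If this returns False in the container, the tokenizer data would have to be provided some other way, for example by downloading it on a machine with network access and copying it into one of the directories listed in `nltk.data.path`.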
Thanks for your contributions, we'll be closing this issue as it has gone stale. Feel free to reopen if you'd like to continue the discussion.
What happened?
The process runs and PDF files upload, but when I ask about the contents of these files, the reply is always unknown. It seems that nothing has been added to the knowledge base.
Relevant log output
No response