danswer-ai / danswer

Gen-AI Chat for Teams - Think ChatGPT if it had access to your team's unique knowledge.
https://docs.danswer.dev/

PDF file upload not correctly parsed #1624

Open plopezamaya opened 2 weeks ago

plopezamaya commented 2 weeks ago

While using v3.0.79, it seems that some PDFs are not parsed correctly by the pypdf-based extraction (from pypdf import PdfReader) in backend/danswer/file_processing/extract_file_text.py.

As a result, the LLM answers that it cannot retrieve any information from the given document. Should an OCR reader or another framework be used for this?

arrfandannge commented 2 weeks ago

If the PDFs contain scanned images or are otherwise image-based, you might need OCR to extract the text. pytesseract, together with an image-processing library such as Pillow, should be able to extract text from images within PDFs.

plopezamaya commented 2 weeks ago

@arrfandannge Do you mean that what is implemented in danswer today should be replaced by OCR or Pillow?

arrfandannge commented 2 weeks ago

> @arrfandannge Do you mean that what is implemented in danswer today should be replaced by OCR or Pillow?

I suggest keeping the current implementation intact but adding code that uses OCR as a fallback method when pypdf doesn't help. This will keep the solution robust and ensure efficiency because extraction with pypdf is generally faster than OCR.

I can try implementing this solution in a separate branch and see how it works.
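
A minimal sketch of that fallback idea, not the actual danswer code: try pypdf first and run OCR only when the text layer comes back nearly empty. The function names, the MIN_TEXT_CHARS threshold, and the choice to OCR only the images embedded in each page are assumptions made for illustration. The OCR path needs pytesseract, Pillow, and a local Tesseract install.

```python
import io
from typing import Callable

# Hypothetical threshold: if pypdf yields fewer non-whitespace characters
# than this, the PDF is assumed to be image-based and OCR is attempted.
MIN_TEXT_CHARS = 20


def pypdf_text(pdf_bytes: bytes) -> str:
    """Fast path: read the embedded text layer with pypdf."""
    from pypdf import PdfReader  # third-party, imported lazily

    reader = PdfReader(io.BytesIO(pdf_bytes))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def ocr_text(pdf_bytes: bytes) -> str:
    """Slow path: OCR every image embedded in the PDF pages."""
    import pytesseract  # requires the Tesseract binary to be installed
    from PIL import Image
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(pdf_bytes))
    chunks = []
    for page in reader.pages:
        for embedded in page.images:
            image = Image.open(io.BytesIO(embedded.data))
            chunks.append(pytesseract.image_to_string(image))
    return "\n".join(chunks)


def extract_pdf_text(
    pdf_bytes: bytes,
    primary: Callable[[bytes], str] = pypdf_text,
    fallback: Callable[[bytes], str] = ocr_text,
    min_chars: int = MIN_TEXT_CHARS,
) -> str:
    """Return the primary extraction unless it looks empty, then try the fallback."""
    text = primary(pdf_bytes)
    if len("".join(text.split())) >= min_chars:
        return text  # text layer looks usable, skip the expensive OCR pass
    ocr = fallback(pdf_bytes)
    # Keep whatever pypdf found if OCR produced nothing at all.
    return ocr if ocr.strip() else text
```

Passing the extractors in as parameters keeps the existing pypdf path untouched and lets the decision logic be exercised without a Tesseract install. Scanned PDFs whose pages are not stored as embedded images would need page rendering instead, which this sketch does not cover.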