alejandro-ao / ask-multiple-pdfs

A Langchain app that allows you to chat with multiple PDFs
1.6k stars, 913 forks

Replace PyPDF2 with pypdfium2 #38

Open yiwei-ang opened 11 months ago

yiwei-ang commented 11 months ago

I really appreciate @alejandro-ao creating a good video that demonstrates the perfect blend of OpenAI, PDF readers, and Streamlit!

I've tried the tool on several PDFs and found a text extraction quality issue with PyPDF2: the content of a PDF is not extracted fully.

After looking into https://github.com/py-pdf/benchmarks, it seems we could switch to pypdfium2, which offers similar functionality with better text extraction quality and faster extraction (verified on my end!).
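
For reference, a minimal sketch of what the swap could look like, assuming pypdfium2 v4 or later is installed (`pip install pypdfium2`) and that the app collects text from a list of uploaded file objects. The function names `extract_text_pdfium` and `get_pdf_text` and the `extract` parameter are hypothetical, chosen for illustration, and may not match the repo's actual code:

```python
from typing import BinaryIO, Callable, Iterable


def extract_text_pdfium(pdf_file: BinaryIO) -> str:
    """Return the text of one PDF, page by page, using pypdfium2."""
    # Lazy import so this module still loads where pypdfium2 is not installed.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_file)
    try:
        page_texts = []
        for index in range(len(pdf)):
            page = pdf[index]
            textpage = page.get_textpage()
            page_texts.append(textpage.get_text_range())
            # Release the PDFium handles in reverse order of creation.
            textpage.close()
            page.close()
        return "\n".join(page_texts)
    finally:
        pdf.close()


def get_pdf_text(
    pdf_docs: Iterable,
    extract: Callable[..., str] = extract_text_pdfium,
) -> str:
    """Concatenate the text of several PDFs (hypothetical helper name).

    `extract` is injectable so a different backend can be swapped in.
    """
    return "\n".join(extract(doc) for doc in pdf_docs)
```

Because `PdfDocument` accepts file paths, bytes, and binary file objects, in-memory uploads should work without writing them to disk first, though I have only sketched this and not tested it against the app.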

IlianP commented 10 months ago

As a side note, LangChain also supports pypdfium2 as a document loader: https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf#using-pypdfium2
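
A rough sketch of the loader route, assuming a PDF saved on disk. The import path depends on the LangChain version (newer releases ship the loader in `langchain_community`), so both are tried here. The helper names `load_pdf_pages` and `pages_to_text` are hypothetical:

```python
def load_pdf_pages(path: str) -> list:
    """Load one PDF from disk as a list of LangChain documents, one per page."""
    # Lazy import; the module location varies between LangChain versions.
    try:
        from langchain_community.document_loaders import PyPDFium2Loader
    except ImportError:
        from langchain.document_loaders import PyPDFium2Loader
    return PyPDFium2Loader(path).load()


def pages_to_text(documents) -> str:
    """Join the page_content of each loaded document into one string."""
    return "\n".join(doc.page_content for doc in documents)
```

Note that the loader takes a file path, so files uploaded in memory would presumably need to be written to a temporary file first.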

costabm commented 8 months ago

I have added this important feature to my larger pull request (my first one ever). I gave you credit there, but I'm not sure this is the right way to do it.