Closed satyaloka93 closed 9 months ago
Hi @satyaloka93 and thanks for the report.
Although we know it works very well, we decided not to include PyMuPDF in the Haystack core because of its licensing. It can definitely be implemented as an external integration, though, if anybody wants to give it a shot.
Closing as won't fix in the context of this repo.
Is your feature request related to a problem? Please describe.
This request is meant to address problems I've noticed when using pypdf to convert PDF documents. On several occasions so far, pypdf has produced junk text for technical programming documents. One example is from Fundamentals of Python Programming (https://folk.ntnu.no/sverrsti/INGG1001-H2019/pythonbook_20191015.pdf), where code characters are represented as follows:
A more capable converter that I've tested produces the following for this section:
I noticed this behavior propagating into responses during RAG (I was using RAG to inform code generation). I was left scratching my head over where the model had come up with these characters, since they weren't in the source material but were instead generated by pypdf.
Describe the solution you'd like
Implement PyMuPDF as an alternative PDF converter.
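For illustration only, here is a minimal sketch of the text-extraction step such a converter might wrap. It is not the attached implementation and not a Haystack API: the function name `extract_pdf_text`, the `open_pdf` parameter (a hook for swapping in another opener, e.g. in tests), and the choice of a form feed as the page separator are all assumptions made for this example. It relies only on PyMuPDF's `fitz.open`, iteration over the document's pages, `Page.get_text()`, and `Document.close()`.

```python
from typing import Any, Callable, Optional


def extract_pdf_text(path: str, open_pdf: Optional[Callable[[str], Any]] = None) -> str:
    """Return the plain text of a PDF, with pages joined by form feeds.

    `open_pdf` is a hypothetical hook: any callable that returns an iterable
    of page objects exposing `get_text()`, plus a `close()` method.
    """
    if open_pdf is None:
        # Imported lazily so this module loads even without PyMuPDF installed.
        import fitz  # PyMuPDF; newer releases can also be imported as `pymupdf`

        open_pdf = fitz.open

    doc = open_pdf(path)
    try:
        # Page.get_text() with no arguments returns the page's plain text.
        # Joining pages with "\f" is an assumption of this sketch.
        return "\f".join(page.get_text() for page in doc)
    finally:
        # Release the file handle even if extraction fails part-way.
        doc.close()
```

With PyMuPDF installed, calling `extract_pdf_text("pythonbook_20191015.pdf")` on a local copy of the book linked above would return its text as a single string. A real converter would then wrap that string in a Haystack `Document`, and the extraction quality should still be checked by hand against the source PDF.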
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
A quick implementation of PyMuPDF in the pypdf.py file under components/converters is attached. It replaces the pypdf converter, but you could probably just keep both classes side by side.
pypdf_using_PyMuPDF.txt