Yuras / pdf-toolbox

A collection of tools for processing PDF files in Haskell

Process (by ignoring) inline images #82

Open alanz opened 1 year ago

alanz commented 1 year ago

I am interested in extracting text from bank statement PDFs. Some of these contain inline images, which the toolkit does not currently support. This change handles them by simply ignoring them in the stream and letting the parse continue. It also allows exporting the raw operators, not just the page glyphs.
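To illustrate the idea of ignoring inline images, here is a minimal sketch that drops every `BI ... ID <data> EI` span from a raw content stream before it is parsed. It is not the code from this PR and does not use the pdf-toolbox API: `stripInlineImages`, `skipPastEI`, `afterKeyword` and `isPdfSpace` are hypothetical names, and only the `bytestring` package is assumed. The sketch also assumes that `BI`, `ID` and `EI` appear as whitespace-delimited tokens and that `BI` never occurs inside a string literal. Finding `EI` this way is a heuristic, because the image data is arbitrary binary.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Char8 as B

-- | PDF whitespace bytes: NUL, HT, LF, FF, CR and SP.
isPdfSpace :: Char -> Bool
isPdfSpace c = c `elem` ("\0\t\n\f\r " :: String)

-- | Hypothetical helper: remove every inline image from a content stream,
-- keeping all other bytes (including surrounding whitespace) untouched.
stripInlineImages :: B.ByteString -> B.ByteString
stripInlineImages = B.concat . go
  where
    go s
      | B.null s = []
      | otherwise =
          let (ws, afterWs) = B.span isPdfSpace s
              (tok, rest) = B.break isPdfSpace afterWs
          in if tok == "BI"
               then ws : go (skipPastEI rest)   -- drop the whole image
               else ws : tok : go rest          -- keep any other token

-- | Given the bytes following a BI token, return what follows the first
-- delimited EI after the first delimited ID. An unterminated image
-- swallows the rest of the stream.
skipPastEI :: B.ByteString -> B.ByteString
skipPastEI = afterKeyword "EI" . afterKeyword "ID"

-- | Drop everything up to and including the first occurrence of the keyword
-- that is preceded by whitespace and followed by whitespace or end of input.
afterKeyword :: B.ByteString -> B.ByteString -> B.ByteString
afterKeyword kw s
  | B.null match = B.empty
  | precededBySpace && followedBySpace = after
  | otherwise = afterKeyword kw (B.drop 1 match)
  where
    (before, match) = B.breakSubstring kw s
    after = B.drop (B.length kw) match
    precededBySpace = not (B.null before) && isPdfSpace (B.last before)
    followedBySpace = B.null after || isPdfSpace (B.head after)
```

For example, applying `stripInlineImages` to `"BI /W 1 /H 1 ID xEIx EI Q"` leaves `" Q"`: the `EI` embedded in `xEIx` is skipped because it is not delimited by whitespace. A production parser would more likely decode the image dictionary and use the declared dimensions or filter to find the end of the data.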

Yuras commented 1 year ago

Hi, thank you for the PR!

There seem to be two independent features:

Yuras commented 1 year ago

@alanz I tried to make a test case to reproduce the issue with text extraction in the presence of inline images, i.e. I created a PDF file with an inline image and tried to extract text from it. Everything works well so far, see https://github.com/Yuras/pdf-toolbox/pull/83. It probably works purely by accident (we treat any unknown thing as an operator), but it does work. I assume you are running into some kind of corner case. Could you please help me identify the underlying issue? For example, share the PDF file or, if that is not possible, the problematic part of the content stream.