Process (by ignoring) inline images

Yuras / pdf-toolbox

A collection of tools for processing PDF files in Haskell


Process (by ignoring) inline images #82

Open alanz opened 1 year ago

alanz commented 1 year ago

I am interested in extracting text from bank statement PDFs. Some of these contain inline images, which the toolkit does not currently support. This PR handles them by simply skipping them in the content stream, so that parsing can continue. It also allows exporting the raw operators, not just the page glyphs.
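To illustrate the idea of skipping an inline image while letting the parse continue, here is a minimal sketch. It is not the code from this PR and does not use the pdf-toolbox API: `skipInlineImages`, `stripToken`, `afterToken`, and `example` are hypothetical names, and the content stream is modelled as a plain `String`. It relies only on the PDF convention that an inline image is written as `BI <dictionary entries> ID <image data> EI`. A real implementation would work on parsed tokens, because this naive scan can be fooled by a string literal that contains `BI` as a separate word, or by binary image data that happens to contain `EI` surrounded by whitespace.

```haskell
import Data.Char (isSpace)
import Data.List (stripPrefix)

-- | Hypothetical helper (not part of pdf-toolbox): drop every inline image
-- of the form "BI ... ID <data> EI" and keep the rest of the stream as is.
skipInlineImages :: String -> String
skipInlineImages = go True
  where
    -- The Bool is True at the start of input or right after whitespace,
    -- so "BI" is only matched as a whole token, never inside another one.
    go _ [] = []
    go atBoundary s@(c:cs)
      | atBoundary, Just rest <- stripToken "BI" s = go True (dropImage rest)
      | otherwise = c : go (isSpace c) cs

    -- Skip the image dictionary up to ID, then the image data up to EI.
    dropImage = afterToken "EI" . afterToken "ID"

-- | Strip a token from the front of the input, but only if it is followed
-- by whitespace or by the end of the input.
stripToken :: String -> String -> Maybe String
stripToken tok s = case stripPrefix tok s of
  Just rest | all isSpace (take 1 rest) -> Just rest
  _ -> Nothing

-- | Return whatever follows the first whitespace-delimited occurrence of the
-- token. An unterminated image consumes the rest of the input.
afterToken :: String -> String -> String
afterToken tok = go
  where
    go [] = []
    go (c:cs)
      | isSpace c, Just rest <- stripToken tok cs = rest
      | otherwise = go cs

-- | A made-up content stream: a tiny hex-encoded inline image, then text.
example :: String
example =
  "q 10 0 0 10 0 0 cm BI /W 1 /H 1 /CS /G /BPC 8 /F /AHx ID ff> EI Q BT (Hello) Tj ET"
```

Evaluating `words (skipInlineImages example)` in GHCi should leave only the `q ... cm`, `Q`, and text operators with their operands, which is what text extraction needs.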

Yuras commented 1 year ago

Hi, thank you for the PR!

There seem to be two independent features:

- Skip inline images. It's pretty clear what this is about, though I'll need some time to dig into the spec to figure out what exactly is going on here.
- Collect all operators. Why do you need them? Can pageExtractOperators be implemented separately, outside of the library? Is it general enough to be useful for other people? It would help if you described your use case.

Yuras commented 1 year ago

@alanz I tried to make a test case to reproduce the issue with text extraction in the presence of inline images. That is, I created a PDF file with an inline image and tried to extract the text. Everything works well so far, see https://github.com/Yuras/pdf-toolbox/pull/83. It probably works purely by accident (i.e. we treat any unknown thing as an operator), but it does work. I assume you are running into some kind of corner case. Could you please help me identify the underlying issue? For example, you could share the PDF file or, if that is not possible, the problematic part of the content stream.