Open alanz opened 1 year ago
Hi, thank you for the PR!
There seem to be two independent features:
pageExtractOperators
Could it be implemented separately outside of the library? Is it general enough to be useful for other people? It would help if you described your use case.

@alanz I tried to make a test case to reproduce the issue with text extraction in the presence of inline images, i.e. I created a PDF file with an inline image and tried to extract text. Everything works well so far, see https://github.com/Yuras/pdf-toolbox/pull/83. It probably works purely by accident (we treat any unknown thing as an operator), but it does work. I assume you are running into some kind of corner case. Could you please help me identify the underlying issue? E.g. share the PDF file or, if that is not possible, the problematic part of the content stream.
I am interested in extracting text from bank statement PDFs. Some of these contain inline images, which the toolkit does not currently support. This PR handles them by simply skipping them in the content stream and letting the parse continue. It also allows exporting the raw operators, not just page glyphs.
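To make the "skip and continue" idea concrete, here is a minimal sketch in Haskell. It is not the code from this PR and not part of the pdf-toolbox API: `stripInlineImages` is a hypothetical helper, and it assumes the content stream is available as a plain `String`. It relies on the fact that an inline image is delimited by the `BI`, `ID` and `EI` operators, and it drops everything from `BI` through `EI`. It deliberately ignores two real-world complications: a `BI` token inside a string literal, and image data that happens to contain whitespace followed by `EI`.

```haskell
import Data.Char (isSpace)
import Data.List (isPrefixOf)

-- | Remove every inline image (BI <dict> ID <data> EI) from a content stream.
-- Hypothetical helper for illustration only; not part of pdf-toolbox.
stripInlineImages :: String -> String
stripInlineImages = go True
  where
    -- The Bool is True when the previous character was whitespace (or we are
    -- at the start), so "BI" is only recognised at a token boundary.
    go _ [] = []
    go atBoundary s@(c:cs)
      | atBoundary, Just rest <- keyword "BI" s =
          go True (skipPast "EI" (skipPast "ID" rest))
      | otherwise = c : go (isSpace c) cs

    -- Match a keyword that is followed by whitespace or end of input,
    -- returning the input after the keyword.
    keyword kw s
      | kw `isPrefixOf` s, endsToken (drop (length kw) s) = Just (drop (length kw) s)
      | otherwise = Nothing

    endsToken [] = True
    endsToken (c:_) = isSpace c

    -- Discard input up to and including the next whitespace-preceded keyword.
    -- An unterminated image swallows the rest of the stream.
    skipPast _ [] = []
    skipPast kw (c:cs)
      | isSpace c, Just rest <- keyword kw cs = rest
      | otherwise = skipPast kw cs
```

For example, `stripInlineImages "q BI /W 1 ID x EI Q"` leaves only the surrounding `q` and `Q` operators, so a text extractor run on the result never sees the image data.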