LibrePDF / OpenPDF

OpenPDF is a free Java library for creating and editing PDF files, with an LGPL and MPL open source license. OpenPDF is based on a fork of iText. We welcome contributions from other developers. Please feel free to submit pull requests and bug reports to this GitHub repository.

Text extractor NullPointerException #650

Open seanleblancicdtech opened 2 years ago

seanleblancicdtech commented 2 years ago

I get an NPE when trying to extract text from a page of a file. Unfortunately, this might be hard to troubleshoot, because I can't share the file.

However, what I see when tracing is that the Do operator seems to end up with a null value for resources in Do.invoke of PdfContentStreamHandler.java (line 967), which throws a NullPointerException:

PdfDictionary dictionary = resources.getAsDict(PdfName.XOBJECT);

If I move back in the call stack, I can see resources2 is null here:

(line 982, Do.invoke of PdfContentStreamHandler) processContent(data, resources2);

Any ideas of what to look for here? Since we also include PDFBox in our application, I've resorted (for now) to using that to extract the text, and that works, but I'd rather use OpenPDF.

andreasrosdal commented 1 month ago

Hello, can you please share a PDF file where this problem occurs? This will make it easier to make a fix.

The issue you are encountering is related to the resources dictionary sometimes being null. This typically happens if the page does not contain a resources dictionary directly. However, the resources dictionary might be inherited from the parent pages (for example, from a "Pages" dictionary).
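To illustrate the inheritance idea, here is a minimal sketch of a lookup that walks from a page up through its parents until it finds a resources dictionary. It is not OpenPDF code: the dictionaries are modelled with plain `java.util.Map` objects so the example runs on its own, and the class name `InheritedResources`, the method `findResources`, and the depth limit of 64 are all hypothetical. In OpenPDF the same walk would presumably use `PdfDictionary.getAsDict` with `PdfName.RESOURCES` and `PdfName.PARENT`, but that is an assumption and not a tested fix.

```java
import java.util.HashMap;
import java.util.Map;

public class InheritedResources {

    // Walks from a page node up through its "Parent" links and returns the
    // first "Resources" entry found, or null if no ancestor defines one.
    // Plain maps stand in for PDF dictionaries (an assumption for this sketch).
    @SuppressWarnings("unchecked")
    static Map<String, Object> findResources(Map<String, Object> node) {
        int depth = 0;
        // The depth limit is an arbitrary guard against a cyclic Parent chain.
        while (node != null && depth++ < 64) {
            Object res = node.get("Resources");
            if (res instanceof Map) {
                return (Map<String, Object>) res;
            }
            Object parent = node.get("Parent");
            node = (parent instanceof Map) ? (Map<String, Object>) parent : null;
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Object> resources = new HashMap<>();
        resources.put("XObject", new HashMap<String, Object>());

        // The "Pages" node carries the resources; the page itself does not.
        Map<String, Object> pages = new HashMap<>();
        pages.put("Type", "Pages");
        pages.put("Resources", resources);

        Map<String, Object> page = new HashMap<>();
        page.put("Type", "Page");
        page.put("Parent", pages);

        // The page has no direct Resources entry, so the lookup falls back
        // to the one defined on its parent.
        System.out.println(findResources(page) == resources);
    }
}
```

A caller would still need to handle a `null` result, since a malformed file may define no resources anywhere in the chain.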

Pull requests welcome!