LibrePDF / OpenPDF

OpenPDF is a free Java library for creating and editing PDF files, available under the LGPL and MPL open source licenses. OpenPDF is based on a fork of iText. We welcome contributions from other developers. Please feel free to submit pull requests and bug reports to this GitHub repository.

NullPointerException using PdfTextExtractor #1210

Open gtoison opened 3 months ago

gtoison commented 3 months ago

Hello, I'm running into an NPE when using PdfTextExtractor with a file produced by a third party. The code has worked for a while, but it seems that the third party has updated something and I'm now getting the NPE.

java.lang.NullPointerException: Cannot invoke "com.lowagie.text.pdf.PdfDictionary.getAsDict(com.lowagie.text.pdf.PdfName)" because "resources" is null
    at com.lowagie.text.pdf.parser.PdfContentStreamHandler$SetTextFont.invoke(PdfContentStreamHandler.java:599)
    at com.lowagie.text.pdf.parser.PdfContentStreamHandler.lambda$invokeOperator$0(PdfContentStreamHandler.java:204)
    at java.base/java.util.Optional.ifPresent(Optional.java:178)
    at com.lowagie.text.pdf.parser.PdfContentStreamHandler.invokeOperator(PdfContentStreamHandler.java:204)
    at com.lowagie.text.pdf.parser.PdfContentStreamHandler$Do.processContent(PdfContentStreamHandler.java:989)
    at com.lowagie.text.pdf.parser.PdfContentStreamHandler$Do.invoke(PdfContentStreamHandler.java:976)
    at com.lowagie.text.pdf.parser.PdfContentStreamHandler.lambda$invokeOperator$0(PdfContentStreamHandler.java:204)
    at java.base/java.util.Optional.ifPresent(Optional.java:178)
    at com.lowagie.text.pdf.parser.PdfContentStreamHandler.invokeOperator(PdfContentStreamHandler.java:204)
    at com.lowagie.text.pdf.parser.PdfTextExtractor.processContent(PdfTextExtractor.java:218)
    at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(PdfTextExtractor.java:199)
    at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(PdfTextExtractor.java:177)

I'm getting the error with version 2.0.2

The problem seems to be on this line, where resource2 is null: https://github.com/LibrePDF/OpenPDF/blob/00afd24a1e44520dc929187cf3840381f5ea8160/openpdf/src/main/java/com/lowagie/text/pdf/parser/PdfContentStreamHandler.java#L968
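To illustrate what the stack trace suggests (the `Do` operator processes an XObject whose own resources are missing, and `SetTextFont` then dereferences a null `resources`), here is a minimal sketch of a null-safe fallback. It does not use the OpenPDF classes: dictionaries are modeled as plain `Map<String, Object>`, and the class and method names (`FormResourcesFallback`, `resourcesForForm`) are hypothetical. It assumes that falling back to the enclosing scope's resources is acceptable when the XObject has none of its own.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Simplified model, NOT the OpenPDF API: a PDF dictionary is a Map<String, Object>.
final class FormResourcesFallback {

    /**
     * Returns the XObject's own "Resources" entry when it is a dictionary.
     * Otherwise falls back to the resources currently on top of the stack
     * (the enclosing page or form), which is null only if the stack is empty.
     */
    static Map<String, Object> resourcesForForm(Map<String, Object> formDict,
                                                Deque<Map<String, Object>> resourcesStack) {
        Object own = formDict.get("Resources");
        if (own instanceof Map) {
            @SuppressWarnings("unchecked")
            Map<String, Object> typed = (Map<String, Object>) own;
            return typed;
        }
        // Hypothetical fallback: reuse the enclosing scope instead of pushing null.
        return resourcesStack.peek();
    }

    public static void main(String[] args) {
        Map<String, Object> pageResources = new HashMap<>();
        pageResources.put("Font", "F1");

        Deque<Map<String, Object>> stack = new ArrayDeque<>();
        stack.push(pageResources);

        // A form XObject without its own "Resources" entry.
        Map<String, Object> form = new HashMap<>();
        System.out.println(resourcesForForm(form, stack) == pageResources);
    }
}
```

With a guard of this shape, the operator that selects a font would see the enclosing resources rather than null. Whether that is the right behavior for OpenPDF is a question for the maintainers.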

The error seems similar to #650

Please let me know if you need further information to help troubleshoot this. Thanks in advance!

andreasrosdal commented 3 months ago

Hello, can you please share a PDF file where this problem occurs? This will make it easier to make a fix.

The issue you are encountering is caused by the resources dictionary sometimes being null. This typically happens when the page does not contain a resources dictionary directly. In that case, the resources dictionary might be inherited from an ancestor in the page tree (for example, from a "Pages" dictionary).
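As a rough illustration of the inheritance lookup described above, the sketch below walks up a chain of parent nodes until it finds a resources dictionary. It is a simplified model, not OpenPDF code: dictionaries are plain `Map<String, Object>`, and `InheritedResources`, `findResources`, and `MAX_DEPTH` are hypothetical names. In OpenPDF the equivalent lookups would presumably use `PdfDictionary.getAsDict` with `PdfName.RESOURCES` and `PdfName.PARENT`.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model, NOT the OpenPDF API: a PDF dictionary is a Map<String, Object>.
final class InheritedResources {

    // Hypothetical guard against cyclic "Parent" references in a malformed file.
    private static final int MAX_DEPTH = 64;

    /**
     * Returns the nearest "Resources" dictionary found on the node itself or on
     * one of its ancestors (followed via "Parent"), or null if none exists.
     */
    static Map<String, Object> findResources(Map<String, Object> node) {
        Map<String, Object> current = node;
        for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
            Object resources = current.get("Resources");
            if (resources instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> typed = (Map<String, Object>) resources;
                return typed;
            }
            Object parent = current.get("Parent");
            if (parent instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> typedParent = (Map<String, Object>) parent;
                current = typedParent;
            } else {
                current = null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Object> resources = new HashMap<>();
        Map<String, Object> pages = new HashMap<>();
        pages.put("Resources", resources);

        // The page has no "Resources" of its own, only a "Parent" link.
        Map<String, Object> page = new HashMap<>();
        page.put("Parent", pages);

        System.out.println(findResources(page) == resources);
    }
}
```

The depth limit is a defensive choice for this sketch: files from third parties can be malformed, and an unbounded loop over "Parent" links would hang on a cycle.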

Pull requests welcome!

gtoison commented 3 months ago

Thank you for the answer. Unfortunately, the document contains confidential information, so I can't share it here. I tried making a fix based on your suggestion to look for a "Pages" dictionary, but ran into the problem that Eclipse won't open the project because the Maven module "openpdf" has the same name as the project "OpenPDF". I don't have good connectivity where I am now; I'll try with IntelliJ.

gtoison commented 3 months ago

It does not seem to crash with this change: https://github.com/LibrePDF/OpenPDF/commit/6b515217fd8884ece3e1c2b975730629662265d6 It might be a misguided fix, though, because I don't quite know what the code is supposed to do :)