danfickle / openhtmltopdf

An HTML to PDF library for the JVM. Based on Flying Saucer and Apache PDF-BOX 2. With SVG image support. Now also with accessible PDF support (WCAG, Section 508, PDF/UA)!
https://danfickle.github.io/pdf-templates/index.html

Potential bug - NPE in COSArray.add with big HTML document in PDF A_1_A mode and Arial font #917

Open CedricMtta opened 1 year ago

CedricMtta commented 1 year ago

Hello,

First of all, thanks for the work performed on this lib, it's really appreciated.

I work on a project that uses openhtmltopdf to generate PDF A_1_A compliant documents from quite big HTML files. In some cases we ran into the following stack trace:

```
java.lang.NullPointerException: Cannot invoke "org.apache.pdfbox.pdmodel.common.COSObjectable.getCOSObject()" because "object" is null
    at org.apache.pdfbox.cos.COSArray.add(COSArray.java:62)
    at com.openhtmltopdf.pdfboxout.PdfBoxAccessibilityHelper.finishNumberTree(PdfBoxAccessibilityHelper.java:858)
    at com.openhtmltopdf.pdfboxout.PdfBoxFastOutputDevice.finish(PdfBoxFastOutputDevice.java:908)
    at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.writePDFFast(PdfBoxRenderer.java:674)
    at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.createPdfFast(PdfBoxRenderer.java:564)
    at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.createPDF(PdfBoxRenderer.java:490)
    at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.createPDF(PdfBoxRenderer.java:427)
    at com.openhtmltopdf.pdfboxout.PdfBoxRenderer.createPDF(PdfBoxRenderer.java:409)
    at com.openhtmltopdf.pdfboxout.PdfRendererBuilder.run(PdfRendererBuilder.java:46)
    at com.openhtmltopdf.pdfa.testing.PdfATester.run(PdfATester.java:95)
    at com.openhtmltopdf.pdfa.testing.PdfATester.cannotRenderInArial(PdfATester.java:153)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
    at com.intellij.rt.junit.IdeaTestRunner$Repeater$1.execute(IdeaTestRunner.java:38)
    at com.intellij.rt.execution.junit.TestsRepeater.repeat(TestsRepeater.java:11)
    at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:35)
    at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:235)
    at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:54)
```

It happens only when the three conditions from the title are combined: a big HTML document, the A_1_A mode, and the Arial font.

If I disable the A_1_A mode, rendering works with Arial. If I use Karla instead of Arial, rendering works in A_1_A mode.

I tried debugging to understand what leads to this NPE, but I have not figured it out so far.

I forked your repo from the 1.0.10 release and created a branch to reproduce the issue. You can find it here: https://github.com/CedricMtta/openhtmltopdf/blob/cedricmtta-error-with-big-html-and-some-fonts-reproduction/openhtmltopdf-pdfa-testing/src/test/java/com/openhtmltopdf/pdfa/testing/PdfATester.java

You can run PdfATester#canRenderInArial to see a quite big HTML document being rendered successfully in Arial with A_1_A compliance. You can run PdfATester#cannotRenderInArial to see the failure described above.

I don't understand how it can work in Karla but not in Arial. The main difference I see between those two fonts is that Arial supports nearly any language, while Karla supports only Western languages.

I wonder whether I'm doing something wrong in the config or if the source HTML is broken, or if it's an issue in openhtmltopdf.

I'd appreciate any help, thanks a lot in advance :)

EDIT: It works with the Open Sans regular font, which can be found here: https://www.fontsquirrel.com/fonts/open-sans This font supports non-Western characters, as Arial does. I still cannot understand why rendering with Arial throws an NPE :(

CedricMtta commented 1 year ago

We kept digging and found that, with the Arial font and the HTML document named "cannot-render-in-arial.html", some GenericContentItem objects were created by the method "ensureAncestorTree". However, this method did not correctly set the relationship between the most recently created structural element and the existing ancestor. The fix can be found here => https://github.com/CedricMtta/openhtmltopdf/commit/4f62a27df439393c8224ac45aa25efbabae09e85#diff-bc184dd44f14edb67050aaf0c0fc0ddd3cbebeb261b77ee1450e2f23b39d5cedR954
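To make the kind of bug described above concrete, here is a minimal, self-contained sketch. It is not the openhtmltopdf code: `Box`, `StructElem`, `reachesRoot`, and the `linkToExisting` flag are hypothetical names invented for illustration, and only the method name `ensureAncestorTree` is borrowed from the discussion. It assumes a simplified model in which missing ancestors are created on demand and the bug consists of never attaching the topmost newly created element to the ancestor that already existed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a structure tree; not the library's real classes. */
final class AncestorTreeSketch {

    /** A layout box whose parent chain is always known. */
    static final class Box {
        final String name;
        final Box parent;
        StructElem elem; // null until a structure element exists for this box

        Box(String name, Box parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    /** A structure element in the (simplified) tagged-PDF tree. */
    static final class StructElem {
        final String name;
        StructElem parent;
        final List<StructElem> children = new ArrayList<>();

        StructElem(String name) {
            this.name = name;
        }
    }

    /**
     * Returns the structure element for the box, creating missing ancestors.
     * With linkToExisting == false, the topmost newly created element is never
     * attached to the ancestor that already existed (the assumed bug).
     */
    static StructElem ensureAncestorTree(Box box, boolean linkToExisting) {
        if (box.elem != null) {
            return box.elem;
        }
        StructElem created = new StructElem(box.name);
        box.elem = created;
        if (box.parent != null) {
            boolean parentExisted = box.parent.elem != null;
            StructElem parent = ensureAncestorTree(box.parent, linkToExisting);
            if (!parentExisted || linkToExisting) {
                created.parent = parent;
                parent.children.add(created);
            }
        }
        return created;
    }

    /** Walks parent links upward and reports whether root is reached. */
    static boolean reachesRoot(StructElem elem, StructElem root) {
        for (StructElem e = elem; e != null; e = e.parent) {
            if (e == root) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Box rootBox = new Box("root", null);
        StructElem root = ensureAncestorTree(rootBox, true);

        Box section = new Box("section", rootBox);
        Box para = new Box("para", section);
        StructElem orphaned = ensureAncestorTree(para, false);
        System.out.println("buggy variant reaches root: " + reachesRoot(orphaned, root));

        Box section2 = new Box("section2", rootBox);
        Box para2 = new Box("para2", section2);
        StructElem linked = ensureAncestorTree(para2, true);
        System.out.println("fixed variant reaches root: " + reachesRoot(linked, root));
    }
}
```

In this model, an element created by the buggy variant sits in a subtree that is detached from the root, so any later pass that walks the tree from the root never sees it. Whether that is exactly how the null entry ends up in `finishNumberTree` is an assumption; the linked commit is the authoritative description of the fix.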

(Thanks to @nithril for the help in finding this:))

I assume that a working PDF rendering never triggers this function; otherwise the NPE would have occurred.

It's still not clear to us why using Arial generates some null AccessilityObject that are then fixed by ensureAncestorTree.

@danfickle we'd be happy to have your opinion on this :)