MatthiasValvekens / pyHanko

pyHanko: sign and stamp PDF files

MIT License

PDFs where Root -> AcroForm is a broken reference (resolves to a NullObject) fail to parse #261

Closed by peteris-zealid 1 year ago

peteris-zealid commented 1 year ago

Consider this Stack Overflow question: https://stackoverflow.com/questions/22909979/itextsharp-acrofields-are-empty

There are PDFs created with iText whose root object looks like this:

<</Type /Catalog /Pages 2 0 R /AcroForm 123456 0 R >>

where the object with id 123456 does not exist.
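To make the failure mode concrete, here is a minimal sketch of how a dangling reference ends up as a null object. It models the file with plain Python dictionaries; `Ref`, `PdfNull`, `objects` and `resolve` are hypothetical names for illustration only and are not pyHanko's classes or API.

```python
from dataclasses import dataclass


class PdfNull:
    """Hypothetical stand-in for a PDF null object (not pyHanko's class)."""

    def __repr__(self):
        return "null"


NULL = PdfNull()


@dataclass(frozen=True)
class Ref:
    """Hypothetical indirect reference: object number and generation."""
    num: int
    gen: int


# Toy object table mirroring the catalog above; object 123456 is absent.
objects = {
    (1, 0): {"/Type": "/Catalog", "/Pages": Ref(2, 0), "/AcroForm": Ref(123456, 0)},
    (2, 0): {"/Type": "/Pages", "/Kids": [], "/Count": 0},
}


def resolve(value):
    """Follow an indirect reference; an undefined target yields null."""
    if isinstance(value, Ref):
        return objects.get((value.num, value.gen), NULL)
    return value


root = objects[(1, 0)]
form = resolve(root["/AcroForm"])  # the key exists, but the target does not

try:
    form["/Fields"]
except TypeError as exc:
    error = exc  # the null stand-in cannot be subscripted
```

The point of the sketch is that the `/AcroForm` key is present, so no `KeyError` is raised at the first lookup; the problem only shows up when the resolved null is subscripted.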

In this case pyHanko raises an exception instead of creating the missing object in an incremental update.

These are the lines in question.

    try:
        form = root['/AcroForm']

        try:
            fields = form['/Fields']  # TypeError: NullType is not subscriptable
        except KeyError:
            raise PdfError('/AcroForm has no /Fields')

MatthiasValvekens commented 1 year ago

Interesting. I think I'm OK with calling this a bug. The spec says that missing references are to be processed the same way as nulls, and that nulls in dictionaries are to be processed the same way as missing elements, so at least in nonstrict mode this should work. Will take a look soon.
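As a rough illustration of that reading of the spec (a sketch only, not the actual pyHanko fix), a lookup helper can report a null value the same way as an absent key, so that the existing "create the form" path handles both cases. `PdfNull`, `get_present` and `find_or_create_form` are hypothetical names, and plain dictionaries stand in for PDF dictionary objects.

```python
class PdfNull:
    """Hypothetical stand-in for a PDF null object (not pyHanko's class)."""


NULL = PdfNull()


def get_present(dictionary, key):
    """Return dictionary[key], treating a null value like a missing key."""
    value = dictionary[key]  # raises KeyError if the key is absent
    if value is NULL:
        raise KeyError(key)  # null entry: behave as if the key were absent
    return value


def find_or_create_form(root):
    """Return the existing form dictionary, or create an empty one."""
    try:
        return get_present(root, "/AcroForm")
    except KeyError:
        # Covers both a missing /AcroForm entry and one that resolved to null.
        form = {"/Fields": []}
        root["/AcroForm"] = form
        return form


broken_root = {"/Type": "/Catalog", "/AcroForm": NULL}
created = find_or_create_form(broken_root)
```

With this approach, callers that already catch `KeyError` need no extra branch for null values. A strict mode could still choose to reject such files; that decision is left out of the sketch.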

In the meantime, can you share a sample file or at least share the iText version that last touched your file? If it's a recent one, I'll see if I can poke one of my former colleagues to get it fixed in a future release. ;)

(Having said that, the recent iText 8 release apparently refactors a bunch of form handling code, so there's a decent chance that the underlying bug got fixed en passant.)

peteris-zealid commented 1 year ago

Sadly, I cannot share the PDF. There is a good chance they have already fixed it, because I think this particular PDF was created with iText 5.