gettalong / hexapdf

Versatile PDF creation and manipulation for Ruby
https://hexapdf.gettalong.org

Opening and saving a document loses some check_box-fields #212

Closed andi-dev closed 1 year ago

andi-dev commented 1 year ago

Hi Thomas,

okay, I have an odd one this time. I will send you the corresponding pdf in a moment via email.

When I open the file with hexapdf and write it again, a ticked checkbox disappears, and apparently the form-field / widget as well, as the check_box is no longer clickable.

I don't understand what's wrong, but here are a couple of things I noticed:

  1. When I run acro_form.validate and output the messages, "Invalid object in AcroForm field hierarchy" is reported twelve times. If I interpret the code correctly, this error is auto-corrected by simply skipping those "fields".

The "fields" (returned by root_fields in this case) are nil values, but I don't really understand how these values end up in self[:Fields].

The method find_root_fields! doesn't seem to be called automatically. If I call it manually, the value of :Fields / root_fields changes: instead of references it directly contains annotations, similar to when I call self[:Fields].to_a - only it no longer contains nil-values.

(However, simply calling find_root_fields! before writing out the document doesn't solve the issue.)

  2. Opening the file in Acrobat, editing anything and saving it again seems to repair the file. Afterwards, opening and writing the file with hexapdf no longer causes the checkbox field to disappear.
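The validation messages from the first point can be tallied with a small helper. This is only a sketch: `validation_report` is a hypothetical name, and it assumes the object passed in offers HexaPDF-style validation, i.e. a `validate(auto_correct:)` method that yields a message, a "correctable" flag and (in newer versions) the affected object.

```ruby
# Hypothetical helper: counts how often each validation message occurs.
# Assumes pdf_object.validate(auto_correct:) yields (message, correctable, object).
def validation_report(pdf_object)
  counts = Hash.new(0)
  # auto_correct: false so that problems are only reported, not silently fixed.
  pdf_object.validate(auto_correct: false) do |message, correctable, _object|
    counts[[message, correctable]] += 1
  end
  counts
end

# Possible usage with HexaPDF loaded (file name taken from the thread):
#   require 'hexapdf'
#   doc = HexaPDF::Document.open('test3.pdf')
#   validation_report(doc.acro_form).each do |(message, correctable), count|
#     puts "#{count}x #{message} (correctable: #{correctable})"
#   end
```

Grouping by message makes it easy to see whether one kind of problem, such as the invalid field hierarchy entries, accounts for all of the reports.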

I am very curious if you have any idea what's happening :)

gettalong commented 1 year ago

Hmm... I concur, that file is rather strange.

As for your comments:

gettalong commented 1 year ago

Okay, so running hexapdf info --check test3.pdf shows some problems. Looking e.g. at file position 477531, you can see 0 0 R as the value of a dictionary key. Regarding indirect object references, the PDF spec says in section 7.3.10: "The object identifier shall consist of two parts: A positive integer object number. ...", so a reference with object number 0 is clearly invalid.
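To find such positions without a PDF library, the raw bytes can be searched for the token sequence 0 0 R. The following is a rough heuristic sketch, not what hexapdf info --check does internally; `zero_reference_offsets` is a hypothetical name, and the scan only sees uncompressed parts of the file (references inside object streams stay hidden).

```ruby
# Hypothetical helper: returns the byte offsets of every "0 0 R" token.
# Heuristic only: works on raw bytes and does not parse the PDF structure.
def zero_reference_offsets(data)
  bytes = data.b # treat the input as binary so offsets are byte offsets
  # The lookbehind rejects matches inside larger numbers such as "10 0 R";
  # the lookahead rejects operators or names that merely start with "R".
  pattern = /(?<![0-9])0\s+0\s+R(?![A-Za-z0-9])/
  offsets = []
  bytes.scan(pattern) { offsets << Regexp.last_match.begin(0) }
  offsets
end

# Possible usage (file name taken from the thread):
#   puts zero_reference_offsets(File.binread('test3.pdf'))
```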

There was a recent change where handling of invalid references was corrected. So those errors might not show up in older versions of HexaPDF or different errors might show up.

I inspected a few other error positions and they all show the same problem with 0 0 R.

So this is clearly invalid, but in this case it also leads to some object not being parsed at all, i.e. to a much bigger problem. I think the best way forward would be to treat references with an object number of 0 as null values. I tested this out and the fields don't disappear anymore.
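The idea of treating such references as null can be illustrated in isolation. This is a simplified sketch and not HexaPDF's actual implementation: the `Reference` struct, the `resolve` method and the hash-based object table are all hypothetical stand-ins.

```ruby
# Hypothetical stand-in for a parsed indirect reference "<oid> <gen> R".
Reference = Struct.new(:oid, :gen)

# Resolves a parsed value against an object table keyed by [oid, gen].
def resolve(value, objects)
  return value unless value.is_a?(Reference) # direct objects pass through
  # Object numbers must be positive, so "0 0 R" is mapped to PDF null (nil)
  # instead of aborting the parse of the surrounding object.
  return nil if value.oid.zero?
  objects[[value.oid, value.gen]]
end

# Example: a field array containing one invalid reference.
objects = { [5, 0] => { T: "checkbox1" } }
fields = [Reference.new(5, 0), Reference.new(0, 0)]
resolved = fields.map { |ref| resolve(ref, objects) }.compact
```

With this approach the surrounding dictionary or array is still parsed in full, and the invalid entries simply end up as null values that later processing can skip.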

gettalong commented 1 year ago

Btw. the second file you sent exhibits the same problem with 0 0 R but only in one location, not in multiple like with test3.pdf.

gettalong commented 1 year ago

The change fixing the problem is live on the devel branch.

As for the problem with filled out text not showing up: I think this is related to the fonts that are used in some of the form fields because they are subset fonts, i.e. not containing all characters or all the mappings needed to actually create a visual representation of some Unicode text. Some of those form fields, e.g. the three on the right side of "Änderungsstichtag ab", are actually filled out but nothing is shown in Okular and Evince.

If you add doc.acro_form.create_appearances(force: true), all field appearances are recreated, and for subset fonts that are not supported by HexaPDF the fallback fonts are used. Then the text shows up (making changes in Okular is still not possible because it uses the subset font for this).
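Put together, the workaround could look like the sketch below. `regenerate_appearances` is a hypothetical helper name; the only calls it relies on are `acro_form`, `create_appearances(force: true)` and `write`, and it accepts any document-like object that provides them.

```ruby
# Hypothetical helper: recreates all form field appearances and writes the
# document to output_path. Returns false if the document has no AcroForm.
def regenerate_appearances(doc, output_path)
  form = doc.acro_form
  return false unless form
  # force: true rebuilds appearances even for fields that already have one.
  form.create_appearances(force: true)
  doc.write(output_path)
  true
end

# Possible usage with HexaPDF loaded (file names are placeholders):
#   require 'hexapdf'
#   doc = HexaPDF::Document.open('input.pdf')
#   regenerate_appearances(doc, 'output.pdf')
```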

So this is nothing that HexaPDF itself can really fix.

andi-dev commented 1 year ago

Nice, I can confirm the issue is fixed :)