Can't set value for ChoiceList that has an array of array as option_items

brettwgreen commented 1 year ago

I have an AcroForm PDF with a List Box that has values as an array of arrays... for example, state List Box has

field.option_items
=> 
[["Alabama", "AL"],                                
 ["Alaska", "AK"],                                 
 ["Arizona", "AZ"],                                
 ["Arkansas", "AR"],                               
 ["California", "CA"],                             
 ["Colorado", "CO"],                               
 ["Connecticut", "CT"]
...

No matter how I try and set the value for this, I get a validation error:

field.field_value = 'AL'
field.field_value = 'Alabama'
field.field_value = ['Alabama', 'AL']
# All give me some version of 'Invalid value "AL" for combo_box field'

This seems like a perfectly valid way of defining an AcroForm field... this is a government document I'm working with. When set up this way, the first value in each inner array is the stored value, while the second entry is the display value.

Even if I try and bypass the validation and try field.value[:V] = 'Alabama', it does not seem to save it when I write the file to disk.

It seems that the code around field_value= in ChoiceField cannot handle an array of arrays, unless I'm missing something. I'm happy to open a pull request if you can confirm the problem.
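
To make the option layout described above concrete, here is a minimal sketch in plain Ruby. It assumes each raw option entry is either a single string or a two-element [stored, display] array, as in the list shown earlier. The helper name split_option is hypothetical and is not part of HexaPDF.

```ruby
# Hypothetical helper, not part of HexaPDF: splits one raw option entry into
# its stored (export) value and its display value. A plain string entry is
# assumed to serve as both.
def split_option(entry)
  if entry.is_a?(Array)
    { export: entry[0], display: entry[1] }
  else
    { export: entry, display: entry }
  end
end

# Sample data taken from the option list above, plus one plain-string entry.
raw_options = [["Alabama", "AL"], ["Alaska", "AK"], " "]
options = raw_options.map { |entry| split_option(entry) }

display_values = options.map { |o| o[:display] }
export_values = options.map { |o| o[:export] }
```

With this split, a value such as 'AL' would be checked against display_values rather than against the raw array of arrays.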

gettalong commented 1 year ago

Could you provide the PDF in question so that I can have a look at it?

ChoiceField#option_items should only return an array of the display strings, i.e. ['AL', 'AK', 'AZ' ...]. I'm not sure why it returns an array of arrays. Since this method is also used when setting a field value using #field_value=, it errors out.

brettwgreen commented 1 year ago

https://stjececmsdusgva001.blob.core.usgovcloudapi.net/public/documents/CLJA_Claims_Form.pdf

gettalong commented 1 year ago

Thanks for the file!

I have tried the following script and it works:

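require 'hexapdf'

# Open the PDF passed as the first command-line argument
doc = HexaPDF::Document.open(ARGV[0])
field = doc.acro_form.field_by_name('Claimant_State')
p field.option_items # prints the display strings of the choice field
field.field_value = 'OR'
# Write the change as an incremental update
doc.write('/tmp/out.pdf', incremental: true, optimize: true)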

The output:

["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY", "CZ-Canal Zone", "GU-Guam", "VI-Virgin Islands", " "]

So I'm not sure what is happening in your case. Could you share some code that produces the problem?

brettwgreen commented 1 year ago

Hmmm... that's odd.

I'm on a Mac. Would that matter? I also have a full Adobe Acrobat install. Could that, or the presence of other command-line tools like pdftk or pdfinfo, have an impact?

I think my code more or less looks exactly like what you have, but I'll take another look tomorrow.

brettwgreen commented 1 year ago

Wrapped your code in a little function and ran it exactly:

require 'hexapdf'
def test_pdf(path)
  doc = HexaPDF::Document.open(path)
  field = doc.acro_form.field_by_name('Claimant_State')
  p field.option_items
  field.field_value = 'OR'
  doc.write('out.pdf', incremental: true, optimize: true)
end

Then, in irb:

2.7.7 :018 > test_pdf(path)
[["Alabama", "AL"], ["Alaska", "AK"], ["Arizona", "AZ"], ["Arkansas", "AR"], ["California", "CA"], ["Colorado", "CO"], ["Connecticut", "CT"], ["Delaware", "DE"], ["District of Columbia", "DC"], ["Florida", "FL"], ["Georgia", "GA"], ["Idaho", "HI"], ["Illinois", "ID"], ["Illinois", "IL"], ["Iowa", "IN"], ["Iowa", "IA"], ["Kansas", "KS"], ["Kentucky", "KY"], ["Louisiana", "LA"], ["Maine", "ME"], ["Maryland", "MD"], ["Massachusetts", "MA"], ["Michigan", "MI"], ["Minnesota", "MN"], ["Mississippi", "MS"], ["Missouri", "MO"], ["Montana", "MT"], ["Nebraska", "NE"], ["Nevada", "NV"], ["New Hampshire", "NH"], ["New Jersey", "NJ"], ["New Mexico", "NM"], ["New York", "NY"], ["North Carolina", "NC"], ["North Dakota", "ND"], ["Ohio", "OH"], ["Oklahoma", "OK"], ["Oregon", "OR"], ["Rhode Island", "PA"], ["Rhode Island", "RI"], ["South Carolina", "SC"], ["South Dakota", "SD"], ["Tennessee", "TN"], ["Texas", "TX"], ["Utah", "UT"], ["Vermont", "VT"], ["Virginia", "VA"], ["Washington", "WA"], ["West Virginia", "WV"], ["Wisconsin", "WI"], ["Wyoming", "WY"], ["Guam", "CZ-Canal Zone"], ["Guam", "GU-Guam"], ["Virgin Islands", "VI-Virgin Islands"], " "]
Traceback (most recent call last):
        7: from /Users/brett/.rvm/rubies/ruby-2.7.7/bin/irb:23:in `<main>'
        6: from /Users/brett/.rvm/rubies/ruby-2.7.7/bin/irb:23:in `load'
        5: from /Users/brett/.rvm/rubies/ruby-2.7.7/lib/ruby/gems/2.7.0/gems/irb-1.2.6/exe/irb:11:in `<top (required)>'
        4: from (irb):18
        3: from (irb):15:in `test_pdf'
        2: from /Users/brett/.rvm/gems/ruby-2.7.7/gems/hexapdf-0.12.3/lib/hexapdf/type/acro_form/choice_field.rb:134:in `field_value='
        1: from /Users/brett/.rvm/gems/ruby-2.7.7/gems/hexapdf-0.12.3/lib/hexapdf/configuration.rb:353:in `block in <module:HexaPDF>'
HexaPDF::Error (Invalid value "OR" for combo_box field Claimant_State)

Is this happening in the initial parse of the PDF? Or when I call doc.acro_form? I'm having a hard time navigating those parts of the code to see why it's being parsed differently on my side. I thought I could just patch up option_items to allow for an array of arrays, but the problem seems to be further upstream.

Update: On a lark, tried ruby 3.2... same result.

brettwgreen commented 1 year ago

Alright... turns out my require 'hexapdf' was loading an older version. I ran gem install to get the latest globally, and now I only see the 'display' items in the array, although I now get a different error when trying to write out the file:

ruby-3.2.2/gems/hexapdf-0.32.2/lib/hexapdf/document.rb:677:in `block in write': Validation error for (14,0): Invalid size for /U, /O, /UE, /OE or /Perms values for revisions 6 (HexaPDF::Error)

That's probably a separate problem entirely, so we can close this issue.

Update: I was able to get around that by using doc.write('out.pdf', validate: false)... I was able to set the state value, and it looks good when the PDF is opened in Acrobat. There is some issue with encryption, but it is almost certainly unrelated.

Update 2: Debugging that encryption error showed that value[:U].length and value[:O].length are both 127 for this file. The encryption validation code seems to have a hard requirement that they be 48.
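
For anyone who hits the same stale-version problem mentioned at the start of this comment, here is a rough sketch of how to check which copies of a gem are installed and which one was actually loaded. It only uses RubyGems calls (Gem::Specification.find_all_by_name, Gem.loaded_specs, Gem::Version). The helper names are made up for illustration, and loaded_version returns nil when the library was not activated through RubyGems.

```ruby
require 'rubygems'

# Hypothetical helper: version strings of all installed copies of a gem,
# newest first. Returns an empty array if the gem is not installed.
def installed_versions(gem_name)
  Gem::Specification.find_all_by_name(gem_name)
                    .map(&:version).sort.reverse.map(&:to_s)
end

# Hypothetical helper: version of the gem activated in this process,
# or nil if it has not been loaded via RubyGems.
def loaded_version(gem_name)
  spec = Gem.loaded_specs[gem_name]
  spec && spec.version.to_s
end

# Hypothetical helper: true if version string a is older than version string b.
def older?(a, b)
  Gem::Version.new(a) < Gem::Version.new(b)
end

puts installed_versions('hexapdf').inspect
puts loaded_version('hexapdf').inspect
```

Calling loaded_version('hexapdf') right after require 'hexapdf' in irb should show whether an old copy is being picked up.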

gettalong commented 1 year ago

Thanks for your investigation!

Yes, when using 256-bit AES encryption the /O and /U entries need to be 48 bytes long (see table 21 in section 7.6.4.2 of the PDF 2.0 spec), while in the file they are longer. I don't know why they are 127 bytes long. However, there is no real information in these superfluous bytes because they are all zero.

So I think it would be possible to do auto-correction by just truncating the /O and /U fields to their correct size iff the invalid bytes are only zeros.
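
The truncation idea above could look roughly like the following. This is only a sketch of the rule "truncate if and only if the extra bytes are all zero", written in plain Ruby. It is not the code that ended up in HexaPDF, and the method and constant names are assumptions made for illustration.

```ruby
# Required length of /O and /U for 256-bit AES, as stated above.
EXPECTED_LENGTH = 48

# Hypothetical helper: returns the value cut down to the expected length when
# all superfluous bytes are zero. Otherwise the value is returned unchanged,
# so that a real validation error can still be reported.
def truncate_if_zero_padded(value, expected = EXPECTED_LENGTH)
  return value if value.bytesize <= expected

  extra = value.byteslice(expected, value.bytesize - expected)
  extra.each_byte.all?(&:zero?) ? value.byteslice(0, expected) : value
end

# A 127-byte value whose last 79 bytes are zero, like the one in this file.
padded = ("\x01".b * 48) + ("\x00".b * 79)
fixed = truncate_if_zero_padded(padded)
```

A value with non-zero bytes after position 48 is left alone, since cutting it would silently discard data.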

brettwgreen commented 1 year ago

Thanks so much for your help. I will close the issue.

gettalong commented 1 year ago

@brettwgreen FYI, I have implemented the auto-correction for the /O and /U fields; it will be available in the next version.

gettalong / hexapdf

Can't set value for ChoiceList that has an array of array as option_items #252