gettalong / hexapdf

Versatile PDF creation and manipulation for Ruby
https://hexapdf.gettalong.org

Merging removes access to AcroForm #253

Closed thomasbaustert closed 1 year ago

thomasbaustert commented 1 year ago

The company I work for has a license and we are currently integrating HexaPDF. I basically use the following code to merge PDF documents:

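require 'hexapdf'
require 'stringio'

def call(pdf_documents)
  target = HexaPDF::Document.new

  # Import every page of every source document into the new, empty target.
  pdf_documents.each do |pdf_document|
    pdf = HexaPDF::Document.new(io: StringIO.new(pdf_document))
    pdf.pages.each { |page| target.pages << target.import(page) }
  end

  # Serialize the merged document and return it as a string.
  output = StringIO.new
  target.write(output)
  output.string
end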

The first document contains an AcroForm. I would like to read the value of fields from the resulting PDF to check if the PDF is filled correctly. Basically:

doc = HexaPDF::Document.new(io: StringIO.new(content))
doc.acro_form.field_by_name(name)

But I got "NoMethodError: undefined method `field_by_name' for nil:NilClass". It looks like the AcroForm is no longer accessible.

When I open the PDF in "Wondershare PDFelement Pro", the fields still exist. Without merging, the form in the resulting PDF is accessible.

Is it possible to keep the "access" to the AcroForm? Or am I doing something wrong?

Unfortunately I cannot provide the PDF because it contains sensitive data. I can try to get one for testing if needed.

Thanks!

thomasbaustert commented 1 year ago

Hm, it also looks like merging changes the font of the form fields from Arial to Calibri!?

gettalong commented 1 year ago

The code only imports the pages from the pdf_documents into a new and empty PDF document. AcroForm fields are defined as widget annotations, so they are available via a page's /Annots entry and are therefore imported. However, the AcroForm itself is stored in the document catalog, which you don't merge.

So what I would do in your case is use the first document with the AcroForm as the base document and import the pages of the other documents into that base document. Then the AcroForm will still be there with all the necessary information.

As for font changes: Since the main AcroForm object is missing, text fields that rely on the main AcroForm object for font information won't work correctly. So once the main AcroForm object is available, the fonts should be fine.

thomasbaustert commented 1 year ago

I merged the documents as recommended and it works. Thanks for the quick feedback and clarification!

Let's assume we have [a, f, b] with f as the PDF/page containing the AcroForm. Can I merge the document catalog of f too? Or do I have to merge it as [f, a, b]? Thanks!

gettalong commented 1 year ago

If you want to do it the easy way, you would need to do [f, a, b]. Otherwise, you would need to check whether there is an existing AcroForm object already from a and manually merge the information from f into it, making sure to create unique field names to avoid problems.

thomasbaustert commented 1 year ago

I just hope case [a, f, b] does not happen :). Thanks.

gettalong commented 1 year ago

You can easily check via doc.acro_form - if it returns nil, then there is no AcroForm in the document.

gettalong commented 1 year ago

@thomasbaustert I think this issue is solved. If not, please let me know.