boazsegev / combine_pdf

A Pure ruby library to merge PDF files, number pages and maybe more...
MIT License

Dynamic XFA flatten? #182

Closed leviwilson closed 3 years ago

leviwilson commented 3 years ago

Is there a way to "flatten" a PDF that has dynamic XFA form data? For our use case, we split the PDF into multiple pages so we can do some downstream processing. At this point, we do not need to manipulate any form data so we just want to "flatten" the PDF with its values and save it as a normal PDF (not a dynamic xfa document). This way, our application can render the PDF in pdfjs.

Is there any way we can do this with combine_pdf? Apologies if the question is unclear as I'm not 💯 familiar with the PDF formats.

In addition to this, is there a way to detect if a file is a dynamic XFA form? It looked like in the catalogs there was an :AcroForm key, but the catalog doesn't look like it's publicly exposed and wasn't sure how I could reliably determine if it was one of these types of files.

leviwilson commented 3 years ago

Re: my 2nd question about detecting whether a file is an XFA form, I was curious if something like this is sufficient:

pdf = CombinePDF.load(path)
!pdf.send(:get_existing_catalogs).dig(0, :AcroForm, :referenced_object, :XFA).nil? # => true if :XFA has a value

boazsegev commented 3 years ago

Hi @leviwilson ,

Thank you for your question. I am sorry to say I don't have good news about form flattening / baking.

CombinePDF attempts to minimize the data it actually needs to parse. For this reason, PDF streams (the contents of the pages) are rarely - if ever - parsed. The closest CombinePDF comes to touching pre-existing content is by renaming the metadata to avoid name collisions when combining PDF files.

For this reason, at the moment, it is impossible to "flatten" PDF forms and make them properly immutable.

As for detecting a form: rebuilding the catalog object is okay but suboptimal. The catalog is built using the private @forms_data variable, which you could probably test for using the read-only attribute forms_data (see documentation here).
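A minimal sketch of that check, assuming forms_data returns the AcroForm dictionary as a Hash (or nil when the file has no form) and that an :XFA entry, when present, sits either directly in that Hash or under a :referenced_object wrapper. The helper name xfa_form? is hypothetical and not part of CombinePDF, and the exact shape of forms_data is worth verifying against a real file:

```ruby
# Hypothetical helper, not part of CombinePDF.
# Returns true when the given AcroForm dictionary carries an :XFA entry.
def xfa_form?(forms_data)
  return false unless forms_data.is_a?(Hash)

  # Assumption: the dictionary may be wrapped in an indirect reference,
  # as in the :referenced_object lookup used earlier in this thread.
  dict = forms_data[:referenced_object] || forms_data
  return false unless dict.is_a?(Hash)

  !dict[:XFA].nil?
end

# Usage with CombinePDF (requires the combine_pdf gem):
#   require 'combine_pdf'
#   pdf = CombinePDF.load(path)
#   xfa_form?(pdf.forms_data)
```

Note that this only reports whether an :XFA entry exists. It does not tell a dynamic XFA form apart from a static one, and it does nothing to flatten the form.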

Good luck! Bo.