galkahana / PDF-Writer

High performance library for creating, modifying and parsing PDF files in C++
http://www.pdfhummus.com
Apache License 2.0
885 stars 210 forks

How to modify existing Annots ? #276

Open devilsclaw opened 3 weeks ago

devilsclaw commented 3 weeks ago

I am trying to figure out how to modify the Annots in a PDF generated with Libre Draw. I have figured out how to parse the file and get the data out of it. Now I would like to modify the content.

Attached is a sample project with a sample PDF whose Annots have been filled in with a PDF reader. The project is able to read and print out all the elements.

I have looked around, but I have not found any simple examples of how to take existing Annots and modify them. I have seen examples of copying Annots and creating new ones, but not of modifying them.

pdf_annots.zip

devilsclaw commented 3 weeks ago

I forgot to mention: the data that needs to be modified is the V key, the one with the values Testing 1 - Testing 51.

devilsclaw commented 3 weeks ago

I am not looking to edit the info in place. I know the style is to copy to a new PDF. But I don't know how to get a copy of the data that can then be modified and put into the new PDF. The modify part is what I am confused about.

galkahana commented 3 weeks ago

well, i can probably arrange for an example if i get to it, but a solution to a problem like that with this lib goes through the following:

the test example ModifyingExistingFileContent.cpp shows a common approach to modifying part of an element (pages mostly): a new version of the element is created, most of its content is just copied, and the elements that you want to change are created anew as per the change you want to make.

If you haven't yet done that, try going through the modification documentation here - https://github.com/galkahana/PDF-Writer/wiki/Modification.

Depending on how the week goes, i'll try to get around to setting up a more comprehensive example of what you're trying to do.

galkahana commented 3 weeks ago

ok. actually it seems like this is just a form and its widget annotations. in this case you can read the examples for either filling a form or locking it. there are quite a few over at hummusjs (the legacy nodejs interface of this library), and the operators are fairly similar. see if it helps you, otherwise we can discuss details.

here's the example: https://github.com/galkahana/HummusJSSamples/blob/master/filling-form-values/pdf-form-fill.js

it's a general purpose example of how to create a modified version of an existing file...and also, in particular, of how to fill form values...in case this sort of thing interests you.

devilsclaw commented 3 weeks ago

Thanks. I will look into it.

devilsclaw commented 3 weeks ago

I looked at the JS script and am currently working to port it to C++ and then modify it to work with my forms if needed. I noticed a bug in it: it will ignore some fields that it is supposed to change. I would post it in that repo, but it looks like it is never checked.

var data = {
    "Given Name Text Box": "Eric",
    "Family Name Text Box": "Jones",
    "House nr Text Box": "someplace",
    "Address 1 Text Box": "somewhere 1",
    "Address 2 Text Box": "somewhere 2",
    "Postcode Text Box": "123456",
    "Country Combo Box": "Spain",
    "Height Formatted Field": "198",
    "Driving License Check Box": true,
    "Favourite Colour List Box": "Brown",
    "Language 1 Check Box": true,
    "Language 2 Check Box": true,
    "Language 3 Check Box": false,
    "Language 4 Check Box": false,
    "Language 5 Check Box": true,
    "Gender List Box": "Man"
};

The ones set to false will not be processed, due to the line below:

if (handles.data[fullName]) {

It should be

if (handles.data[fullName] != undefined) {
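
The same distinction between "a value was supplied" and "the value is truthy" has to survive a port to C++. A minimal sketch, assuming the check box values are kept in a std::map keyed by the full field name; CheckBoxData, shouldProcessFieldBuggy and shouldProcessField are hypothetical names for illustration, not part of PDF-Writer or of the original script:

```cpp
#include <map>
#include <string>

// Hypothetical container: full field name -> requested check box state.
using CheckBoxData = std::map<std::string, bool>;

// Mirrors `if (handles.data[fullName])`: it tests the value itself, so a
// check box explicitly set to false is treated as if nothing was supplied.
bool shouldProcessFieldBuggy(const CheckBoxData& data, const std::string& fullName) {
    auto it = data.find(fullName);
    return it != data.end() && it->second;
}

// Mirrors `if (handles.data[fullName] != undefined)`: it tests only whether
// an entry exists, so false is still written to the form.
bool shouldProcessField(const CheckBoxData& data, const std::string& fullName) {
    return data.find(fullName) != data.end();
}
```

With "Language 3 Check Box" mapped to false, the first function skips the field and the second one processes it, which is the behavior the fix above is after.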

devilsclaw commented 3 weeks ago

So I am trying to figure out the C++ equivalent to this

if(handles.acroformDict.exists('DR')) {
    handles.writer.getEvents().once('OnResourcesWrite',function(args){
        // copy all but the keys that exist already
        var dr = handles.reader.queryDictionaryObject(handles.acroformDict,'DR').toPDFDictionary().toJSObject();
        Object.getOwnPropertyNames(dr).forEach(function(element,index,array) {
            if (element !== 'ProcSet' && (!textOptions || element !== 'Font')) {
                args.pageResourcesDictionaryContext.writeKey(element);
                handles.copyingContext.copyDirectObjectAsIs(dr[element]);
            }
        });
    });
}

I was thinking it might be AddDocumentContextExtender, but that seems to apply to the entire document, whereas the above seems to happen only once for each instance.

devilsclaw commented 3 weeks ago

Not sure but would this work ?

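  if(handles.acroformDict->Exists("DR")) { // maybe
    // copy all but the keys that exist already
    PDFObjectCastPtr<PDFDictionary> dr = handles.reader.QueryDictionaryObject(handles.acroformDict.GetPtr(), "DR");

    MapIterator<PDFNameToPDFObjectMap> it = dr->GetIterator();
    // Raw pointers: the iterator is assumed to hand out pointers that are still
    // owned by the dictionary, so a RefCountPtr here would release them again.
    PDFName* key;
    PDFObject* value;
    // NOTE: the JS version writes into the page resources dictionary passed to
    // the OnResourcesWrite event; StartDictionary() opens a new dictionary instead.
    DictionaryContext* page_out_dic = handles.writer.GetObjectsContext().StartDictionary();
    while(it.MoveNext()) {
      key = it.GetKey();
      value = it.GetValue();
      // skip ProcSet always, and Font only when text options are supplied
      if(key->GetValue() != "ProcSet" && (textOptions == NULL || key->GetValue() != "Font")) {
        page_out_dic->WriteKey(key->GetValue());
        handles.copyingContext->CopyDirectObjectAsIs(value);
      }
    }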
  }

devilsclaw commented 3 weeks ago

So here is the C++ port, which is about 95 to 99% complete. It fills out the test PDF to the same extent as the original JS example does. It also worked on my PDF. So thanks for the pointer.

https://github.com/devilsclaw/pdf_form_fill

https://github.com/devilsclaw/pdf_form_fill/blob/main/pdf_form_fill.h

galkahana commented 3 weeks ago

awesome :) glad it worked out.

devilsclaw commented 2 weeks ago

I also made a pdf_info tool that parses the whole PDF and prints all elements in a human-readable form, other than input streams, which are printed in hex notation since they can store anything. This would have been really helpful for me originally, so I made it available as well.
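
Printing arbitrary stream bytes as hex can be sketched as follows. This is an illustration only, assuming the stream content is already available as a byte vector; toHex is a hypothetical helper and the exact output format used by pdf_info may differ:

```cpp
#include <string>
#include <vector>

// Converts raw bytes to an uppercase hex string, two digits per byte, so that
// binary or unprintable stream content can be shown safely as text.
std::string toHex(const std::vector<unsigned char>& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (unsigned char b : bytes) {
        out.push_back(digits[b >> 4]);   // high nibble
        out.push_back(digits[b & 0x0F]); // low nibble
    }
    return out;
}
```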

NOTE: Indirect objects are also printed at the end of each page, since they can point to each other in an infinite loop. So they are handled differently.

A small clipped sample is below; even a small PDF is pages long if I put the whole parse here.
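
One way to keep mutually referencing indirect objects from looping forever is to queue each reference and remember which object IDs were already handled. A minimal sketch on a toy model, not the actual pdf_info code; ObjectGraph and collectIndirects are hypothetical names, and the graph simply maps an object ID to the IDs it references:

```cpp
#include <deque>
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-in for a parsed PDF: object ID -> IDs it references.
using ObjectGraph = std::map<long, std::vector<long>>;

// Returns every object reachable from the page's references, each exactly
// once, in the order it would be printed. The visited set is what stops a
// cycle such as 35 -> 36 -> 35 from being followed forever.
std::vector<long> collectIndirects(const ObjectGraph& graph,
                                   const std::vector<long>& pageRefs) {
    std::vector<long> printed;
    std::set<long> visited;
    std::deque<long> pending(pageRefs.begin(), pageRefs.end());
    while (!pending.empty()) {
        long id = pending.front();
        pending.pop_front();
        if (!visited.insert(id).second) continue; // already handled
        printed.push_back(id);
        auto it = graph.find(id);
        if (it == graph.end()) continue;          // nothing further to follow
        for (long ref : it->second) pending.push_back(ref);
    }
    return printed;
}
```

Deferring the references to a queue instead of recursing inline also matches the idea of listing indirect objects after the page content rather than nested inside it.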

PDF Header level = 1.400000
Number of objects in PDF = 63
Number of pages in PDF = 1

// Showing info for Page 0 //////////////////////////////////////////////////////////
Showing info for page 0:
ePDFObjectDictionary: 
  key = Annots
    ePDFObjectArray: 
      ePDFObjectDictionary: 
        key = AP
          ePDFObjectDictionary: 
            key = N
              ePDFObjectIndirectObjectReference: value = 38
        key = DA
          ePDFObjectLiteralString: value = 0 0 0 rg /F3 11 Tf
        key = DR
          ePDFObjectDictionary: 
            key = Font
              ePDFObjectIndirectObjectReference: value = 6
        key = DV
          ePDFObjectHexString: value =
        key = F
          ePDFObjectInteger: value = 4
        key = FT
          ePDFObjectName: value = Tx
        key = MaxLen
          ePDFObjectInteger: value = 40
        key = P
          ePDFObjectIndirectObjectReference: value = 1
        key = Rect
          ePDFObjectArray: 
            ePDFObjectReal: value = 165.700000
            ePDFObjectReal: value = 453.700000
            ePDFObjectReal: value = 315.700000
            ePDFObjectReal: value = 467.900000
        key = Subtype
          ePDFObjectName: value = Widget
        key = T
          ePDFObjectLiteralString: value = Given Name Text Box
        key = TU
          ePDFObjectHexString: value = First name
        key = Type
          ePDFObjectName: value = Annot
        key = V
          ePDFObjectHexString: value =

Indirect example

ePDFObjectIndirectObjectReference: Start : value = 35
  ePDFObjectDictionary: 
    key = Ascent
      ePDFObjectInteger: value = 905
    key = CapHeight
      ePDFObjectInteger: value = 1005
    key = Descent
      ePDFObjectInteger: value = -211
    key = Flags
      ePDFObjectInteger: value = 4
    key = FontBBox
      ePDFObjectArray: 
        ePDFObjectInteger: value = -664
        ePDFObjectInteger: value = -324
        ePDFObjectInteger: value = 2000
        ePDFObjectInteger: value = 1006
    key = FontName
      ePDFObjectName: value = ArialMT
    key = ItalicAngle
      ePDFObjectInteger: value = 0
    key = StemV
      ePDFObjectInteger: value = 80
    key = Type
      ePDFObjectName: value = FontDescriptor
ePDFObjectIndirectObjectReference: End   : value = 35

https://github.com/devilsclaw/pdf_info/

https://github.com/devilsclaw/pdf_info/blob/main/pdf_info.cpp