galkahana / HummusJS

Node.js module for high performance creation, modification and parsing of PDF files and streams
http://www.pdfhummus.com

pdf form fills #57

Closed mwaschkowski closed 7 years ago

mwaschkowski commented 8 years ago

Hi,

I've looked through the docs and didn't see anything about this. Is it possible to use HummusJS to fill in a PDF form? If so, is there a call to do so, or would I have to iterate over every form field and then modify it manually?

Thank you!

Mark

galkahana commented 8 years ago

Manual modification is the way to go, I'm afraid. Hummus can help with parsing and writing, but there's no higher-level method. Look for the AcroForm object in the catalog and iterate from there. Let me know if you need help. I haven't done this, but I may be able to help (plus it's time I had an example/module for this... I get questions about this all the time).

Gal

mwaschkowski commented 8 years ago

OK, great, thanks for the quick reply. I'll try it out and let you know how it goes, over the weekend.

Have a great day, and thank you for such a great library!

Mark


fedyunin commented 8 years ago

Hello! I work with Mark on this task of filling PDF fields.

I'm a little stuck figuring out how this should work in general. I reviewed the wiki docs on modification, and the example that modifies a page and adds comments.

So it looks like I can't just take an object and set the necessary value on it (fill it).

I need to get all the field objects (the fields from the AcroForm in the catalog) from the source PDF file, then create the same objects (with the same properties) in the modified PDF file, and set the necessary value on these new objects (fill them). Is this the right direction or not? Let me know if something is not clear and I will try to explain it better.

Thanks!

galkahana commented 8 years ago

Absolutely correct. To modify an object you need to recreate a full version of it, copying what needs to remain the same and changing what is to be changed.

fedyunin commented 8 years ago

Hi!

I retrieved the necessary objects from the AcroForm, namely the Fields array:

var catalog = pdfReader.queryDictionaryObject(pdfReader.getTrailer(), 'Root');
var acroform = pdfReader.queryDictionaryObject(catalog, 'AcroForm');
var fieldsRoot = pdfReader.queryDictionaryObject(acroform, 'Fields');

Then I implemented recreating a full version of the objects that I need to modify (the fields). So now I have a set of object IDs.

As the next step I need to update the AcroForm with these new objects.

Could you help me with this? How do I update/recreate the AcroForm?

Thanks!

galkahana commented 8 years ago

hummfff. Assuming that the acroform is a pointer to a remote object, you can recreate it by creating a modified version of the acroform object. See the example here of how to create a modified version of a page object... you can see from there what's relevant to your case: https://github.com/galkahana/HummusJS/blob/master/tests/ModifyingExistingFileContent.js#L150 [if not, I can point you further]

If it's a direct object... then this calls for modifying the catalog object so that it has a new definition of an acroform object, which again is similar to creating a new modified version.

[i will know best if you send me a sample PDF]

Gal.

fedyunin commented 8 years ago

Hello!

The main issue here is that I want to recreate the Fields array of the AcroForm object, the same way the example does it (the link you provided: modifying a page object to write Annots). But I cannot get access to the AcroForm object to create a modified version of it.

E.g. to modify a page object we can get its ID: copyingContext.getSourceDocumentParser().getPageObjectID(2);

Next, we recreate the page object, copying the objects that don't need to change, then write the modified Annots object, etc.

Working with the AcroForm object, I ran into the issue that I can't get its ID.

I'm getting it via the parser:

var catalog = pdfReader.queryDictionaryObject(pdfReader.getTrailer(), 'Root');
var acroform = pdfReader.queryDictionaryObject(catalog, 'AcroForm');
var fieldsRoot = pdfReader.queryDictionaryObject(acroform, 'Fields');

I can't create a modified version of it (by using its ID).

Do you need the sample PDF file that I'm trying to modify, or my source code?

Thanks!

galkahana commented 8 years ago

Ah, I see. OK. To get the acroform ID, use direct dictionary access instead of going through pdfReader.queryDictionaryObject, like this:

var acroformID = catalog.queryObject('AcroForm').toPDFIndirectObjectReference().getObjectID();

I'm trusting here that the AcroForm value is an indirect object reference, and so getObjectID would work. Try it first... if it doesn't work, let's look at doing something else.

Even if you don't have more questions, I would really love to get a sample PDF file with a form so I can make my own tests. More so if you do.

Gal.
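The direct-versus-indirect distinction discussed above can be checked before calling getObjectID. The following is only a sketch under assumptions: `locateAcroForm` is a hypothetical helper name, the `hummus` module is passed in as a parameter, and the code relies on `toJSObject()` omitting keys that are absent from the catalog.

```javascript
// Hypothetical helper: report whether the catalog's AcroForm entry is an
// indirect reference (and its object ID) or a direct dictionary.
// `hummus` is the required module; `pdfReader` is the parser obtained from
// copyingContext.getSourceDocumentParser().
function locateAcroForm(hummus, pdfReader) {
    var catalog = pdfReader.queryDictionaryObject(pdfReader.getTrailer(), 'Root');
    var entry = catalog.toJSObject().AcroForm; // unresolved entry, may be a reference
    if (!entry) {
        return null; // assumption: a missing key is simply absent from the JS object
    }
    if (entry.getType() === hummus.ePDFObjectIndirectObjectReference) {
        return {
            isIndirect: true,
            objectID: entry.toPDFIndirectObjectReference().getObjectID()
        };
    }
    // Direct dictionary: there is no ID to modify, so the catalog itself
    // would have to be rewritten with a new AcroForm definition.
    return { isIndirect: false, objectID: null };
}
```

When `isIndirect` is true, the returned `objectID` can be passed to startModifiedIndirectObject; otherwise the catalog is the object to modify.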

fedyunin commented 8 years ago

thank you, I will try it. My sample PDF file: Rockwood - Cyber Liability Insurance App.pdf

fedyunin commented 8 years ago

So I used this way to get the acroformID. I can write the Fields objects this way:

objectsContext.startModifiedIndirectObject(acroformID);
var modifiedAcroformObject = pdfWriter.getObjectsContext().startDictionary();
var copyingContext = pdfWriter.createPDFCopyingContextForModifiedFile();
Object.getOwnPropertyNames(acroformObject).forEach(function(element, index, array) {
    if (element != 'Fields') {
        modifiedAcroformObject.writeKey(element);
        copyingContext.copyDirectObjectAsIs(acroformObject[element]);
    }
});

modifiedAcroformObject.writeKey('Fields');
objectsContext.startArray();

objectsContext.writeIndirectObjectReference(fieldObjectID001);
objectsContext.writeIndirectObjectReference(fieldObjectID002);
...
objectsContext.writeIndirectObjectReference(fieldObjectID_n);

objectsContext
    .endArray()
    .endLine()
    .endDictionary(modifiedAcroformObject)
    .endIndirectObject();

galkahana commented 8 years ago

looks good to me. is it working?

fedyunin commented 8 years ago

Yes, it works. What remains is to correctly write the value for a field (fill the field), but that is related to the PDF spec.

thanks!

galkahana commented 8 years ago

cool

galkahana commented 8 years ago

A word of advice here. I see text fields accept a "text string" for values, which is the PDF encoding method. You can use the PDFTextString class provided by hummus to do the encoding for you.

For instance, say you want to write the "V" part, you can do something like this:

modifiedFieldObject.writeKey('V');
modifiedFieldObject.writeLiteralStringValue(new hummus.PDFTextString('hello').toBytesArray());

[I mean... I never tried specifically this... but it should work. At least, worth a try].

Gal.
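For background on what a "text string" is: the PDF spec allows text strings to be stored either in PDFDocEncoding or as UTF-16BE prefixed with the byte order mark FE FF. The sketch below only illustrates the second form; it is not hummus's implementation, and `toUtf16BEWithBom` is a hypothetical name. In real code PDFTextString does the encoding for you.

```javascript
// Illustration only: encode a JavaScript string as UTF-16BE bytes with a
// leading FE FF byte order mark, one of the encodings allowed for PDF
// text strings.
function toUtf16BEWithBom(text) {
    var bytes = [0xFE, 0xFF];
    for (var i = 0; i < text.length; i++) {
        var unit = text.charCodeAt(i); // JavaScript strings are UTF-16 code units
        bytes.push(unit >> 8, unit & 0xFF); // high byte first (big endian)
    }
    return bytes;
}
```

For example, `toUtf16BEWithBom('A')` yields the four bytes FE FF 00 41.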

fedyunin commented 8 years ago

Hello! Yes, the way above for writing a LiteralStringValue works. Thanks!

But it looks like I'm stuck with filling values into fields.

I implemented rewriting the objects from the Fields array of the AcroForm object. I can correctly write the value (V) for the necessary fields. Example:

if (fieldJSDict.FT.value === "Btn") {
 dictionaryContext.writeKey("V").writeNameValue(value); 
 dictionaryContext.writeKey("AS").writeNameValue(value);
} else {
 dictionaryContext.writeKey('V').writeLiteralStringValue(new hummus.PDFTextString(value).toBytesArray());
}

But the resulting PDF doesn't look good. Checkboxes/radio buttons are not filled. Text fields are filled, but in a strange way:

One field is filled, but the others are empty: image

When I click on a field, the filled value is shown: image

Questions:

  1. I think part of these issues may be connected with the incorrect way I recreate field objects. I recreate the whole object tree, but some attributes in the objects can refer to incorrect object IDs (PDFIndirectObject), because they refer to obsolete object IDs while I created new objects.

Question: can I somehow use copyingContext.copyDirectObjectAsIs() when I recreate field objects?

  2. While researching PDF filling I noticed that other libs sometimes "flatten" the PDF after filling fields. Could you explain this? Do I need this too? Is there something similar in HummusJS?

My result PDF file: rockwood--cyber-liability-insurance-app.pdf_modified.pdf

Thanks!

galkahana commented 8 years ago

Hi Alexey,

Good job on the text...that's one part working well.

Checkboxes: look into the value that you provide for the checkbox in:

dictionaryContext.writeKey("V").writeNameValue(value); 
dictionaryContext.writeKey("AS").writeNameValue(value);

They should match the relevant appearance streams. I think you can assume that most use Yes and Off respectively (but I guess your input value is probably something like boolean true/false).
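Rather than assuming Yes, the available state names can be read from the widget itself: for checkboxes and radio buttons the AP dictionary's N entry is a dictionary keyed by state name. A minimal sketch, assuming `getOnStateNames` as a hypothetical helper, the `hummus` module passed in as a parameter, and a widget whose AP entry (when present) is a dictionary:

```javascript
// Hypothetical helper: list the "on" appearance state names of a checkbox
// or radio widget, i.e. every key of AP/N other than Off.
function getOnStateNames(hummus, pdfReader, widgetDict) {
    if (!widgetDict.toJSObject().AP) {
        return []; // no appearance dictionary at all
    }
    var ap = pdfReader.queryDictionaryObject(widgetDict, 'AP');
    if (!ap.toJSObject().N) {
        return [];
    }
    var normal = pdfReader.queryDictionaryObject(ap, 'N');
    if (normal.getType() !== hummus.ePDFObjectDictionary) {
        return []; // N is a single stream (e.g. a text field), no named states
    }
    return Object.keys(normal.toJSObject()).filter(function (name) {
        return name !== 'Off';
    });
}
```

The name returned here is what would be written for both V and AS to turn the widget on; Off turns it off.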

Text fields and weird copying behavior: I would be in a better position to tell whether there's a problem with the field copying if I had a code sample of what you are trying to do.

In general, my suggestion would be not to create new objects, but rather to create modified versions. This way the referenced object ID remains the same and you don't need to worry about referencing obsolete IDs [note that this is what you do to the acroform element, right?].
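The "modified version" approach can be wrapped in one helper built from the same calls already used in this thread for the acroform object. This is a sketch, not a tested recipe: `writeModifiedDictionary` and its parameter names are hypothetical, and it assumes every untouched entry can be copied with copyDirectObjectAsIs.

```javascript
// Hypothetical helper: rewrite the dictionary stored under objectID, copying
// every entry except those listed in replacedKeys, then letting the caller
// write the replacement entries.
function writeModifiedDictionary(objectsContext, copyingContext, objectID,
                                 jsDict, replacedKeys, writeReplacements) {
    objectsContext.startModifiedIndirectObject(objectID); // keeps the original ID
    var dictionaryContext = objectsContext.startDictionary();
    Object.getOwnPropertyNames(jsDict).forEach(function (key) {
        if (replacedKeys.indexOf(key) === -1) {
            dictionaryContext.writeKey(key);
            copyingContext.copyDirectObjectAsIs(jsDict[key]);
        }
    });
    writeReplacements(dictionaryContext);
    objectsContext.endDictionary(dictionaryContext).endIndirectObject();
}

// Possible usage for a checkbox (fieldID and fieldJSDict come from parsing):
// writeModifiedDictionary(objectsContext, copyingContext, fieldID, fieldJSDict,
//     ['V', 'AS'], function (dict) {
//         dict.writeKey('V').writeNameValue('Yes');
//         dict.writeKey('AS').writeNameValue('Yes');
//     });
```

Because the object ID is reused, nothing that already points at the field (parents, page Annots arrays, the AcroForm Fields array) has to be updated.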

Best would be to receive some sample code from you that I can help debug.

I'm attaching here some code that I just wrote for parsing form field values. This way (and if I don't have bugs) you should be able to tell if the values are set as you intended.

Regards, Gal. test.js.zip

fedyunin commented 8 years ago

Hi Gal!

Here is my code:

                .......
                var copyingContext = pdfWriter.createPDFCopyingContextForModifiedFile();
                var objectsContext = pdfWriter.getObjectsContext();

                var pdfReader = copyingContext.getSourceDocumentParser();
                var catalog = pdfReader.queryDictionaryObject(pdfReader.getTrailer(), 'Root');
                var acroform = pdfReader.queryDictionaryObject(catalog, 'AcroForm');
                var fieldsRoot = pdfReader.queryDictionaryObject(acroform, 'Fields');

                var objects = []; // updated fields object IDs
                var allFieldObjects = [];
                getFieldsObjects(fieldsRoot, allFieldObjects, pdfReader); // see implementation below

                for (var i = 0; i < allFieldObjects.length; ++i) {
                    var fieldJSDict = allFieldObjects[i];
                    var newFieldObject = objectsContext.startNewIndirectObject();
                    objects.push(newFieldObject);
                    var dictionaryContext = objectsContext.startDictionary();
                    writeObjectToContext(dictionaryContext, fieldJSDict, pdfWriter); // see implementation below

                    var name = recursiveBuildFieldName(pdfReader, fieldJSDict, fieldJSDict.T.toText());
                    var value = body[name];

                    if (body.hasOwnProperty(name)) {
                        if (fieldJSDict.FT.value === "Btn") {
                            dictionaryContext.writeKey("V").writeNameValue(value);
                            dictionaryContext.writeKey("AS").writeNameValue(value);
                        }
                        else {
                            dictionaryContext.writeKey('V').writeLiteralStringValue(new hummus.PDFTextString(value).toBytesArray());
                        }
                    }

                    objectsContext
                        .endDictionary(dictionaryContext)
                        .endIndirectObject();

                }

                var acroformID = catalog.queryObject('AcroForm').toPDFIndirectObjectReference().getObjectID();
                var acroformObject = pdfReader.parseNewObject(acroformID).toJSObject();

                objectsContext.startModifiedIndirectObject(acroformID);
                var modifiedAcroformObject = pdfWriter.getObjectsContext().startDictionary();
                Object.getOwnPropertyNames(acroformObject).forEach(function(element, index, array) {
                    if (element != 'Fields') {
                        modifiedAcroformObject.writeKey(element);
                        copyingContext.copyDirectObjectAsIs(acroformObject[element]);
                    }
                });
                modifiedAcroformObject.writeKey('Fields');
                objectsContext.startArray();

                for (var i = 0; i < objects.length; i++) {
                    objectsContext.writeIndirectObjectReference(objects[i]);
                }

                objectsContext
                    .endArray()
                    .endLine()
                    .endDictionary(modifiedAcroformObject)
                    .endIndirectObject();

                pdfWriter.end();

/**
 * Recursively walk through sourceFieldsArray and extract all Fields objects to fields.
 */
function getFieldsObjects(sourceFieldsArray, fields, pdfReader) {
    for (var i = 0; i < sourceFieldsArray.getLength(); ++i) {
        var origObj = pdfReader.queryArrayObject(sourceFieldsArray, i);
        var fieldJSDict = origObj.toJSObject();
        if (fieldJSDict.FT) {
            fields.push(fieldJSDict);
        }
        if (fieldJSDict.Kids) {
            getFieldsObjects(fieldJSDict.Kids, fields, pdfReader);
        }
    }
}

/**
 * Write field object to context. Writes dictionary object to context. Recursively. 
 */
function writeObjectToContext(dictionaryContext, fieldJSDict, pdfWriter) {
    Object.getOwnPropertyNames(fieldJSDict).forEach(function(element, index, array) {
        if (element != "V" && element != "AS") {
            var value = fieldJSDict[element];
            var type = value.constructor.name;
            try {
                type = value.getType();
            }
            catch (e) {
                // skip
                console.info("error");
            }
            console.info("Write key to context:" + element + " type:" + value.constructor.name);
            var objectsContext = pdfWriter.getObjectsContext();
            switch (type) {
                case hummus.ePDFObjectDictionary:
                    dictionaryContext.writeKey(element);
                    var apDict = objectsContext.startDictionary();
                    writeObjectToContext(apDict, value.toJSObject(), pdfWriter);
                    objectsContext.endDictionary(apDict);
                    break;
                case hummus.ePDFObjectIndirectObjectReference:
                    dictionaryContext.writeKey(element).writeObjectReferenceValue(value.getObjectID());
                    break;
                case hummus.ePDFObjectLiteralString:
                    dictionaryContext.writeKey(element).writeLiteralStringValue(pdfWriter.createPDFTextString(value.value).toBytesArray());
                    break;
                case hummus.ePDFObjectInteger:
                case hummus.ePDFObjectReal:
                    dictionaryContext.writeKey(element);
                    objectsContext.writeNumber(value.value);
                    break;
                case hummus.ePDFObjectName:
                    dictionaryContext.writeKey(element).writeNameValue(value.value);
                    break;
                case hummus.ePDFObjectArray:
                    dictionaryContext.writeKey(element);
                    if (element == "Rect") {
                        dictionaryContext.writeRectangleValue(value.toJSArray()[0].value, value.toJSArray()[1].value, value.toJSArray()[2].value, value.toJSArray()[3].value);
                    }
                    else {
                        objectsContext.startArray();
                        var arrayJs = value.toJSArray();
                        for (var k = 0; k < arrayJs.length; k++) {
                            var item = arrayJs[k];
                            try {
                                switch (item.getType()) {
                                    case hummus.ePDFObjectInteger:
                                        objectsContext.writeNumber(item.value);
                                        break;
                                    case hummus.ePDFObjectIndirectObjectReference:
                                        objectsContext.writeIndirectObjectReference(item.getObjectID());
                                        break;
                                    default:
                                        console.error("Need to implement write array element value for type: " + item.constructor.name);
                                        break;
                                }
                            }
                            catch (e) {
                                writeObjectToContext(dictionaryContext, item, pdfWriter);
                            }
                        }
                        objectsContext.endArray().endLine();
                    }
                    break;
                case "Number":
                    objectsContext.writeNumber(value);
                    break;
                default:
                    console.error("Need to implement write to context for type:" + type);
                    break;
            }
        }
    });

}

galkahana commented 8 years ago

Hi, Thanks for the code. looks wonderful. some notes:

  1. As noted in my previous comment: make sure that the values that you provide for checkboxes are names of appearance streams. should be something like Yes or Off. i'm fairly sure that's the problem (you probably get true/false from body[name]).
  2. While it's possible that writeObjectToContext does a good job, I think you can instead use the same method as used in the main code for recreating the acroform. Namely, instead of the call to writeObjectToContext use:

Object.getOwnPropertyNames(fieldJSDict).forEach(function(element, index, array) {
    if (element != 'V' && element != 'AS') {
        dictionaryContext.writeKey(element);
        copyingContext.copyDirectObjectAsIs(fieldJSDict[element]);
    }
});

I would recommend placing conditions: for a text box, condition only on V; for a checkbox, on both V and AS. I would also recommend a tighter check on the checkbox types, not just that they are buttons, unless you know that all Btn types are checkboxes in your form. Just check the Ff value (you can look at the code I sent you to see how to verify that).

I would also recommend conditioning the modified copying version on if(body[name]). This way fields that don't get autocompleted will retain their original value (for them just copy the whole object as is, and don't bother with checking V or AS).

  3. It seems the algorithm in getFieldsObjects will scan correctly for fields in the acroform. You might want to do a print there to make sure you got all the fields that you intended to get. To avoid creating references to obsolete IDs, as Alexey suspects may happen, you may want to collect not just the dict but also, in case it's referenced via an indirect object reference, its object ID. Later, when recreating the object, if it was originally used as an indirect object reference, don't use startNewIndirectObject; rather use startModifiedIndirectObject with the original reference ID. This way, for indirect object fields you will retain the same reference ID, so no obsolete IDs, and for fields that are direct objects... well... no need to worry about someone else referencing them, so no problem there.
  4. For fields that are missing and come into view when you click on them... I don't really know. It might be that they don't fall into the definition of variable data and so don't have a variable data appearance stream... but first let's verify that the rest works fine, and we'll look into that then.

Gal.
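The suggestion in the third note, collecting each field's object ID together with its dictionary, could look roughly like this. It is a sketch under assumptions: `collectFields` is a hypothetical name, the `hummus` module is passed in as a parameter, and every Kids entry is collected (including pure widget annotations), so callers may still want to filter on T or FT.

```javascript
// Hypothetical helper: walk a Fields/Kids array and record, for every entry,
// its parsed dictionary plus the object ID it was referenced by
// (null when the entry is a direct object).
function collectFields(hummus, pdfReader, fieldsArray, collected) {
    var entries = fieldsArray.toJSArray(); // unresolved entries
    for (var i = 0; i < entries.length; i++) {
        var entry = entries[i];
        var isIndirect = entry.getType() === hummus.ePDFObjectIndirectObjectReference;
        var fieldDict = pdfReader.queryArrayObject(fieldsArray, i); // resolves references
        var fieldJSDict = fieldDict.toJSObject();
        collected.push({
            objectID: isIndirect ? entry.getObjectID() : null,
            dict: fieldJSDict
        });
        if (fieldJSDict.Kids) {
            var kids = pdfReader.queryDictionaryObject(fieldDict, 'Kids');
            collectFields(hummus, pdfReader, kids, collected);
        }
    }
    return collected;
}
```

Entries with a non-null `objectID` can then be rewritten with startModifiedIndirectObject, which keeps every existing reference valid.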

galkahana commented 8 years ago

And writing should probably maintain the hierarchy, no? I see reading is recursive, but writing just writes all the objects under the main Fields array. You should probably only have to write the top-level ones.

fedyunin commented 8 years ago

Thank you for the advice! I will rework my code accordingly and let you know.

fedyunin commented 8 years ago

Hello! OK, I made changes according to your notes. What I have at this point:

........
                var pdfWriter = hummus.createWriterToModify(pdfFileName, {
                    modifiedFilePath: pdfFileName + "_modified.pdf"
                });

                var copyingContext = pdfWriter.createPDFCopyingContextForModifiedFile();
                var objectsContext = pdfWriter.getObjectsContext();

                var pdfReader = copyingContext.getSourceDocumentParser();
                var catalog = pdfReader.queryDictionaryObject(pdfReader.getTrailer(), 'Root');
                var acroform = pdfReader.queryDictionaryObject(catalog, 'AcroForm');
                var fieldsRoot = pdfReader.queryDictionaryObject(acroform, 'Fields');

                var objects = []; // updated fields object IDs

                writeValuesToFields(pdfReader, pdfWriter, copyingContext, objectsContext, fieldsRoot, body, objects);

                var acroformID = catalog.queryObject('AcroForm').toPDFIndirectObjectReference().getObjectID();
                var acroformObject = pdfReader.parseNewObject(acroformID).toJSObject();

                objectsContext.startModifiedIndirectObject(acroformID);
                var modifiedAcroformObject = pdfWriter.getObjectsContext().startDictionary();
                Object.getOwnPropertyNames(acroformObject).forEach(function(element, index, array) {
                    if (element != 'Fields') {
                        modifiedAcroformObject.writeKey(element);
                        copyingContext.copyDirectObjectAsIs(acroformObject[element]);
                    }
                });
                modifiedAcroformObject.writeKey('Fields');
                objectsContext.startArray();

                for (var i = 0; i < objects.length; i++) {
                    objectsContext.writeIndirectObjectReference(objects[i]);
                }

                objectsContext
                    .endArray()
                    .endLine()
                    .endDictionary(modifiedAcroformObject)
                    .endIndirectObject();

                pdfWriter.end();

I decided to write objects to the context at the same time as I iterate over them (rather than collecting them in a first step). I also actively use startModifiedIndirectObject(objID) so that the object ID does not change. And I detect the type of each field, to apply custom logic for writing its value:

function writeValuesToFields(pdfReader, pdfWriter, copyingContext, objectsContext, objects, fieldsToModify, updatedObjectIDs) {
    for (var i = 0; i < objects.getLength(); ++i) {
        var obj = pdfReader.queryArrayObject(objects, i);
        var objID = objects.toJSArray()[i].getObjectID();
        var fieldJSDict = obj.toJSObject();

        updatedObjectIDs.push(objID);

        var processKids = true;

        if (fieldJSDict.T) {
            var name = recursiveBuildFieldName(pdfReader, fieldJSDict, fieldJSDict.T.toText());

            if (fieldsToModify.hasOwnProperty(name)) { // should we for this object write value?

                if ("topmostSubform[0].Page1[0].RadioButtonList[0]" === name) {
                    console.info("break here.");
                }

                var value = fieldsToModify[name];
                var type = getFieldType(fieldJSDict); // I created method for detect type of field, I used your code which you wrote in the test.js file.
                //yes
                console.info("Process field: " + name);

                objectsContext.startModifiedIndirectObject(objID);
                var dictionaryContext = objectsContext.startDictionary();

                var skip = "";
                switch (type) {
                    case FIELD_TYPES.TEXT:
                        dictionaryContext.writeKey('V').writeLiteralStringValue(new hummus.PDFTextString(value).toBytesArray());
                        break;
                    case FIELD_TYPES.RADIO: // probably for radio need custom logic for write value:
                        dictionaryContext.writeKey("V").writeNameValue(value);
                        dictionaryContext.writeKey("AS").writeNameValue(value);
                        skip = "AS";
                        break;
                    case FIELD_TYPES.CHECKBOX:
                        dictionaryContext.writeKey("V").writeNameValue(value);
                        dictionaryContext.writeKey("AS").writeNameValue(value);
                        skip = "AS";
                        break;
                }

                writeObjectToContext(dictionaryContext, fieldJSDict, pdfWriter, skip);

                objectsContext
                    .endDictionary(dictionaryContext)
                    .endIndirectObject();
            }
            else {
                //no
                console.info("Just copy field: " + name);
                copyingContext.copyDirectObjectAsIs(obj);
            }
        }

        if (fieldJSDict.Kids && processKids) {
            writeValuesToFields(pdfReader, pdfWriter, copyingContext, objectsContext, fieldJSDict.Kids, fieldsToModify, updatedObjectIDs);
        }
    }
}
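The getFieldType helper and the FIELD_TYPES constants used above are not shown in the thread. A minimal sketch of how such a helper might look, based on the FT and Ff entries defined by the PDF spec (radio button is flag bit 16, pushbutton is flag bit 17). The constant values are assumptions, and the sketch reads FT and Ff only from the field's own dictionary even though both may be inherited from a parent field:

```javascript
// Hypothetical constants; the thread only shows TEXT, RADIO and CHECKBOX.
var FIELD_TYPES = {
    TEXT: 'text',
    CHECKBOX: 'checkbox',
    RADIO: 'radio',
    PUSHBUTTON: 'pushbutton',
    CHOICE: 'choice',
    SIGNATURE: 'signature',
    UNKNOWN: 'unknown'
};
var FLAG_RADIO = 1 << 15;      // Ff bit 16
var FLAG_PUSHBUTTON = 1 << 16; // Ff bit 17

// fieldJSDict is the result of toJSObject() on a field dictionary.
function getFieldType(fieldJSDict) {
    var ft = fieldJSDict.FT ? fieldJSDict.FT.value : null;
    var flags = fieldJSDict.Ff ? fieldJSDict.Ff.value : 0;
    switch (ft) {
        case 'Tx':
            return FIELD_TYPES.TEXT;
        case 'Ch':
            return FIELD_TYPES.CHOICE;
        case 'Sig':
            return FIELD_TYPES.SIGNATURE;
        case 'Btn':
            if (flags & FLAG_PUSHBUTTON) {
                return FIELD_TYPES.PUSHBUTTON;
            }
            return (flags & FLAG_RADIO) ? FIELD_TYPES.RADIO : FIELD_TYPES.CHECKBOX;
        default:
            return FIELD_TYPES.UNKNOWN; // FT missing here, possibly inherited
    }
}
```

A button field with neither flag set is treated as a checkbox, which matches the spec's definition.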

This function is slightly changed:

/**
 * Write field object to context. Writes dictionary object to context. Recursively. 
 */
function writeObjectToContext(dictionaryContext, fieldJSDict, pdfWriter, skip) {
    Object.getOwnPropertyNames(fieldJSDict).forEach(function(element, index, array) {
        if (element !== "V" && (!skip || element !== skip)) {
            var value = fieldJSDict[element];
            var type = value.constructor.name;
            try {
                type = value.getType();
            }
            catch (e) {
                // skip
                console.info("error");
            }
            console.info("Write key to context:" + element + " type:" + value.constructor.name);
            var objectsContext = pdfWriter.getObjectsContext();
            switch (type) {
                case hummus.ePDFObjectDictionary:
                    dictionaryContext.writeKey(element);
                    var apDict = objectsContext.startDictionary();
                    writeObjectToContext(apDict, value.toJSObject(), pdfWriter, skip);
                    objectsContext.endDictionary(apDict);
                    break;
                case hummus.ePDFObjectIndirectObjectReference:
                    dictionaryContext.writeKey(element).writeObjectReferenceValue(value.getObjectID());
                    break;
                case hummus.ePDFObjectLiteralString:
                    dictionaryContext.writeKey(element).writeLiteralStringValue(pdfWriter.createPDFTextString(value.value).toBytesArray());
                    break;
                case hummus.ePDFObjectInteger:
                case hummus.ePDFObjectReal:
                    dictionaryContext.writeKey(element);
                    objectsContext.writeNumber(value.value);
                    break;
                case hummus.ePDFObjectName:
                    dictionaryContext.writeKey(element).writeNameValue(value.value);
                    break;
                case hummus.ePDFObjectArray:
                    dictionaryContext.writeKey(element);
                    if (element == "Rect") {
                        dictionaryContext.writeRectangleValue(value.toJSArray()[0].value, value.toJSArray()[1].value, value.toJSArray()[2].value, value.toJSArray()[3].value);
                    }
                    else {
                        objectsContext.startArray();
                        var arrayJs = value.toJSArray();
                        for (var k = 0; k < arrayJs.length; k++) {
                            var item = arrayJs[k];
                            try {
                                switch (item.getType()) {
                                    case hummus.ePDFObjectInteger:
                                        objectsContext.writeNumber(item.value);
                                        break;
                                    case hummus.ePDFObjectIndirectObjectReference:
                                        objectsContext.writeIndirectObjectReference(item.getObjectID());
                                        break;
                                    default:
                                        console.error("Need to implement write array element value for type: " + item.constructor.name);
                                        break;
                                }
                            }
                            catch (e) {
                                writeObjectToContext(dictionaryContext, item, pdfWriter, skip);
                            }
                        }
                        objectsContext.endArray().endLine();
                    }
                    break;
                case "Number":
                    objectsContext.writeNumber(value);
                    break;
                default:
                    console.error("Need to implement write to context for type:" + type);
                    break;
            }
        }
    });

}

I receive the field values after an HTML form is submitted. This form was created earlier by parsing the source PDF file. We also use HummusJS to parse the fields and generate the HTML.

So I have a map object "body" with name = value pairs.

For checkboxes it contains the names of their AP (appearance stream) objects, i.e. the AP object that shows the checked state. This all works fine; I can just use this value. The same goes for radios. E.g.:

body: {
    "checkbox1": "Yes",
    "radio1": "1",
    "checkbox2": "2", // 2 is the name of the AP object for the ON state
    "text": "test"
}

After these changes I can finally fill checkboxes and radios! Unfortunately the strange behaviour of text fields (the text in a field is not shown until the cursor is focused on it) still exists, but not for all PDF files. This probably needs some research. I hope it can be fixed. There also remains an issue with filling radios (for some PDF files). In that case I probably need to check for issues at the parsing stage (when I retrieve the AP objects and get their names to use as values). Thanks!

galkahana commented 8 years ago

It's possible that the empty text fields don't have defs for variable text, somehow, in which case you may want to add the required appearance stream template.

Good going! Gal

fedyunin commented 8 years ago

Hello! Yes, right: to show a text field's value correctly, it needs a correct "appearance stream".

I tried to update an existing AP this way:

objectsContext.startModifiedIndirectObject(apObjectId); // apObjectID - ID of AP for update
// create Dictionary for write to stream:
var apDict = objectsContext.startDictionary();

// copy / create keys/values from correct AP (which show text) or create a new one:
<copy/create dict keys etc>

objectsContext.endDictionary(apDict);

var streamCxt = objectsContext.startPDFStream(apDict); // this throws a segmentation fault error.
objectsContext.endPDFStream(streamCxt);

I found how to write a dictionary to a stream in the docs: https://github.com/galkahana/HummusJS/wiki/Extensibility#pdf-streams

Could you please clarify how I can modify an existing AP (e.g. write resources: Font, etc.)? The main issue here is how to write a dictionary to a stream.

Thanks!

galkahana commented 8 years ago

Hi Alexey. Great job. I need to read into this and I was very busy. Hopefully I can find some time this evening, so I can work out what needs to be done.

From the little that I read in "8.4.4 Appearance Streams", they are actually form xobjects. In that case I would recommend (and hopefully it's the right way) simply creating a form xobject with the new appearance as a replacement for the old object, replacing the existing stream. Using a form xobject will take care of resources by itself. You can read about forms in hummus here - https://github.com/galkahana/HummusJS/wiki/Reusable-forms

You can start a form xobject with an existing object ID by passing 5 numbers instead of 4 (the initial 4 are the bbox of the form, which you can read from the existing AP or create new ones of your own).

When I get back home today I'll try to find time to read more, to see that we're not missing something.

Gal.

fedyunin commented 8 years ago

Hello! Thank you very much for the answer! I will try using xobjects. I'm going to read more about them too.

galkahana commented 8 years ago

k. so here's what i think is going on, having read 8.6.2 [form fields], 8.4.4 [appearance streams] and 8.4.5 [annotation types, specifically widget annotation].

Any form field that is terminal should either have a kids array with widget annotations describing its display, or be itself a widget annotation. by being itself, i mean that it will not have a kids array, but rather contain the widget annotation keys in itself.
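To make the two layouts concrete, here is a minimal sketch that works on plain JS objects standing in for parsed dictionaries. The object shapes and the helper name `widgetsOfTerminalField` are assumptions for illustration only, not part of the HummusJS API.

```javascript
// Assumption: `field` is a plain JS object mirroring a terminal field dictionary.
// Kids that are widget annotations have Subtype "Widget" and no T (partial name) key.
function widgetsOfTerminalField(field) {
  var kids = field.Kids || [];
  var widgetKids = kids.filter(function (kid) {
    return kid.Subtype === 'Widget' && kid.T === undefined;
  });
  // No widget kids: the field dictionary doubles as its own widget annotation.
  return widgetKids.length > 0 ? widgetKids : [field];
}

// Field merged with its widget: the AP lives on the field itself.
var merged = { T: 'Name', FT: 'Tx', Subtype: 'Widget', AP: { N: 3 } };
// Field with separate widget kids: each kid carries its own AP.
var withKids = { T: 'Other', FT: 'Tx', Kids: [{ Subtype: 'Widget', AP: { N: 31 } }] };

console.log(widgetsOfTerminalField(merged).length);   // 1 (the field itself)
console.log(widgetsOfTerminalField(withKids)[0].AP);  // { N: 31 }
```

Either way, the caller ends up with the list of dictionaries whose AP entry would need updating.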

widget annotation includes "AP" to describe the appearance stream. this AP points to a form xobject that describes the appearance of the field. that's one appearance stream to keep in mind. let's call it AP. i'm guessing that in case of text field, then if there is one, it defines whats displayed before you start editing in acrobat. I am guessing (and that's pure guess, but can be verified by parsing) that acrobat updates this AP if the file gets saved, to reflect the new appearance. I think that this must happen for proper later printing.

In addition a text field will have a "DA" appearance string (string! not stream) which defines variable appearance. you need variable appearance because the form will have text edited in acrobat and you want to understand how to show it AND how to construct an AP form xobject once the file is saved.

From what i parsed from your example DAs can look like this: "/Helv 9 Tf 0 g", which means helvetica size 9, in black.
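As a rough illustration of reading such a string, the sketch below pulls the font name, size and gray level out of a DA value. `parseDA` is a hypothetical helper, and it only handles the simple "/Font size Tf gray g" shape seen in this thread; real DA strings can also use other colour operators (rg, k).

```javascript
// Hypothetical helper: extract font name, font size and gray level from a DA string.
// Operators follow their operands, so we look backwards from each operator token.
function parseDA(da) {
  var tokens = da.trim().split(/\s+/);
  var result = { fontName: null, fontSize: null, gray: null };
  for (var i = 0; i < tokens.length; i++) {
    if (tokens[i] === 'Tf' && i >= 2) {
      result.fontName = tokens[i - 2].replace(/^\//, ''); // drop the leading slash
      result.fontSize = parseFloat(tokens[i - 1]);
    } else if (tokens[i] === 'g' && i >= 1) {
      result.gray = parseFloat(tokens[i - 1]); // 0 = black, 1 = white
    }
  }
  return result;
}

console.log(parseDA('/Helv 9 Tf 0 g'));
// { fontName: 'Helv', fontSize: 9, gray: 0 }
```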

so long story short, i expect that you should already have a DA in text fields, you just need to recreate an AP, N (normal) stream. To do this create a form xobject using the instructions in 8.6.2 variable text. the content context of hummus should have all the required commands (Q, q etc) and will also take care of creating font definitions and embedding the correct glyphs in the same way as when you use writeText. you can obviously use writeText itself. Note that hummus requires that form xobjects are created for NEW object IDs and not modified, so you'll have to change the AP/N entry in the widget annotation to point to the new form xobjects. hope that's not too much trouble. the code can be changed to reuse object IDs of the old form, but it's slightly complex. so hopefully you can do without it.

It might be a good idea to parse a file that has a good appearance stream and make sure that my notes are valid.

Good luck, Gal.

fedyunin commented 8 years ago

Hello! The main idea looks clear. Thank you for the explanation!

What I did:

  1. Create a new XObject for the AP:

(apObjectId - existing AP)

var apObject = pdfReader.parseNewObject(apObjectId).getDictionary().toJSObject();
// read the four BBox numbers from the existing AP
var bbox = apObject.BBox.toJSArray().map(function (n) { return n.value; });
var xobjectForm = pdfWriter.createFormXObject(bbox[0], bbox[1], bbox[2], bbox[3]);

var font = pdfWriter.getFontForFile(__dirname + '/arial.ttf');
// q/Q wrap the text object (BT...ET) rather than sitting inside it
xobjectForm.getContentContext()
                        .q()
                        .BT()
                        .k(0, 0, 0, 1)
                        .Tf(font, 9)
                        .Tj(value)
                        .ET()
                        .Q();
pdfWriter.endFormXObject(xobjectForm);

  2. Write the id of the new XObject to the field object being modified (the annotation object, which has the AP):

var id = xobjectForm.id; // XObject id
dictionaryContext.writeKey('V').writeLiteralStringValue(new hummus.PDFTextString(value).toBytesArray());
dictionaryContext.writeKey('AP');
var apDict = objectsContext.startDictionary();
apDict.writeKey("N").writeObjectReferenceValue(id); // new ID write here
objectsContext.endDictionary(apDict);

But this still doesn't work for me. Could you look at my code and let me know whether this is the correct approach for recreating the AP?

galkahana commented 8 years ago

Code seems fine to me. perhaps a DA string is required? at times like this i like to see what acrobat does. how about you take a pdf with a form that has a field that shows the problem, fill it out in acrobat, save and parse the result, and figure out what they put in their Field object and widget annotation object?

Gal.

fedyunin commented 8 years ago

Hello!

1) Field which is shown correctly (1 - Name of Organization):

DA="/Helv 9 Tf 0 g"
AP->N->3 (3 is ID of reference object - stream)

Reference object - stream contents:
Bbox=0, 0, 436.4403, 12.47998
Filter=FlateDecode
Length=82
Resources=[
Font=Helv (PDFIndirectObjectReference)
ProcSet=PDF,Text
]
Subtype=Form
Type=XObject

2) Field which is not shown correctly (2 - Other (describe): ):

DA="/Helv 9 Tf 0 g"
AP->N->31 (31 is ID of reference object - stream)

Reference object - stream contents:
Bbox=0, 0, 72.00003, 15.47998
Length=0
Subtype=Form
Type=XObject

So the main difference: the field which is shown correctly has Length != 0 and Resources.

image

fedyunin commented 8 years ago

I filled in the Name and address values above using an online tool for filling PDFs.

galkahana commented 8 years ago

well it's clear why "describe" has no stream content - it's empty. it makes sense. you didn't fill describe. that's ok.

what we are trying to understand is why when your code actually writes something it doesn't work. can you compare by filling fields with that online tool and with your code? then see if the application does something different to figure out whats wrong.

fedyunin commented 8 years ago

Looks like I made it work!

I use the same code, with just a few changes:

  1. Before starting to write the modified field object's dictionary (the annotation object), I recreate the AP's XObject and get the ID of the newly created XObject.
  2. While writing the dictionary of the modified field, I update AP with the new XObject ID.

What remains is to figure out how to use the standard fonts from PDF (Section 5.5 Simple Fonts), and to adjust how filled fields are displayed (right now the font size, position etc. need adjusting).

image

galkahana commented 8 years ago

Brilliant!!! Note that for convenience you can use the writeText method of content context instead of the BT...ET sequence. it just does that for you. [still might need nesting in q...Q].

Anyways, this is great news. Seems like we have a working solution in our hands!

Some notes, per the future queries:

As for fonts. Hummus pretty much hides the usage of fonts due to its complexity. when you use a font the pdf file has to include the font definition with all the glyphs that you use. Hummus does that for you. So essentially, to use any font just call writeText or Tf with a font object as you did. If, for instance, you want to use helvetica, just point to the helvetica font file, as you did with Arial, and you're done.

If you still want to use standard type 1 fonts, without embedding them, I think that the way to go is to call Tf with the font name as string - cxt.Tf('Helvetica',14). give it a try if you want.

If you want to simply write your own code with a content context you can do this: cxt.writeFreeCode('hello world\r\n') [note that you must finish your code with an ending \r\n regardless of the OS that you are using]. this can become useful if you want to use the DA string for variable text form fields.
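A minimal sketch of what such free code could look like for a text field, following the "/Tx BMC q BT ... ET Q EMC" layout that the PDF reference gives for variable text. The helper names (`escapePDFString`, `buildTextFieldCode`) and the `2 2 Td` offset are assumptions made up for illustration. It only handles plain ASCII values, and the font named in the DA string (e.g. /Helv) would still have to be resolvable from the form's resources or the AcroForm DR dictionary, which is not handled here.

```javascript
// Escape the characters that are special inside a PDF literal string: \ ( )
function escapePDFString(text) {
  return text.replace(/([\\()])/g, '\\$1');
}

// Build content-stream code for a single-line text field from its DA string and value.
// The result ends with \r\n so it could be handed to cxt.writeFreeCode(...).
function buildTextFieldCode(da, value) {
  return [
    '/Tx BMC',                               // marked content for variable text
    'q',
    'BT',
    da,                                      // e.g. "/Helv 9 Tf 0 g"
    '2 2 Td',                                // hypothetical offset inside the BBox
    '(' + escapePDFString(value) + ') Tj',
    'ET',
    'Q',
    'EMC'
  ].join('\r\n') + '\r\n';
}

console.log(buildTextFieldCode('/Helv 9 Tf 0 g', 'Robert (Bob)'));
```

Wrapping the text object in q...Q keeps the graphics state changes from DA from leaking out of the appearance stream.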

Good luck! Gal.

fedyunin commented 8 years ago

Thanks for the advice. I will work on the adjustments and final changes.

Thanks a lot for all your support here. You really helped me dive deep into the PDF format and understand how to use this library!

Thanks!

alexprice1 commented 8 years ago

@fedyunin would you be willing to post your sample code? Trying to figure out how to modify a pdf I have.

code-by commented 7 years ago

@fedyunin I'm also looking for a tool to fill forms. Could you publish your complete code? Thank you

PManager1 commented 7 years ago

@fedyunin Could you please publish the complete code here, along with the before and after PDF files?

trendoid commented 7 years ago

Anyone ever figure out how to flatten a form after filling it out? I have everything working from this thread (awesome btw) but I still need to make things read only, flatten, the pdf that is sent out in my new hummus.PDFStreamForResponse(res)

galkahana commented 7 years ago

wrote working sample code. it's annotated so you can figure out bugs if you can see them. will publish a post with explanations soon, but the code should be beneficial before hand: https://github.com/galkahana/HummusJSSamples/tree/master/filling-form-values

usage: https://github.com/galkahana/HummusJSSamples/blob/master/filling-form-values/main.js

implementation code: https://github.com/galkahana/HummusJSSamples/blob/master/filling-form-values/pdf-form-fill.js

galkahana commented 7 years ago

Now in a post.

cjnqt commented 7 years ago

Anyone ever figure out how to flatten a form after filling it out? I have everything working from this thread (awesome btw) but I still need to make things read only, flatten, the pdf that is sent out in my new hummus.PDFStreamForResponse(res)

@trendoid Did you find a way to flatten the pdf after filling out the form?

Update: Solved it by setting the inputs to read-only. Works for our use-case

HEYGUL commented 6 years ago

@cjnqt

Update: Solved it by setting the inputs to read-only. Works for our use-case

I need to do the same but cannot figure out how to set inputs as read only :/

cjnqt commented 6 years ago

did it when i created the form, using adobe acrobat or pdfescape.com. might also be possible to do with hummusjs, dont know

angieellis commented 6 years ago

Anyone know how to set the inputs to read-only programmatically? I need to "flatten" my pdf form fields after I write a value to them.
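On the read-only question: in a PDF field dictionary, the /Ff (field flags) entry is an integer bit field, and bit position 1 (value 1) is ReadOnly. The sketch below shows only the flag arithmetic, assuming the /Ff value has already been read as a plain number. The helper names are made up for illustration, and writing the value back with HummusJS is not shown. A read-only field is still a form field, so this is not true flattening.

```javascript
var FF_READ_ONLY = 1; // bit position 1 of /Ff

// Return the flags value with ReadOnly switched on; a missing /Ff counts as 0.
function withReadOnly(ff) {
  return (ff || 0) | FF_READ_ONLY;
}

// Check whether a flags value already has ReadOnly set.
function isReadOnly(ff) {
  return ((ff || 0) & FF_READ_ONLY) !== 0;
}

console.log(withReadOnly(undefined)); // 1
console.log(withReadOnly(4096));      // 4097 (other flag bits are preserved)
console.log(isReadOnly(4096));        // false
```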

scopsy commented 6 years ago

@galkahana I also need to flatten the PDF file. Any suggestions on how to do that?

HappySeaFox commented 6 years ago

@galkahana Many thanks for sharing the JS code to update forms. I'm trying to re-use it in a C++ application, but I'm stuck here:

// otherwise, recreate the form as an indirect child (this is going to be a general policy, we're making things indirect. it's simpler), and recreate the catalog
var catalogObjectId = reader.getTrailer().queryObject('Root').toPDFIndirectObjectReference().getObjectID();

What is the C++ variant of toPDFIndirectObjectReference()? How can I get the object's id by its pointer? Thanks!

galkahana commented 6 years ago

you can use direct casting or PDFObjectCastPtr (https://github.com/galkahana/PDF-Writer/wiki/PDF-Parsing#pdfobjectcastptr) to cast a PDFObject to other types. in this case queryObject will bring you a PDFObject and you want to cast to PDFIndirectObjectReference.

this is not getting an object ID by its pointer. it's just that the object that is the value of the root key in the trailer is an indirect object ref.

dbmoderro commented 6 years ago

@galkahana Thanks! I decided to stick with the JS version, i.e. with HummusJS on Ubuntu 18.04. I use your great tutorial found here: https://github.com/galkahana/HummusJSSamples/tree/master/filling-form-values . I have a small PDF with just a single text form field. I modified main.js so it updates the text field with the text "Robert". However the resulting PDF is a little bit weird. When I open the resulting PDF, the text field is empty. If I click on it, it displays the text "Robert". When the field loses focus, it reverts to the empty state back again. Also if I print the resulting PDF, the text field is empty too. Please see name.pdf (the source PDF), and name-edit.pdf (the resulting PDF).

name.pdf name-edit.pdf

main.js looks as follows:

var hummus = require('hummus'),
fillForm = require('./pdf-form-fill').fillForm;

var writer = hummus.createWriterToModify(__dirname + '/name.pdf', {
                modifiedFilePath: __dirname + '/name-edit.pdf'
        });

var data = {
    "Name": "Robert"
};

fillForm(writer,data);
writer.end();

Any ideas? Thanks!

tonybranfort commented 6 years ago

Thanks galkahana for all your work here. I used your sample pdf-form-fill successfully but it does not display the field values in Adobe Acrobat DC - but does in Chrome. I opened Issue # 21 in HummusJSSamples