Hopding / pdf-lib

Create and modify PDF documents in any JavaScript environment
https://pdf-lib.js.org
MIT License

Failure Exception: Converting circular structure to JSON #1692

Open influential-eliot opened 5 days ago

influential-eliot commented 5 days ago

What were you trying to do?

I was trying to take a PDF, read its form details, and return those details in a JSON object. Later, I would like the same function to also return the PDF as a flattened file.

This is running in a Node.js "serverless" fashion.

How did you attempt to do it?

I submitted the file as either an application/pdf or application/octet based Base64 string (non-uri) via the body.$content property. I used the below code in order to do this:

const PDFDocument = require('../pdf-lib').PDFDocument;

module.exports = async function (context, req) {
    //const { PDFDocument } = require('../pdf-lib');
    console.log('PDF Loading ...');
    let pdfDoc = await PDFDocument.load(req.body.$content);
    console.log('PDF Loaded!');
    let form = pdfDoc.getForm();
    console.log('PDF Form Loaded');

    context.res = {
        // status: 200, // Defaults to 200
        "headers": {
            "Content-Type": "application/json"
        },
        "body": {
            "formData": form
        }
    };
};

It seems to get at least as far as being triggerable.


The pdf-lib file is located exactly where you think it might be as per the above ... it is in an index.js AND an index.mjs file ... wasn't sure which to use so just made both. This is being done to avoid potential throttling/abuse of importing directly from the source links.

What actually happened?

The (slightly obfuscated) error that I get says:

Result: Failure
Exception: Converting circular structure to JSON
    --> starting at object with constructor 'PDFContext'
    |     property 'trailerInfo' -> object with constructor 'Object'
    |     property 'ID' -> object with constructor 'PDFArray'
    --- property 'context' closes the circle
Stack: TypeError: Converting circular structure to JSON
    --> starting at object with constructor 'PDFContext'
    |     property 'trailerInfo' -> object with constructor 'Object'
    |     property 'ID' -> object with constructor 'PDFArray'
    --- property 'context' closes the circle
    at JSON.stringify (<anonymous>)
    at t.toTypedData (/serverlesshost/workers/node/dist/src/worker-bundle.js:2:68165)
    at t.toRpcHttp [as converter] (/serverlesshost/workers/node/dist/src/worker-bundle.js:2:71158)
    at /serverlesshost/workers/node/dist/src/worker-bundle.js:2:64607
    at Array.map (<anonymous>)
    at t.InvocationModel.<anonymous> (/serverlesshost/workers/node/dist/src/worker-bundle.js:2:64565)
    at Generator.next (<anonymous>)
    at /serverlesshost/workers/node/dist/src/worker-bundle.js:2:61778
    at new Promise (<anonymous>)
    at p (/serverlesshost/workers/node/dist/src/worker-bundle.js:2:61523)

What did you expect to happen?

Honestly? I didn't expect things to work perfectly first time, I'm trying this all for the first time and I'm probably using the functions wrongly or something ... I had thought I'd got it right, though, from the docs, and a few issues / discussions that I'd read here.

How can we reproduce the issue?

Spin up a free serverless Node.js at a large multinational Tech firm who host a bunch of organisation level cloud tech under a shade of blue. Place the pdf-lib in an accessible folder and go go go... :)

Version

Latest from the links on the site.

What environment are you running pdf-lib in?

Node

Additional Notes

No response

influential-eliot commented 5 days ago

Is this the same as #1645? ( "library chokes when trying to include it in rollup" )

( only just seen @ddtbuilder's issue ... apologies ... I had no idea to search for 'rollup' ... and am still none the wiser as to its meaning 🫢 ... but my search for 'circular structure' found nowt ... sorry to anyone that this might bother! )


EDIT

Some Progress ...

I've changed the following, and it actually doesn't error now ... so I am wondering if the circular errors are because nothing is actually being done with the data, @ddtbuilder?

        ...
        "body": {
            "$content-type": "application/octet", 
            "$content": form.flatten()
        }
        ...

That said, it doesn't respond with anything when I do this ... so ... now it's doing something ... but ... it's unclear exactly what. :(

EDIT_2 ... here's the full, current, code:

const PDFDocument = require('../pdf-lib').PDFDocument;

module.exports = async function (context, req) {
    //const { PDFDocument } = require('../pdf-lib');

    //const pdfBuffer = Buffer.from(pdfBase64, 'base64');
    console.log('PDF Loading ...');
    let pdfDoc = await PDFDocument.load(req.body);
    //let pdfDoc = await PDFDocument.load(req.body.$content);
    console.log('PDF Loaded!');
    let form = pdfDoc.getForm();
    console.log('PDF Form Loaded');
    form.flatten();
    const pdfBytes = await pdfDoc.save();
    let file = new File([pdfBytes], 'file.pdf', {type: 'application/pdf'});

    context.res = {
        // status: 200, // Defaults to 200
        "headers": {
            "Content-Type": "application/pdf"
        },
        "body": file
    };
};

I think I'm probably 'doing it wrong' with the file output ... if anyone can help ... then I'll close this off. :)
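
For what it's worth, here is a minimal sketch of one way the output could be shaped without `File`, which is a browser API and only exists as a global in newer Node versions. `pdfDoc.save()` resolves to a `Uint8Array`, and wrapping that in a Node `Buffer` is a common way to hand binary data to an HTTP layer. The helper name `toPdfResponse` and the stand-in bytes are hypothetical, and whether a given host wants a raw `Buffer` or a Base64 string is an assumption to check against its documentation.

```javascript
// Hypothetical helper: shape the bytes from `await pdfDoc.save()` into a response object.
function toPdfResponse(pdfBytes) {
  // Create a Buffer view over the same memory as the Uint8Array (no copy).
  const body = Buffer.from(pdfBytes.buffer, pdfBytes.byteOffset, pdfBytes.byteLength);
  return {
    headers: { 'Content-Type': 'application/pdf' },
    body,
  };
}

// Stand-in for the Uint8Array that pdfDoc.save() would return (not a real PDF).
const fakePdfBytes = new TextEncoder().encode('%PDF-1.7\n');
const res = toPdfResponse(fakePdfBytes);

console.log(Buffer.isBuffer(res.body));   // binary body, ready for the HTTP layer
console.log(res.body.toString('base64')); // Base64 form, if the caller expects a string
```

If the caller expects the `$content` style of payload used in the request, the Base64 string from the last line is probably the piece to return instead of the raw bytes.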

EDIT_3 OK, so the error is coming back again, and it happens when I try to push things into a usable format, aka JSON.

So, when I try to return the result of getFields(), the error comes up again, presumably because the data in getFields() has circular references ... but I don't know why that would be.

Why would the code below produce the Circular error?

const PDFDocument = require('../pdf-lib').PDFDocument;

module.exports = async function (context, req) {
    const pdfDoc = await PDFDocument.load(req.body)
    const form = pdfDoc.getForm()
    const fields = form.getFields();
    //const fieldsJson = JSON.stringify(form.getFields());
    context.res = {
        // status: 200, // Defaults to 200
        "body": fields
    };
};

Because surely the fields are all individual, right?
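
A small sketch of what the error message appears to be describing, using plain mock objects rather than pdf-lib itself. The shape (`doc` -> `context` -> `trailerInfo` -> `ID` -> `context`) is copied from the property names in the error text and from the `doc` property on each field; everything else here is a stand-in. The fields may be separate objects, but if each one holds a reference to the same document, serializing any of them walks into the loop.

```javascript
// Mock objects mimicking the chain named in the error message.
// These are NOT pdf-lib classes, only stand-ins for illustration.
const context = { trailerInfo: {} };
context.trailerInfo.ID = { context }; // points back at its owner: this closes the circle
const doc = { context };

const fields = [
  { name: 'firstName', doc }, // each field keeps a reference to the whole document
  { name: 'lastName', doc },
];

let message = '';
try {
  JSON.stringify(fields); // follows field -> doc -> context -> trailerInfo -> ID -> context
} catch (err) {
  message = err.message;
}
console.log(message);

// Copying only plain values into new objects avoids the loop entirely.
const plain = fields.map((f) => ({ name: f.name }));
console.log(JSON.stringify(plain));
```

So the circularity would come from what each field points at, not from the fields referring to each other.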

Would something like this solution actually remove important data?

influential-eliot commented 4 days ago

I have removed the circular references, using this solution.

@ddtbuilder, you may wish to check out my final code at the end of this.

However, I can't really mark this as a workaround or fixed, because I don't know what damage deleting the circular references might do to the data.

Sorry to bother you, @Hopding, but do you know if this will cause data loss?
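
One thing that can be checked without pdf-lib: a replacer that discards every object it has already seen will also drop repeated references that are not circular at all. The sketch below re-creates that cache-based idea under a hypothetical name, `stringifyDroppingSeen`, and runs it on a tiny object where two keys share one value.

```javascript
// Same idea as the cache-based replacer: drop any object that was seen before.
function stringifyDroppingSeen(obj) {
  const seen = [];
  return JSON.stringify(obj, (key, value) => {
    if (typeof value === 'object' && value !== null) {
      if (seen.includes(value)) return undefined; // seen before: key is omitted
      seen.push(value);
    }
    return value;
  });
}

// `shared` is referenced twice, but there is no cycle anywhere in `data`.
const shared = { x: 1 };
const data = { a: shared, b: shared };

const out = JSON.parse(stringifyDroppingSeen(data));
console.log(out); // `a` survives, `b` is silently dropped
```

So yes, this kind of replacer can lose data whenever two fields point at the same object. Whether that actually happens here depends on what is left on each field after `doc` and `acroField` are stripped, which I have not verified.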


Serverless Apps For Flatten & Fields To JSON

Either way, I have two separate (would like to combine) serverless apps running, flatten and fields, which respond to HTTP requests for either. The key thing was to ensure that I wouldn't upset the unpkg & jsdelivr endpoints with import requests, and also keep the speed fast on execution. You can store both on the same storage solution, and just keep the pdf-lib file (renamed to 'index') in a folder named pdf-lib in the root of the app.

Flatten

This code will flatten a file:

//This is the pdf-lib used from: https://pdf-lib.js.org/
//This version (v1.17.1) is released under the MIT License: https://opensource.org/licenses/MIT
const PDFDocument = require('../pdf-lib').PDFDocument;

//This is the HTTP responder
module.exports = async function (context, req) {
    const ct = "application/pdf";
    const pdfDoc = await PDFDocument.load(req.body)
    const form = pdfDoc.getForm()
    form.flatten();
    const pdfBytes = await pdfDoc.save()

    context.res = {
        // status: 200, // Defaults to 200
        "headers": {
            "Content-Type": ct
        },
        "body": pdfBytes
    };
};

Fields

This code will retrieve all fields in the PDF that have a value:

//This is the pdf-lib used from: https://pdf-lib.js.org/
//This version (v1.17.1) is released under the MIT License: https://opensource.org/licenses/MIT
const PDFDocument = require('../pdf-lib').PDFDocument;

//This is the HTTP responder
module.exports = async function (context, req) {
    const pdfDoc = await PDFDocument.load(req.body)
    const form = pdfDoc.getForm()
    const fields = form.getFields();

    fields.forEach(field => {
          const typeOfField = field.constructor.name
          const nameOfField = field.getName()
          const fieldMaybs = form.getFieldMaybe(nameOfField)
          var hasValue = false;
          field.fieldType = typeOfField;
          field.fieldName = nameOfField;
          let value;
          if (fieldMaybs) {
            // No need for the following right now, however if this stage of the manipulation is removing them, then they are needed:
            //if ( typeOfField === 'PDFSignature' ) {}
            //if ( typeOfField === 'PDFButton' ) {} 
            if ( typeOfField === 'PDFCheckBox' ) {
                hasValue = true;
                value = form.getCheckBox(nameOfField).isChecked();
            } else if ( typeOfField === 'PDFDropdown' ) {
                if ( form.getDropdown(nameOfField).getSelected()?.length ) {
                    hasValue = true;
                    value = form.getDropdown(nameOfField).getSelected();
                }; 
            } else if ( typeOfField === 'PDFOptionList' ) {
                if ( form.getOptionList(nameOfField).getSelected()?.length ) {
                    hasValue = true;
                    value = form.getOptionList(nameOfField).getSelected();
                };
            } else if ( typeOfField === 'PDFRadioGroup' ) {
                // THIS MAY REQUIRE A LOT MORE COMPLEXITY UNLESS THAT CAN BE OFFLOADED TO THE CLIENT
                // getSelected() returns undefined when no option is selected, so guard with ?.
                if ( form.getRadioGroup(nameOfField).getSelected()?.length ) {
                    hasValue = true;
                    value = form.getRadioGroup(nameOfField).getSelected();
                };
            } else if ( typeOfField === 'PDFTextField' ) {
                if ( form.getTextField(nameOfField).getText() ) {
                    hasValue = true;
                    value = form.getTextField(nameOfField).getText();
                };
            };
            field.hasValue = hasValue;
            field.fieldValue = value;
          };
        })
    const newFields = fields.filter(removeNoVals).map(({acroField, doc, ...item}) => item);

    let fieldsJson = pdfFieldsJson(newFields);
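    // Compare by field name: freshly parsed objects are always distinct in a Set, even if duplicated
    let fieldsUniq = [...new Set(fieldsJson.map(f => f.fieldName))];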
    context.res = {
        // status: 200, // Defaults to 200
        // The Counts are there for error checking in the event that either the circular reference removal (the pdfFieldsJson function) removes required data for some reason or there are duplicated fields (less of an issue)
        "headers": {
            "Content-Type": "application/json"
        },
        "body": {
            "Counts": {
                "BeforeCircular": newFields.length,
                "AfterCircular": fieldsJson.length,
                "UniqueFields": fieldsUniq.length
            },
            "fields": fieldsJson
        }
    };
};

function removeNoVals(value) {
    return value.hasValue === true;
};

function pdfFieldsJson(obj) {
  // source: https://codedamn.com/news/javascript/how-to-fix-typeerror-converting-circular-structure-to-json-in-js#the_solution
  let cache = [];
  let str = JSON.stringify(obj, function(key, value) {
    if (typeof value === "object" && value !== null) {
      if (cache.indexOf(value) !== -1) {
        // Circular reference found, discard key
        return;
      }
      // Store value in our collection
      cache.push(value);
    }
    return value;
  });
  cache = null; // reset the cache
  return JSON.parse(str);
};

I would obviously accept any/all attempts to make this more efficient and/or combine with the flatten process to reduce costs.
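
One possible simplification, offered as a sketch rather than a tested drop-in: build small plain objects straight from each field instead of mutating the field and stripping circular references afterwards. The names `valueReaders`, `hasValue` and `summarizeFields` are hypothetical. The method names (`getName`, `isChecked`, `getSelected`, `getText`) are the ones already used above, and the mock classes at the bottom only stand in for pdf-lib's so that the example runs on its own.

```javascript
// Maps a field class name to a function that reads that field's value.
const valueReaders = {
  PDFCheckBox: (f) => f.isChecked(),
  PDFDropdown: (f) => f.getSelected(),
  PDFOptionList: (f) => f.getSelected(),
  PDFRadioGroup: (f) => f.getSelected(),
  PDFTextField: (f) => f.getText(),
};

// Booleans always count as a value; strings and arrays only when non-empty.
function hasValue(value) {
  if (Array.isArray(value) || typeof value === 'string') return value.length > 0;
  return typeof value === 'boolean';
}

// Returns plain, JSON-safe objects; nothing from the document is carried along.
function summarizeFields(fields) {
  return fields
    .map((field) => {
      const fieldType = field.constructor.name;
      const read = valueReaders[fieldType];
      return {
        fieldName: field.getName(),
        fieldType,
        fieldValue: read ? read(field) : undefined,
      };
    })
    .filter((item) => hasValue(item.fieldValue));
}

// Mock classes standing in for pdf-lib's field classes (illustration only).
class PDFTextField {
  constructor(name, text) { this.name = name; this.text = text; }
  getName() { return this.name; }
  getText() { return this.text; }
}
class PDFCheckBox {
  constructor(name, checked) { this.name = name; this.checked = checked; }
  getName() { return this.name; }
  isChecked() { return this.checked; }
}
class PDFButton {
  constructor(name) { this.name = name; }
  getName() { return this.name; }
}

const summary = summarizeFields([
  new PDFTextField('first', 'Ada'),
  new PDFTextField('empty', undefined), // no text: filtered out
  new PDFCheckBox('agree', false),      // unchecked still counts as a value
  new PDFButton('submit'),              // no reader: filtered out
]);
console.log(JSON.stringify(summary));
```

In the real function this would be called as `summarizeFields(form.getFields())`. To combine the two apps, I would read the values before calling `form.flatten()`, on the assumption that flattening leaves no fields behind to read. Note that `constructor.name` only works if the class names are not mangled by minification; `instanceof` checks against the classes pdf-lib exports would be sturdier.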