Hopding / pdf-lib

Create and modify PDF documents in any JavaScript environment
https://pdf-lib.js.org
MIT License

Copying encrypted PDF results in blank pages in the new PDF document #1390

Open DebadattaMeher opened 1 year ago

DebadattaMeher commented 1 year ago

What were you trying to do?

I am trying to draw some lines in an encrypted PDF document. I tried two approaches, because the first one did not work:

  1. Drawing lines directly on the PDF document.
  2. Copying the PDF document to a new document and then drawing lines in the new document.

How did you attempt to do it?

  1. So far I've been able to draw lines on normal PDF documents using the pdf-lib npm library in a Node.js 14.x environment. With the encrypted PDF document, however, the lines are not visible or not drawn. No error is raised while the line-drawing code runs against the encrypted PDF document.

Code for the first approach is below. inputFile and outputFile are file paths, e.g. /tmp/tmpfiles/sample.pdf

const fs = require('fs');
const { PDFDocument, rgb } = require('pdf-lib');

const fileData = fs.readFileSync(inputFile);
const pdfDoc = await PDFDocument.load(fileData, { ignoreEncryption: true });
const pages = pdfDoc.getPages();
const pageIndex = 1; // There is some way that I'm getting this index. Also, the PDF document has more than pageIndex + 1 pages.

pages[pageIndex].drawLine({
  start: { x: start.x, y: start.y },
  end: { x: end.x, y: end.y },
  thickness: 2,
  color: rgb(0.5, 1, 0.5),
  opacity: 1,
});
// I added useObjectStreams: false because an issue thread suggested it for encrypted files. Without this option the generated output file cannot be opened.
const modifiedContent = await pdfDoc.save({ useObjectStreams: false });
fs.writeFileSync(outputFile, modifiedContent);

  2. So I tried an alternative: copying the encrypted PDF document to a new PDF document and drawing lines in the new document.

Code for the second approach is below.

const fs = require('fs');
const { PDFDocument } = require('pdf-lib');

const fileData = fs.readFileSync(inputFile);
const srcPdf = await PDFDocument.load(fileData, { ignoreEncryption: true });
const copiedPdf = await PDFDocument.create();
const copiedPages = await copiedPdf.copyPages(srcPdf, srcPdf.getPageIndices());
// insertPage is synchronous, so no await is needed here.
for (let i = 0; i < copiedPages.length; i++) {
  copiedPdf.insertPage(i, copiedPages[i]);
}

const copiedPdfBytes = await copiedPdf.save({ useObjectStreams: false });
fs.writeFileSync(outputFile, copiedPdfBytes);

What actually happened?

  1. When drawing lines directly on the encrypted document, the lines are not visible or not drawn in the PDF. They don't appear when the PDF is opened in any viewer (Chrome, the default macOS Preview).

  2. Below, 'srcPDF' refers to the encrypted PDF document and 'destPDF' refers to the copied PDF document.

After copying srcPDF to destPDF, the pages in destPDF are blank (white pages). destPDF contains the same number of pages as srcPDF, and its file size is the same as that of srcPDF.

What did you expect to happen?

  1. The lines should be drawn in the encrypted PDF document and visible.
  2. After copying an encrypted PDF document, the new PDF document should not contain blank pages. It should contain the actual contents of the source PDF document.

How can we reproduce the issue?

I cannot share the PDF as it contains confidential details and PII data. If possible, use an automobile policy declaration PDF from the Travelers insurance provider in the USA.

I've shared the code in the description.

Version

1.17.1

What environment are you running pdf-lib in?

Node

Additional Notes

I apologize for not being able to share the PDF sample.

KammererTob commented 1 year ago

You are experiencing the limitations of this library in terms of encrypted PDFs. See here: https://github.com/Hopding/pdf-lib#encryption-handling. You are using ignoreEncryption, but this will just "hide" the issue.
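
One rough way to tell whether a file falls under this limitation, before loading it with ignoreEncryption, is to look for an /Encrypt entry in the raw bytes. The following is only a heuristic sketch and not part of pdf-lib: looksEncrypted is a hypothetical helper name, and a plain byte search can give false positives (for example, if the text /Encrypt happens to occur inside uncompressed content).

```javascript
// Heuristic sketch (not a PDF parser): report whether raw PDF bytes appear to
// reference an /Encrypt dictionary. `looksEncrypted` is a hypothetical name.
function looksEncrypted(pdfBytes) {
  // latin1 maps every byte to exactly one character, so nothing is lost.
  const text = Buffer.from(pdfBytes).toString('latin1');
  // Require that /Encrypt is not followed by a letter, so that longer names
  // such as /EncryptMetadata do not count as a match.
  return /\/Encrypt(?![A-Za-z])/.test(text);
}
```

In Node this could be called as looksEncrypted(fs.readFileSync(inputFile)) to decide whether to warn the user instead of silently producing blank or unedited output.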

DebadattaMeher commented 1 year ago

Thanks for building this awesome library, and for maintaining and documenting it so cleanly. Will there be support for encrypted PDFs in the future?

KammererTob commented 1 year ago

I am not the original author of this package, but I've forked it and built very basic/crude decryption support: https://github.com/KammererTob/pdf-lib

Note that this is probably not bug-free and also not very efficient (it parses the PDF twice), but in my limited testing it worked for my purpose. I have not opened a PR for this because of the above-mentioned issues and the missing unit tests.

To use it, keep "ignoreEncryption: true"; it should then try to decrypt the document.

DebadattaMeher commented 1 year ago

I used the option "ignoreEncryption: true" to open the PDF document and try to draw lines on it, but I am unable to save the PDF after drawing the lines. The output PDF is unedited (the drawn lines are missing) if I use the encryption option when saving, or blank (empty pages, with exactly the same number of pages) if I save without the encryption option.

GiuseppePennisi commented 1 year ago

Hello @KammererTob, did you publish your fork somewhere?

KammererTob commented 1 year ago

@GiuseppePennisi No, I haven't published my fork anywhere. For my own project I built it once and then used the minified file directly.

Sharcoux commented 1 year ago

We'll try to add this to our fork and maintain it. Our fork already adds support for drawing SVG; see this PR. I'll let you know if I succeed.

Sharcoux commented 1 year ago

@KammererTob Can you confirm the following points with me?

Sharcoux commented 1 year ago

If I load a pdf with password, I get this error:

Uncaught (in promise) TypeError: dict.get(...) is undefined
    CipherTransformFactory crypto.js:1355
    PDFDocument PDFDocument.js:52
    load PDFDocument.js:132
    fulfilled tslib.es6.js:73
    promise callback*step tslib.es6.js:75
    __awaiter tslib.es6.js:76
    __awaiter tslib.es6.js:72
    load PDFDocument.js:124
    test test27.html:56
    onclick test27.html:1staticOnEvent

algorithm is 5 at that moment.

KammererTob commented 1 year ago

@Sharcoux It has been a while since I looked at the specs, but here are my answers off the top of my head:

  1. True. My code was only aimed at PDFs that are not password-protected. I think one issue here is that you cannot check whether there actually is a user password without first checking some encryption details in the PDF.
  2. Not sure where this is, but crypto.ts is mainly based on the Mozilla pdf.js library (https://github.com/mozilla/pdf.js/blob/master/src/core/crypto.js), so I would assume there is a good reason for doing it like this.
  3. The line in the stack trace is in the .js file, so I'm not sure where exactly this is in the .ts file.

Sharcoux commented 1 year ago

  1. I know. I added an optional parameter to load(). I just load the file, and if it requires a password, I ask for the password and load again. So this is nothing that pdf-lib itself should handle.
  2. This is line 1705 of crypto.ts. I really don't see how the current code can work if the PDF uses the same password for the user and owner passwords.
  3. This is line 1541 of crypto.ts: (dict.get(PDFName.of("EncryptMetadata")) as PDFBool).asBoolean() !== false;. There is no EncryptMetadata in the dict, apparently...

Sharcoux commented 1 year ago

I could solve the 3rd issue. Apparently:

    EncryptMetadata (Boolean): It specifies whether the document metadata (information about the document itself such as title, author, etc.) is to be encrypted. This entry is meaningful only when encrypting metadata, which is when the value of the V entry (version of the standard security handler) in the encryption dictionary is greater than or equal to 2.
        If EncryptMetadata is true (or if it's not specified), then metadata should be encrypted.
        If EncryptMetadata is false, metadata should remain in plain text.

In some cases, for certain PDF versions or types of encryption, the EncryptMetadata key might not be present in the encryption dictionary. Therefore, when reading or processing the encryption dictionary in a PDF, it's essential to account for the possibility that the EncryptMetadata key might be absent. If it's not specified, its default value should be considered as true.
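
The default described above can be sketched as follows. This is a minimal illustration rather than the actual crypto.ts code: the encryption dictionary is modelled as a plain Map from key name to JavaScript boolean, and shouldEncryptMetadata is a hypothetical helper name.

```javascript
// Sketch only: `encryptDict` is a stand-in Map, not a pdf-lib PDFDict, and
// `shouldEncryptMetadata` is a hypothetical helper name.
function shouldEncryptMetadata(encryptDict) {
  const entry = encryptDict.get('EncryptMetadata');
  // Key absent: fall back to the default (true) instead of calling a method
  // on undefined, which is what produced the TypeError in the stack trace.
  if (entry === undefined) {
    return true;
  }
  // Key present: only an explicit false turns metadata encryption off.
  return entry !== false;
}
```

With this shape, shouldEncryptMetadata(new Map()) falls back to the default, while a dictionary that sets EncryptMetadata to false opts out.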

I solved some more issues, but I'm now stuck at this:

Uncaught (in promise) Error: Expected instance of PDFDict, but got instance of undefined
    UnexpectedObjectTypeError errors.js:22
    lookup PDFContext.js:81
    lookup PDFDict.js:52
    Pages PDFCatalog.js:7
    computePages PDFDocument.js:26
    access Cache.js:11
    getPages PDFDocument.js:478
    getPageCount PDFDocument.js:462
    save PDFDocument.js:1139
    __awaiter tslib.es6.js:76
    __awaiter tslib.es6.js:72
    save PDFDocument.js:1133
    test test27.html:57

It happens as soon as I use any method from the pdfDoc generated.

The error occurs when trying to read "Pages": return this.lookup(PDFName.of('Pages'), PDFDict) as PDFPageTree;

The final error is line 166 of PDFContext.

Sharcoux commented 1 year ago

I'm done. You can try the npm package @cantoo/pdf-lib. Some details might change in the next few days, but don't hesitate to try it and report back.

chebum commented 6 months ago

PDFObjectParser.parseString method in pdf-lib doesn't handle encrypted strings correctly. Therefore strings don't get decrypted properly and decryption code by @KammererTob corrupts the document. I ported the parseString function from pdf.js (https://github.com/chebum/pdf-lib/commit/485d3523ef0b1582332aef4f768f52604f5a5dd1) - this seems to solve the problem.
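
For context, the PDF specification requires the escape sequences in a literal string to be resolved to raw bytes before those bytes are decrypted, so any mistake at that step garbles the plaintext. The sketch below illustrates only that unescaping step. It is not the code from the linked commit or from pdf.js: unescapeLiteralString is a hypothetical name, the input is assumed to be the string body without its outer parentheses and with one character per byte, and normalization of unescaped end-of-line markers is left out.

```javascript
// Sketch only: resolve the escape sequences of a PDF literal string body into
// raw bytes. `unescapeLiteralString` is a hypothetical name.
const SIMPLE_ESCAPES = new Map([
  ['n', 0x0a], ['r', 0x0d], ['t', 0x09], ['b', 0x08], ['f', 0x0c],
  ['(', 0x28], [')', 0x29], ['\\', 0x5c],
]);

function unescapeLiteralString(body) {
  const out = [];
  let i = 0;
  while (i < body.length) {
    if (body[i] !== '\\') {
      // Ordinary character: keep its byte value unchanged.
      out.push(body.charCodeAt(i) & 0xff);
      i += 1;
      continue;
    }
    const next = body[i + 1];
    if (next === undefined) break; // lone trailing backslash: drop it
    if (SIMPLE_ESCAPES.has(next)) {
      out.push(SIMPLE_ESCAPES.get(next));
      i += 2;
    } else if (next >= '0' && next <= '7') {
      // Octal escape: one to three octal digits, high-order overflow ignored.
      let value = 0;
      let j = i + 1;
      while (j < i + 4 && j < body.length && body[j] >= '0' && body[j] <= '7') {
        value = value * 8 + (body.charCodeAt(j) - 0x30);
        j += 1;
      }
      out.push(value & 0xff);
      i = j;
    } else if (next === '\r' || next === '\n') {
      // Backslash followed by an end-of-line marker is a line continuation.
      i += 2;
      if (next === '\r' && body[i] === '\n') i += 1;
    } else {
      // Unknown escape: the backslash is ignored, the next character is kept.
      i += 1;
    }
  }
  return Uint8Array.from(out);
}
```

The returned bytes, not the escaped text, are what a decryption routine would need as input.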

Sharcoux commented 6 months ago

I have had no news from Hopding for a while. We maintain the repo @cantoo/pdf-lib for now. Do you know if your change is needed there too? It seems to work on our end.

chebum commented 6 months ago

"It seems to work on our end."

@Sharcoux It seems to work for some of the encoded strings, but not for all of them: some strings don't get decoded properly. For example, check object number 60 in this document: https://github.com/chebum/pdf-lib/blob/master/assets/pdfs/encrypted_3.pdf. The T field should say "Karen Scoon".

const pdfDoc = await PDFDocument.load(
  fs.readFileSync('assets/pdfs/encrypted_3.pdf'),
  {
    parseSpeed: ParseSpeeds.Fastest,
  },
);
expect(pdfDoc).toBeInstanceOf(PDFDocument);
const originalObjects = pdfDoc.context.enumerateIndirectObjects();
const originalObj60 = originalObjects.find(obj => obj[0].objectNumber === 60) as [PDFRef, PDFDict];
expect(originalObj60[1].get(PDFName.of('T'))?.toString()).toEqual('(Karen Scoon)');