Hopding / pdf-lib

Create and modify PDF documents in any JavaScript environment
https://pdf-lib.js.org
MIT License

v1.0.0 Beta #134

Closed Hopding closed 5 years ago

Hopding commented 5 years ago

@jerp @gregbacchus @mlecoq @kevinswartz @Jonathan-Mckenzie @philipjmurphy @DanielJackson-Oslo @matthopson @ithillel-aminev @vitaly-zdanevich

Hello everybody! Today I'm excited to announce the beta release of pdf-lib v1.0.0!

You have all provided extremely valuable feedback on pdf-lib over the past year. This feedback has highlighted several architectural flaws in the design of pdf-lib (e.g. the inability to load invalid PDFs). It has also brought to my attention some critical shortcomings in pdf-lib's feature set (e.g. no methods for getting the width/height of a page).

I've responded to these issues with specific workarounds and fixes as they've been reported. But I've also been thinking about how to solve them in a more holistic way.

With this goal in mind, I've spent the past few months on a complete rewrite of pdf-lib. This rewrite is now mostly complete. All that remains is to write documentation and implement a few small features.

I'd like to ask for your help in beta testing this rewrite. I've written extensive automated and manual tests, and have verified that everything works in all major PDF readers (Acrobat, Foxit, Preview) and browsers (Chrome, Firefox, Safari). But since this rewrite is so extensive, I do not want to do a full release until others have been able to test it.


I'm working on a complete changelog, but it might be a week or two until I am able to complete it. So, in the meantime, here's a list of the main changes/improvements in v1.0.0:

The README for the rewrite can be found here: https://github.com/Hopding/pdf-lib/tree/Rewrite.

You can install the beta version of v1.0.0 with npm:

npm install --save pdf-lib@beta

or yarn:

yarn add pdf-lib@beta

It's also available on the unpkg CDN:


Those of you that intend to participate in this beta test, please post a comment in this thread to let me know! If you do not plan to participate, please just ignore this (I understand that not everybody will be able to participate).

I'd like to keep all issues and discussion pertaining to the beta test centralized in this thread. However, if you need to communicate privately with me, please feel free to email me at andrew.dillon.j@gmail.com.

I appreciate your help and am looking forward to a successful release of v1.0.0!

matthopson commented 5 years ago

I’ll start testing it in our app. Thanks for all the work on this. I’m out of the office for a few days, but hope to have some feedback to you by the end of the week.

DanielJackson-Oslo commented 5 years ago

Thank you for the work you do, @Hopding ! It really is very helpful.

I get an error in the new branch, though the code worked in the old one, and I'm not sure if it's my fault for misunderstanding the API or if something is broken:

(The code is intended to take a list of an unknown number of documents, each of unknown length, and merge them into one new PDF.)

const { PDFDocument } = require('pdf-lib')

async function mergePdfs(pdfsToMerge, filePath) {
  // pdfsToMerge is an array of filepaths pointing to PDFs generated  or downloaded
  const mergedPdf = PDFDocument.create()
  pdfsToMerge.forEach(pdfFilePath => {
    const pdf = fs.readFileSync(pdfFilePath)
    const pagesToMerge = PDFDocument.load(pdf).getPages()
    pagesToMerge.forEach(page => {
      mergedPdf.addPage(page)
    })
  })
  const mergedPdfFile = await mergedPdf.save()
  return fs.writeFileSync(filePath, mergedPdfFile)
}

Produces error PDFDocument.load(...).getPages is not a function

From the documentation at https://github.com/Hopding/pdf-lib/tree/Rewrite it seems like I'm using both PDFDocument.load() and .getPages() correctly, and that neither API has changed since 0.x.x?

Am I doing something wrong here? Willing to test more, if it helps!

mlecoq commented 5 years ago

@DanielJackson-Oslo PDFDocument.load returns a promise

const pagesToMerge = (await PDFDocument.load(pdf)).getPages()

Hopding commented 5 years ago

@DanielJackson-Oslo there are a few things that will have to be changed in your code for v1.0.0:

I've modified the snippet you shared to work in v1.0.0. It ~should~ does work fine ~(though I haven't actually tested it :smile:)~. Let me know if you have any trouble and I'll get it fixed.

async function mergePdfs(pdfsToMerge: string[], filePath: string) {
  const mergedPdf = await PDFDocument.create();
  for (const pdfFilePath of pdfsToMerge) {
    const pdfBytes = fs.readFileSync(pdfFilePath);
    const pdf = await PDFDocument.load(pdfBytes);
    const pageIndices = Array.from(pdf.getPages().keys());
    const copiedPages = await mergedPdf.copyPages(pdf, pageIndices);
    copiedPages.forEach((page) => {
      mergedPdf.addPage(page);
    });
  }
  const mergedPdfFile = await mergedPdf.save();
  return fs.writeFileSync(filePath, mergedPdfFile);
}

If you're able to, I'd be interested to know whether or not the merged PDF files you produce are smaller in v1.0.0 than in v0.x.x!

mlecoq commented 5 years ago

@Hopding the migration was a bit difficult due to all the changes in API.

But the new one is easier to use and understand.

I am testing right now. Everything is fine for the moment, all my tests passed successfully.

To get the CropBox or MediaBox in all cases, I have replaced this piece of code:

const cropBox =
  page.getMaybe('CropBox') ||
  page.getMaybe('MediaBox') ||
  (pdfDoc.catalog.Pages &&
    (pdfDoc.catalog.Pages.getMaybe('CropBox') || pdfDoc.catalog.Pages.getMaybe('MediaBox')));

(I had to look into the catalog for some PDFs.)

with this:

const cropBox = page.node.CropBox() || page.node.MediaBox();

I assume that all cases are covered with it.

As I remarked before, performance is much better.

Hopding commented 5 years ago

@mlecoq I'm glad to hear the API is simpler to use and that the performance has improved! I'm hoping to write up a migration guide for the full release. If there was anything in particular that caused you trouble during the migration, please let me know so I can address it in the migration guide.

As far as the code for obtaining a page's CropBox and/or MediaBox: Yes, the new API has the page.node.CropBox() and page.node.MediaBox() methods that will return the correct values in all cases.

That being said, I'm curious what exactly you are using the raw CropBox and MediaBox values for. These are fairly low-level (which is why they're on page.node instead of page directly), and while there's nothing wrong with using them, I'm hoping that the high-level API will support most use cases without requiring users to dive into the guts of things. Do you think there's an opportunity here to modify/enhance the high-level API?

mlecoq commented 5 years ago

I work with plans and need to draw circles at given coordinates. These coordinates are expressed as a ratio of the whole document size (all pages), so I need to recalculate the real position on each page. For example, if my y coordinate is 0.7, then to figure out which page the circle belongs on, I need to add up the heights of all pages, multiply that total by 0.7, and find the page that contains the result.

I tried using getWidth(), getHeight(), getX() and getY(). The returned x and y were 0 even though that was not the origin of the document. I need the crop box to get the real origin. Without the crop box, all the circles were shifted.
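
The ratio-to-page calculation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: locateOnPages and PagePosition are hypothetical names, the page heights are assumed to have been collected beforehand (for example from getHeight()), and the offset is measured from the top of the stacked pages. Converting that offset into PDF coordinates and adding the crop box origin is left out.

```typescript
// Hypothetical helper (not part of pdf-lib): maps a ratio in [0, 1] of the
// total stacked document height to a page index and an offset in that page.
interface PagePosition {
  pageIndex: number; // zero-based index of the page containing the point
  offsetFromTop: number; // distance from the top edge of that page
}

function locateOnPages(pageHeights: number[], yRatio: number): PagePosition {
  if (pageHeights.length === 0) throw new Error('no pages');
  if (yRatio < 0 || yRatio > 1) throw new RangeError('yRatio must be within [0, 1]');

  // Add up the heights of all pages, then scale by the ratio.
  const totalHeight = pageHeights.reduce((sum, h) => sum + h, 0);
  let remaining = totalHeight * yRatio;

  // Walk the pages, subtracting each height until the point falls inside one.
  for (let i = 0; i < pageHeights.length; i++) {
    const isLastPage = i === pageHeights.length - 1;
    if (remaining < pageHeights[i] || isLastPage) {
      return { pageIndex: i, offsetFromTop: remaining };
    }
    remaining -= pageHeights[i];
  }
  throw new Error('unreachable');
}
```

A ratio of exactly 1 is assigned to the bottom edge of the last page rather than to a page past the end.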

kevinswartz commented 5 years ago

I'm definitely interested in doing some testing with the beta. I'm out of the office this week, but I hope to have some time to get into this next week. Thanks for your work here, it's great!

pjm4 commented 5 years ago

Hi @Hopding - that's great to hear. Installing now and will begin some preliminary tests against the PDFs that I have.

pjm4 commented 5 years ago

Here's a brain dump of what I have so far (I know you've mentioned some of these already, but I've included them for completeness):

This is where things came to a halt, as pdfDoc.embedJpg threw an exception:

  let mediaBuffer = await fetch(mediaUrl).then(r => r.arrayBuffer()).then(buffer => new Uint8Array(buffer))
  const [newMedia, dims] = await pdfDoc.embedJpg(mediaBuffer)

Unhandled Rejection (TypeError): arr[Symbol.iterator] is not a function.

Until I figured out that I needed to re-write the above as

const newMedia = await pdfDoc.embedJpg(mediaBuffer)
// newMedia.width, newMedia.height can be used instead of dims object.

Also, the AcroForm functions that you gave me don't work any more, as there is no getMaybe function on the catalog. So far, this is the only blocker that I have.

That's it for the moment. Thanks

cjblb19 commented 5 years ago

This is a great package; however, it seems the latest code fails to merge PDFs on the client side (tested in Chrome). I changed the merge code above to the following:

function readFileAsync(file) {
  return new Promise((resolve, reject) => {
    let reader = new FileReader();
    reader.onload = () => {
      resolve(reader.result);
    };
    reader.onerror = reject;
    reader.readAsArrayBuffer(file);
  })
}

async function mergePdfs(pdfsToMerge) {
  const mergedPdf = await PDFDocument.create()
  for (const pdfCopyDoc of pdfsToMerge) {
    const pdfBytes = await readFileAsync(pdfCopyDoc);
    const pdf = await PDFDocument.load(pdfBytes)
    const copiedPages = await mergedPdf.copyPages(pdf, pdf.getPages().keys())
    copiedPages.forEach(page => {
      mergedPdf.addPage(page)
    })
  }
  const mergedPdfFile = await mergedPdf.save()
  return mergedPdfFile
}

But all I get back is a blank page. Is anything missing?

I also have problems loading some PDFs downloaded from the web, like this example, and I'm wondering if you can tell me why. They can be opened in Chrome, and as far as I can tell they shouldn't be encrypted.

Hopding commented 5 years ago

@philipjmurphy Yes, as you noted, the pdfDoc.embedPng and pdfDoc.embedJpg methods now return PDFImage objects which have the width and height of the image as properties. You can also scale down the width and height by a constant factor using the image.scale method, e.g.

const aBigImage = await pdfDoc.embedPng(aBigImageBytes)
const { width, height } = aBigImage.scale(0.25)

Regarding the missing getMaybe method: You should be able to replace all getMaybe() calls from v0.x.x with get() in v1.0.0. In v0.x.x, calls to get() would throw an error if the property was missing (hence the getMaybe() method). But in v1.0.0 get() will simply return undefined for missing properties.

Hopding commented 5 years ago

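@cjblb19 It sounds like you are having two distinct issues:

  • You're able to merge PDFs without any errors, but the resulting mergedPdf only contains a single blank page
  • You're unable to load certain PDFs due to an encryption error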

Is this summary correct?


I looked at the PDF file you shared, and it is actually encrypted. You can verify this with Acrobat Reader:

Screen Shot 2019-07-08 at 8 19 51 PM

This is understandably confusing. After all, how can a document be encrypted if you're able to open it in your browser without entering a password? You can do this because the PDF specification actually defines a default password:

Screen Shot 2019-07-08 at 8 24 47 PM

This default password is what Chrome (and other readers) use to decrypt the PDF - thus allowing you to view an encrypted PDF without entering a password.

Encrypted documents are one of pdf-lib's weak points. It turns out that there are a lot more encrypted PDFs out there than you might think. But you'd never know, because many of them use this default password. Support for PDF decryption is high on my list of features/enhancements. However, I do not plan to include it in the v1.0.0 release.


If anybody would like to work with me to help add support for PDF decryption to pdf-lib, I'd greatly appreciate it! There's lots of work that can be done to make pdf-lib even better than it already is. But the pace of feature development currently depends primarily on how much time I have available to devote to it.
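
As a rough way to spot such files before loading them, the sketch below scans the raw bytes for the /Encrypt key that the PDF specification places in the trailer dictionary of an encrypted document. This is a heuristic only, written under assumptions: looksEncrypted is a hypothetical helper name, it is not part of pdf-lib, it does not decrypt anything, and it can report a false positive if the same bytes happen to appear elsewhere in the file.

```typescript
// Hypothetical helper (not part of pdf-lib). Heuristic: an encrypted PDF
// carries an "/Encrypt" entry in its trailer dictionary, so a plain byte
// scan for that name flags most encrypted files.
function looksEncrypted(pdfBytes: Uint8Array): boolean {
  // Byte values of the ASCII name we are searching for.
  const needle = Array.from('/Encrypt', (ch) => ch.charCodeAt(0));
  for (let i = 0; i + needle.length <= pdfBytes.length; i++) {
    if (needle.every((byte, j) => pdfBytes[i + j] === byte)) return true;
  }
  return false;
}
```

A caller could use the result to show a clearer message to the user instead of surfacing a load error.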

Hopding commented 5 years ago

@cjblb19 Regarding your problems merging PDFs: Can you please share an example document or two that I can use to reproduce the issue?

cjblb19 commented 5 years ago

@Hopding Thanks for your help. The files below work on the previous version: facebookiq_millennials_money_january2016.pdf, iOS_Security_Guide.pdf

cjblb19 commented 5 years ago

@Hopding You're right about that encrypted PDF. Is a callback available to decrypt the data? As that file uses AES encryption, maybe the WebCrypto API could be used if the encryption method is known.

pjm4 commented 5 years ago

Hi @Hopding - pdfDoc.catalog.get('AcroForm') returns undefined.

image

Hopding commented 5 years ago

@philipjmurphy You’ll need to do this:

pdfDoc.catalog.get(PDFName.of('AcroForm'));

v0.x.x converts the strings passed to get and getMaybe to PDFName objects. But in v1.0.0 you must always pass actual PDFName objects.

Hopding commented 5 years ago

@cjblb19 @DanielJackson-Oslo There was a bug in the original PDF merging snippet I posted:

// Doesn't work
mergedPdf.copyPages(pdf, pdf.getPages().keys());

// Does work
mergedPdf.copyPages(pdf, Array.from(pdf.getPages().keys()));

That's what I get for posting code without fully testing it 🙄. Sorry about that!

Here's the complete working version of the mergePdfs function:

async function mergePdfs(pdfsToMerge: string[]) {
  const mergedPdf = await PDFDocument.create();
  for (const pdfCopyDoc of pdfsToMerge) {
    const pdfBytes = fs.readFileSync(pdfCopyDoc);
    const pdf = await PDFDocument.load(pdfBytes);
    const pageIndices = Array.from(pdf.getPages().keys());
    const copiedPages = await mergedPdf.copyPages(pdf, pageIndices);
    copiedPages.forEach((page) => {
      mergedPdf.addPage(page);
    });
  }
  const mergedPdfFile = await mergedPdf.save();
  return mergedPdfFile;
}

I'll update the original snippet to avoid any future confusion if others come across this thread. I might also add a method to PDFDocument that returns Array.from(this.getPages().keys()), just because it's a bit of an awkward snippet. It would be much easier for users to call pdfDoc.getPageIndices() or pdfDoc.getPageRange().
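
To illustrate the difference between the two calls above using plain arrays: Array.prototype.keys() returns an iterator rather than an array, and Array.from turns it into a real array of indices. The sketch below does not use pdf-lib. The pages array is a stand-in for the result of getPages(), and pageIndicesOf is a hypothetical helper along the lines of the proposed method.

```typescript
// Stand-in for the array returned by pdf.getPages().
const pages = ['page A', 'page B', 'page C'];

// keys() yields the indices lazily; the result is an iterator, not an array.
const iterator = pages.keys();
const isArrayBefore = Array.isArray(iterator);

// Hypothetical helper mirroring the snippet Array.from(this.getPages().keys()).
function pageIndicesOf<T>(items: T[]): number[] {
  return Array.from(items.keys());
}

// A real array of indices, one per page.
const indices = pageIndicesOf(pages);
const isArrayAfter = Array.isArray(indices);
```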

thommath commented 5 years ago

@Hopding Amazing library. I had almost lost hope of generating PDFs in the frontend before I found it. The beta looks very good, though I miss the API reference you have in master; it was nice. I can report that the beta works in Firefox for generating PDFs using file-saver, verified with the Firefox and IE viewers. Very good job, thank you for all your work!

DanielJackson-Oslo commented 5 years ago

@Hopding Tested your code now, and it works beautifully. Also learned about the for-of loop! Looks tasty. Thank you for the time you spend on this!

The size difference is not substantial. I just did an export of 61 PDFs before and after the 1.0.0 branch, and the sizes of all 61 PDFs are:

Before: 10,6 MB (10 469 458 bytes total)
After 1.0.0: 10,2 MB (10 058 010 bytes total)

I also have some issues with fonts in one of the PDFs, but on both branches. Will try to experiment a bit more to see if the problem is on my end before I open a separate issue.

Hopding commented 5 years ago

I just released 1.0.0-beta.3. It includes a couple new features (page content translation and the option to save PDFs as base64 strings or data URIs) and a bug fix (https://github.com/Hopding/pdf-lib/issues/135). Here's the full diff: https://github.com/Hopding/pdf-lib/compare/ce12061..d123b59

It's the latest @beta tag. You can also install it explicitly with npm:

npm install --save pdf-lib@1.0.0-beta.3

or yarn:

yarn add pdf-lib@1.0.0-beta.3

It's also available on the unpkg CDN:

pjm4 commented 5 years ago

Hi @Hopding - have been testing over the last few evenings and it is working very well. I've mainly been testing AcroForm/AcroField manipulation which works perfectly. New API is very clean and easy to use. Thanks

Hopding commented 5 years ago

Hello again everybody! I've just finished creating a project site for pdf-lib: https://pdf-lib.js.org/. This site also includes API docs for v1.0.0: https://pdf-lib.js.org/docs/api/ (this should be of interest to you @thommath). I'd greatly appreciate everybody's feedback on the site and the API docs.

Also, I'm planning to close out the beta test in a week or so unless any issues are discovered. Once I close out beta, I'll switch the master branch to track Rewrite and cut the official v1.0.0 release! Thanks again to everybody for helping test things out.

Hopding commented 5 years ago

@philipjmurphy Would you be interested in writing up a section for the README (in the Rewrite branch) on migrating from v0.x.x to v1.0.0? I think it would be best for a user of pdf-lib to write this up, as I am more likely to miss small details that could cause frustration to users.

And no worries if you're unable to do this right now. But please do let me know either way 😄

pjm4 commented 5 years ago

@Hopding Hi I can try to put together what I mentioned above. I just have the notes that I posted here. I'll put it together as best as I can where you can add to it if I've missed other stuff.

Hopding commented 5 years ago

v1.0.0 is officially released! The full release notes are available here.

You can install it with npm:

npm install --save pdf-lib@1.0.0

or yarn:

yarn add pdf-lib@1.0.0

It's also available on the unpkg CDN:

Thanks again everybody for your help!

sgt-madcap commented 4 years ago

I can't find any example of how to merge pdfs in browser on frontend. Sorry for asking but what should be inside of pdfsToMerge ? Is it an array with urls of pdf files ?

Hopding commented 4 years ago

@sgt-madcap Here's a modified version of the example I shared in https://github.com/Hopding/pdf-lib/issues/252#issuecomment-566063380 that demonstrates how to merge two PDFs into a single document:

const mergedPdf = await PDFDocument.create();

const url1 = 'https://pdf-lib.js.org/assets/with_update_sections.pdf';
const url2 = 'https://pdf-lib.js.org/assets/with_large_page_count.pdf';

const pdfABytes = await fetch(url1).then(res => res.arrayBuffer());
const pdfBBytes = await fetch(url2).then(res => res.arrayBuffer());

const pdfA = await PDFDocument.load(pdfABytes);
const pdfB = await PDFDocument.load(pdfBBytes);

const copiedPagesA = await mergedPdf.copyPages(pdfA, pdfA.getPageIndices());
copiedPagesA.forEach((page) => mergedPdf.addPage(page));

const copiedPagesB = await mergedPdf.copyPages(pdfB, pdfB.getPageIndices());
copiedPagesB.forEach((page) => mergedPdf.addPage(page));

const mergedPdfFile = await mergedPdf.save();

ravivarma4003 commented 3 years ago

Unhandled Rejection (TypeError): pdfDoc.embedJpg is not a function

(anonymous function)
D:/bugsynext/src/components/employees/employeeProfile/EmployeeImmigrationWF/DocumentsUploaded/DocumentsUpload.jsx:148

  145 |
  146 | // Load a PDFDocument from the existing PDF bytes
  147 | const pdfDoc = PDFDocument.load(existingPdfBytes)
> 148 | const jpgImage = pdfDoc.embedJpg(jpgImageBytes)
      | ^
  149 | const pngImage = pdfDoc.embedPng(pngImageBytes)
  150 |

ravivarma4003 commented 3 years ago

@Hopding @matthopson @thommath @kevinswartz @mlecoq @philipjmurphy please help me out with this
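
A likely cause, going by what was said earlier in this thread: PDFDocument.load returns a promise, and line 147 of the trace calls it without await, so pdfDoc is a promise and has no embedJpg method. The sketch below reproduces that pattern without pdf-lib. FakeDocument is a hypothetical stand-in class whose load and embedJpg are asynchronous, and demo is an illustrative name.

```typescript
// Hypothetical stand-in for a document class with an async load(), as
// PDFDocument.load is described earlier in this thread. Not pdf-lib itself.
class FakeDocument {
  static async load(bytes: Uint8Array): Promise<FakeDocument> {
    return new FakeDocument(bytes.length);
  }
  private constructor(readonly byteLength: number) {}
  async embedJpg(jpgBytes: Uint8Array): Promise<{ size: number }> {
    return { size: jpgBytes.length };
  }
}

async function demo(): Promise<{ withoutAwait: string; withAwait: string; size: number }> {
  const bytes = new Uint8Array([1, 2, 3]);

  // Without await, the variable holds a Promise, which has no embedJpg method.
  const notAwaited: unknown = FakeDocument.load(bytes);
  const withoutAwait = typeof (notAwaited as { embedJpg?: unknown }).embedJpg;

  // With await, the resolved document exposes the method.
  const doc = await FakeDocument.load(bytes);
  const withAwait = typeof doc.embedJpg;

  // The embed call is asynchronous as well, so it is awaited too.
  const image = await doc.embedJpg(new Uint8Array(5));

  return { withoutAwait, withAwait, size: image.size };
}
```

Applied to the trace above, this would mean adding await to the PDFDocument.load call on line 147 and to the embedJpg and embedPng calls on lines 148 and 149, inside an async function.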

sudhakar-selva commented 1 year ago

If anybody would like to work with me to help add support for PDF decryption to pdf-lib, I'd greatly appreciate it!

@Hopding is this still open :-)? Any guidelines for starters, please (I'm new to the PDF spec)? I was looking into this doc.

Alokkumar8 commented 1 year ago

You made that comment in 2019. It's 2023. @Hopding, have you added support for default-password PDFs?

alexeysergeev-cm commented 3 months ago

@cjblb19 It sounds like you are having two distinct issues:

  • You're able to merge PDFs without any errors, but the resulting mergedPdf only contains a single blank page
  • You're unable to load certain PDFs due to an encryption error

Is this summary correct?

I looked at the PDF file you shared, and it is actually encrypted. You can verify this with Acrobat Reader:

Screen Shot 2019-07-08 at 8 19 51 PM

This is understandably confusing. After all, how can a document be encrypted if you're able to open it in your browser without entering a password? You can do this because the PDF specification actually defines a default password:

Screen Shot 2019-07-08 at 8 24 47 PM

This default password is what Chrome (and other readers) use to decrypt the PDF - thus allowing you to view an encrypted PDF without entering a password.

Encrypted documents are one of pdf-lib's weak points. It turns out that there are a lot more encrypted PDFs out there than you might think. But you'd never know, because many of them use this default password. Support for PDF decryption is high on my list of features/enhancements. However, I do not plan to include it in the v1.0.0 release.

If anybody would like to work with me to help add support for PDF decryption to pdf-lib, I'd greatly appreciate it! There's lots of work that can be done to make pdf-lib even better than it already is. But the pace of feature development currently depends primarily on how much time I have available to devote to it.

Hi, is there a way to decrypt the default password?