Closed Hopding closed 5 years ago
I’ll start testing it in our app. Thanks for all the work on this. I’m out of the office for a few days, but hope to have some feedback to you by the end of the week.
Thank you for the work you do, @Hopding ! It really is very helpful.
I get an error in the new branch, though the code worked in the old one, and I'm not sure if it's my fault for misunderstanding the API or if something is broken:
(The code is intended to take a list of unknown documents of unknown length and merge it into one new PDF)
const { PDFDocument } = require('pdf-lib')

async function mergePdfs(pdfsToMerge, filePath) {
  // pdfsToMerge is an array of filepaths pointing to PDFs generated or downloaded
  const mergedPdf = PDFDocument.create()
  pdfsToMerge.forEach(pdfFilePath => {
    const pdf = fs.readFileSync(pdfFilePath)
    const pagesToMerge = PDFDocument.load(pdf).getPages()
    pagesToMerge.forEach(page => {
      mergedPdf.addPage(page)
    })
  })
  const mergedPdfFile = await mergedPdf.save()
  return fs.writeFileSync(filePath, mergedPdfFile)
}
This produces the error: PDFDocument.load(...).getPages is not a function
From the documentation on https://github.com/Hopding/pdf-lib/tree/Rewrite it seems like I'm using both PDFDocument.load() and .getPages() correctly, and that neither has changed its API since 0.x.x?
Am I doing something wrong here? Willing to test more, if it helps!
@DanielJackson-Oslo PDFDocument.load returns a promise
const pagesToMerge = (await PDFDocument.load(pdf)).getPages()
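To make the failure mode concrete, here is a minimal sketch that needs no PDF at all. `loadDocument` is a hypothetical stand-in for an async loader such as `PDFDocument.load`; it is not part of pdf-lib. Calling a method directly on the returned promise fails because the promise itself has no `getPages`; the method only exists on the resolved value.

```javascript
// Hypothetical stand-in for an async loader such as PDFDocument.load (not pdf-lib itself).
async function loadDocument() {
  return { getPages: () => ['page 1', 'page 2'] };
}

// Without await we are holding a Promise, which has no getPages method.
const pending = loadDocument();
const methodOnPromise = typeof pending.getPages; // 'undefined'

// With await we get the resolved document, and the method is available.
async function countPages() {
  const doc = await loadDocument();
  return doc.getPages().length;
}
```

Note that `await` is only valid inside an `async` function (or at the top level of a module), which is why a plain `forEach` callback is an awkward place for it.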
@DanielJackson-Oslo there are a few things that will have to be changed in your code for v1.0.0:
- PDFDocument.create() and PDFDocument.load(...) are now async (they return promises). The purpose of this change is to avoid blocking the event loop (especially for browser-based usage). What this means is that you'll have to await them (or use promise chaining: .then(res => ...)). Also note that if you aren't running this code client-side and are not concerned about blocking the event loop, you can speed up parsing times with: PDFDocument.load(..., { parseSpeed: ParseSpeeds.Fastest }). You can do a similar thing for save times: PDFDocument.save({ objectsPerTick: Infinity }) (I have yet to make a SaveSpeeds enum).
- You now need to use destPdf.copyPages(srcPdf, srcPageIndexesArray) to copy pages from one document to another. You can see an example of this in the Copy Pages usage example (I've also modified your code to do this). Admittedly, this API is slightly less ergonomic than what exists in v0.x.x. But it has two key benefits:
... PDFDocument.addPage and PDFDocument.insertPage async.
I've modified the snippet you shared to work in v1.0.0. It ~should~ does work fine ~(though I haven't actually tested it :smile:)~. Let me know if you have any trouble and I'll get it fixed.
async function mergePdfs(pdfsToMerge: string[], filePath: string) {
  const mergedPdf = await PDFDocument.create();
  for (const pdfFilePath of pdfsToMerge) {
    const pdfBytes = fs.readFileSync(pdfFilePath);
    const pdf = await PDFDocument.load(pdfBytes);
    const pageIndices = Array.from(pdf.getPages().keys());
    const copiedPages = await mergedPdf.copyPages(pdf, pageIndices);
    copiedPages.forEach((page) => {
      mergedPdf.addPage(page);
    });
  }
  const mergedPdfFile = await mergedPdf.save();
  return fs.writeFileSync(filePath, mergedPdfFile);
}
If you're able to, I'd be interested to know whether or not the merged PDF files you produce are smaller in v1.0.0 than in v0.x.x!
@Hopding the migration was a bit difficult due to all the changes in API.
But the new one is easier to use and understand.
I am testing right now. Everything is fine for the moment, all my tests passed successfully.
To get the crop box or media box in all cases, I have replaced this piece of code (I had to look into the catalog for some PDFs):
const cropBox =
  page.getMaybe('CropBox') ||
  page.getMaybe('MediaBox') ||
  (pdfDoc.catalog.Pages &&
    (pdfDoc.catalog.Pages.getMaybe('CropBox') || pdfDoc.catalog.Pages.getMaybe('MediaBox')));
with this:
const cropBox = page.node.CropBox() || page.node.MediaBox();
I assume that all cases are covered with it.
As I remarked before, performance is much better.
@mlecoq I'm glad to hear the API is simpler to use and that the performance has improved! I'm hoping to write up a migration guide for the full release. If there was anything in particular that caused you trouble during the migration, please let me know so I can address it in the migration guide.
As far as the code for obtaining a page's CropBox and/or MediaBox: Yes, the new API has the page.node.CropBox() and page.node.MediaBox() methods that will return the correct values in all cases.
That being said, I'm curious what exactly you are using the raw values of CropBox and MediaBox for? These are fairly low-level (hence why they're on page.node instead of page directly), and while there's nothing wrong with using them, I'm hoping that the high-level API will support most use cases without requiring users to dive into the guts of things. Do you think there's an opportunity here to modify/enhance the high-level API?
I work with plans and I need to place circles at given coordinates. These coordinates are ratios of the whole document size (all pages), so I need to recalculate the real positions on the pages. For example, if my y coordinate is 0.7, then to figure out which page the circle must be placed on, I need to add up the heights of all pages, multiply this total by 0.7, and find the page that contains the result.
I tried to use getWidth(), getHeight(), getX() and getY(). The returned x and y were 0 even though that was not the origin of the document. I need the crop box to get the real origin. Without using the crop box, all the circles were shifted.
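The page lookup described above can be sketched as a small pure function. This is only an illustration under stated assumptions: `locateOnPages` is a hypothetical helper, the page heights are passed in as plain numbers (in practice they might come from each page's crop box), the ratio is assumed to be measured from the top of the first page, and the array is assumed to be non-empty.

```javascript
// Map a ratio of the total document height (0 = top of the first page,
// 1 = bottom of the last page) to a page index and an offset from the top
// of that page. pageHeights is a non-empty array of numbers (an assumption;
// this does not call pdf-lib).
function locateOnPages(pageHeights, yRatio) {
  const totalHeight = pageHeights.reduce((sum, h) => sum + h, 0);
  const target = totalHeight * yRatio;
  let top = 0; // distance from the document top to the top of the current page
  for (let i = 0; i < pageHeights.length; i++) {
    if (target < top + pageHeights[i]) {
      return { pageIndex: i, offsetFromTop: target - top };
    }
    top += pageHeights[i];
  }
  // yRatio === 1 (or rounding) lands on the bottom edge of the last page.
  const last = pageHeights.length - 1;
  return { pageIndex: last, offsetFromTop: pageHeights[last] };
}

const hit = locateOnPages([100, 200, 100], 0.5); // { pageIndex: 1, offsetFromTop: 100 }
```

Because PDF user space normally has its origin at the bottom-left with y increasing upward, the offset from the top still has to be converted using the crop box, roughly `y = cropBoxY + cropBoxHeight - offsetFromTop`, which is consistent with the shifted circles reported when the crop box origin was ignored.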
I'm definitely interested in doing some testing with the beta. I'm out of the office this week, but I hope to have some time to get into this next week. Thanks for your work here, it's great!
Hi @Hopding - that's great to hear. Installing now and will begin some preliminary tests against the PDF's that I have.
Here's a brain dump of what I have so far ( I know you've mentioned some of these already, but I've included them for completeness):
Needed to rename PDFDocumentFactory to PDFDocument. As it is now a promise, I need to await it, as well as the getPages() method.
I used to use your custom getDimentions method, but this now fails with page.getMaybe('MediaBox') is not a function. However, there is no need for it any more, as the page object already has the page dimensions.
Changed pdfDoc.createPage([width, height]) to pdf.addPage()
Rewrote let pdfDocBytes = PDFDocumentWriter.saveToBytes(pdfDoc) as let pdfDocBytes = await pdfDoc.save()
Renamed pdfDoc.embedPNG to pdfDoc.embedPng and pdfDoc.embedJPG to pdfDoc.embedJpg
This is where things came to a halt, as pdfDoc.embedJpg threw an exception:
let mediaBuffer = await fetch(mediaUrl).then(r => r.arrayBuffer()).then(buffer => new Uint8Array(buffer))
const [newMedia, dims] = await pdfDoc.embedJpg(mediaBuffer)
Unhandled Rejection (TypeError): arr[Symbol.iterator] is not a function.
Until I figured out that I needed to re-write the above as
const newMedia = await pdfDoc.embedJpg(mediaBuffer)
// newMedia.width, newMedia.height can be used instead of dims object.
Also, the AcroForm functions that you gave me don't work any more, as there is no getMaybe function on the catalog. So, this is the only blocker that I have.
That's it for the moment. Thanks
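The `arr[Symbol.iterator] is not a function` error from the embedJpg step is what array destructuring produces when the right-hand side is not iterable. A minimal sketch, using a hypothetical plain object in place of the value returned by embedJpg in v1.0.0:

```javascript
// Hypothetical stand-in for the object returned by embedJpg in v1.0.0:
// a single object with width and height properties, not an [image, dims] pair.
const embedded = { width: 640, height: 480 };

// Array destructuring requires an iterable, so a plain object throws a TypeError.
function canArrayDestructure(value) {
  try {
    const [first] = value;
    return true;
  } catch (err) {
    if (err instanceof TypeError) return false;
    throw err;
  }
}

const oldStyleWorks = canArrayDestructure(embedded); // false

// Reading the dimensions as properties works instead.
const { width: imgWidth, height: imgHeight } = embedded;
```

The exact wording of the TypeError depends on the runtime and on whether the code was transpiled, which is why the message in the report mentions `arr[Symbol.iterator]`.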
This is a great package; however, it seems the latest code fails to merge PDFs on the client side (tested on Chrome). I changed the merge code above to this instead:
function readFileAsync(file) {
  return new Promise((resolve, reject) => {
    let reader = new FileReader();
    reader.onload = () => {
      resolve(reader.result);
    };
    reader.onerror = reject;
    reader.readAsArrayBuffer(file);
  })
}

async function mergePdfs(pdfsToMerge) {
  const mergedPdf = await PDFDocument.create()
  for (const pdfCopyDoc of pdfsToMerge) {
    const pdfBytes = await readFileAsync(pdfCopyDoc);
    const pdf = await PDFDocument.load(pdfBytes)
    const copiedPages = await mergedPdf.copyPages(pdf, pdf.getPages().keys())
    copiedPages.forEach(page => {
      mergedPdf.addPage(page)
    })
  }
  const mergedPdfFile = await mergedPdf.save()
  return mergedPdfFile
}
But all I get back is a blank page. Is anything missing?
I also have problems loading some PDFs downloaded from the web, like this example, and I'm wondering if you can tell me why. They can be opened in Chrome, and as far as I can tell they shouldn't be encrypted.
@philipjmurphy Yes, as you noted, the pdfDoc.embedPng and pdfDoc.embedJpg methods now return PDFImage objects which have the width and height of the image as properties. You can also scale down the width and height by a constant factor using the image.scale method, e.g.
const aBigImage = await pdfDoc.embedPng(aBigImageBytes)
const { width, height } = aBigImage.scale(0.25)
Regarding the missing getMaybe method: You should be able to replace all getMaybe() calls from v0.x.x with get() in v1.0.0. In v0.x.x, calls to get() would throw an error if the property was missing (hence the getMaybe() method). But in v1.0.0 get() will simply return undefined for missing properties.
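@cjblb19 It sounds like you are having two distinct issues:
- You're able to merge PDFs without any errors, but the resulting mergedPdf only contains a single blank page
- You're unable to load certain PDFs due to an encryption error
Is this summary correct?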
I looked at the PDF file you shared, and it is actually encrypted. You can verify this with Acrobat Reader:
This is understandably confusing. After all, how can a document be encrypted if you're able to open it in your browser without entering a password? You can do this because the PDF specification actually defines a default password:
This default password is what Chrome (and other readers) use to decrypt the PDF - thus allowing you to view an encrypted PDF without entering a password.
Encrypted documents are one of pdf-lib's weak points. It turns out that there are a lot more encrypted PDFs out there than you might think. But you'd never know, because many of them use this default password. Support for PDF decryption is high on my list of features/enhancements. However, I do not plan to include it in the v1.0.0 release.
If anybody would like to work with me to help add support for PDF decryption to pdf-lib, I'd greatly appreciate it! There's lots of work that can be done to make pdf-lib even better than it already is. But the pace of feature development currently depends primarily on how much time I have available to devote to it.
@cjblb19 Regarding your problems merging PDFs: Can you please share an example document or two that I can use to reproduce the issue?
@Hopding. Thanks for your help. The files below work on the previous version: facebookiq_millennials_money_january2016.pdf iOS_Security_Guide.pdf
@Hopding You're right about that encrypted PDF. Is a callback available to decrypt the data? As that file uses AES encryption, maybe the WebCrypto API could be used if the encryption method is known.
Hi @Hopding - pdfDoc.catalog.get('AcroForm') returns undefined.
@philipjmurphy You’ll need to do this:
pdfDoc.catalog.get(PDFName.of('AcroForm'));
v0.x.x converts the strings passed to get and getMaybe to PDFName objects. But in v1.0.0 you must always pass actual PDFName objects.
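One plausible way to picture why a plain string no longer matches is a dictionary keyed by interned name objects rather than by strings. The sketch below is an illustration only: `NameKey` and `namePool` are hypothetical and are not taken from pdf-lib's source.

```javascript
// Sketch of interned name objects, loosely modelled on the behaviour described
// above. NameKey and namePool are hypothetical; this is not pdf-lib's source.
const namePool = new Map();

class NameKey {
  constructor(text) {
    this.text = text;
  }
  // Always return the same instance for the same text, so instances can be
  // used as Map keys and compared by identity.
  static of(text) {
    if (!namePool.has(text)) namePool.set(text, new NameKey(text));
    return namePool.get(text);
  }
}

const catalog = new Map([[NameKey.of('AcroForm'), 'form dictionary']]);

const byString = catalog.get('AcroForm'); // undefined: a string is not the key
const byName = catalog.get(NameKey.of('AcroForm')); // 'form dictionary'
```

Under this assumption, a lookup with a string quietly returns undefined instead of throwing, which matches the symptom reported for pdfDoc.catalog.get('AcroForm').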
@cjblb19 @DanielJackson-Oslo There was a bug in the original PDF merging snippet I posted:
// Doesn't work
mergedPdf.copyPages(pdf, pdf.getPages().keys());
// Does work
mergedPdf.copyPages(pdf, Array.from(pdf.getPages().keys()));
That's what I get for posting code without fully testing it 🙄. Sorry about that!
Here's the complete working version of the mergePdfs function:
async function mergePdfs(pdfsToMerge: string[]) {
  const mergedPdf = await PDFDocument.create();
  for (const pdfCopyDoc of pdfsToMerge) {
    const pdfBytes = fs.readFileSync(pdfCopyDoc);
    const pdf = await PDFDocument.load(pdfBytes);
    const pageIndices = Array.from(pdf.getPages().keys());
    const copiedPages = await mergedPdf.copyPages(pdf, pageIndices);
    copiedPages.forEach((page) => {
      mergedPdf.addPage(page);
    });
  }
  const mergedPdfFile = await mergedPdf.save();
  return mergedPdfFile;
}
I'll update the original snippet to avoid any future confusion if others come across this thread. I might also add a method to PDFDocument that returns Array.from(this.getPages().keys()), just because it's a bit of an awkward snippet. It would be much easier for users to call pdfDoc.getPageIndices() or pdfDoc.getPageRange().
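The difference between the two copyPages calls comes down to what `keys()` returns. The thread does not spell out why the first form fails, but presumably copyPages expects a real array. A short sketch with a plain array standing in for `pdf.getPages()`:

```javascript
const pages = ['first', 'second', 'third']; // stand-in for pdf.getPages()

// keys() returns an iterator, not an array.
const keyIterator = pages.keys();
const iteratorIsArray = Array.isArray(keyIterator); // false
const iteratorLength = keyIterator.length; // undefined

// Array.from drains the iterator into a real array of indices.
const pageIndices = Array.from(pages.keys()); // [0, 1, 2]
```

Any code that reads `.length` or indexes into its argument will see nothing useful in the iterator, which is consistent with a merge that silently produces no copied pages.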
@Hopding Amazing library, I had almost lost hope of generating PDFs in the frontend before I found this library. The beta looks very good, but I miss the API reference you have in master; it was nice. I can report that the beta works in Firefox for generating PDFs using file-saver, verified with the Firefox and IE viewers. Very good job, thank you for all your work!
@Hopding Tested your code now, and it works beautifully. Also learned about the for-of loop! Looks tasty. Thank you for the time you spend on this!
The size difference is not substantial. I just did an export of 61 PDFs before and after the 1.0.0 branch, and the sizes of all 61 PDFs are:
Before: 10,6 MB (10 469 458 bytes total) After 1.0.0: 10,2 MB (10 058 010 bytes total)
I also have some issues with fonts in one of the PDFs, but on both branches. Will try to experiment a bit more to see if the problem is on my end before I open a separate issue.
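For reference, the relative size reduction implied by the two byte totals above can be worked out directly. `percentReduction` is just an illustrative helper name:

```javascript
// Byte totals reported above for the 61 exported PDFs.
const bytesBefore = 10469458;
const bytesAfter = 10058010;

// Relative reduction as a percentage of the original size.
function percentReduction(before, after) {
  return ((before - after) / before) * 100;
}

const reduction = percentReduction(bytesBefore, bytesAfter);
const label = reduction.toFixed(1) + '%'; // '3.9%'
```

That is a saving of 411,448 bytes, or a little under 4% for this particular set of documents.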
I just released 1.0.0-beta.3. It includes a couple of new features (page content translation and the option to save PDFs as base64 strings or data URIs) and a bug fix (https://github.com/Hopding/pdf-lib/issues/135). Here's the full diff: https://github.com/Hopding/pdf-lib/compare/ce12061..d123b59
It's the latest @beta tag. You can also install it explicitly with npm:
npm install --save pdf-lib@1.0.0-beta.3
or yarn:
yarn add pdf-lib@1.0.0-beta.3
It's also available on the unpkg CDN:
Hi @Hopding - have been testing over the last few evenings and it is working very well. I've mainly been testing AcroForm/AcroField manipulation which works perfectly. New API is very clean and easy to use. Thanks
Hello again everybody! I've just finished creating a project site for pdf-lib: https://pdf-lib.js.org/. This site also includes API docs for v1.0.0: https://pdf-lib.js.org/docs/api/ (this should be of interest to you @thommath). I'd greatly appreciate everybody's feedback on the site and the API docs.
Also, I'm planning to close out the beta test in a week or so unless any issues are discovered. Once I close out beta, I'll switch the master branch to track Rewrite and cut the official v1.0.0 release! Thanks again to everybody for helping test things out.
@philipjmurphy Would you be interested in writing up a section for the README (in the Rewrite branch) on migrating from v0.x.x to v1.0.0? I think it would be best for a user of pdf-lib to write this up, as I am more likely to miss small details that could cause frustration to users.
And no worries if you're unable to do this right now. But please do let me know either way 😄
@Hopding Hi I can try to put together what I mentioned above. I just have the notes that I posted here. I'll put it together as best as I can where you can add to it if I've missed other stuff.
v1.0.0 is officially released! The full release notes are available here.
You can install it with npm:
npm install --save pdf-lib@1.0.0
or yarn:
yarn add pdf-lib@1.0.0
It's also available on the unpkg CDN:
Thanks again everybody for your help!
I can't find any example of how to merge pdfs in browser on frontend. Sorry for asking but what should be inside of pdfsToMerge ? Is it an array with urls of pdf files ?
@sgt-madcap Here's a modified version of the example I shared in https://github.com/Hopding/pdf-lib/issues/252#issuecomment-566063380 that demonstrates how to merge two PDFs into a single document:
const mergedPdf = await PDFDocument.create();
const url1 = 'https://pdf-lib.js.org/assets/with_update_sections.pdf';
const url2 = 'https://pdf-lib.js.org/assets/with_large_page_count.pdf';
const pdfABytes = await fetch(url1).then(res => res.arrayBuffer());
const pdfBBytes = await fetch(url2).then(res => res.arrayBuffer());
const pdfA = await PDFDocument.load(pdfABytes);
const pdfB = await PDFDocument.load(pdfBBytes);
const copiedPagesA = await mergedPdf.copyPages(pdfA, pdfA.getPageIndices());
copiedPagesA.forEach((page) => mergedPdf.addPage(page));
const copiedPagesB = await mergedPdf.copyPages(pdfB, pdfB.getPageIndices());
copiedPagesB.forEach((page) => mergedPdf.addPage(page));
const mergedPdfFile = await mergedPdf.save();
Unhandled Rejection (TypeError): pdfDoc.embedJpg is not a function
(anonymous function)
D:/bugsynext/src/components/employees/employeeProfile/EmployeeImmigrationWF/DocumentsUploaded/DocumentsUpload.jsx:148
145 |
146 | // Load a PDFDocument from the existing PDF bytes
147 | const pdfDoc = PDFDocument.load(existingPdfBytes)
148 | const jpgImage = pdfDoc.embedJpg(jpgImageBytes)
    | ^
149 | const pngImage = pdfDoc.embedPng(pngImageBytes)
150 |
@Hopding @matthopson @thommath @kevinswartz @mlecoq @philipjmurphy @matthopson @kevinswartz please help me out of this
If anybody would like to work with me to help add support for PDF decryption to pdf-lib, I'd greatly appreciate it!
@Hopding is it still open :-)? Any guidelines for starters, please (I'm new to the PDF spec)? I was looking into this doc.
You made that comment in 2019. It's 2023. @Hopding have you added support for default-password PDFs?
@cjblb19 It sounds like you are having two distinct issues:
- You're able to merge PDFs without any errors, but the resulting mergedPdf only contains a single blank page
- You're unable to load certain PDFs due to an encryption error
Is this summary correct?
I looked at the PDF file you shared, and it is actually encrypted. You can verify this with Acrobat Reader:
This is understandably confusing. After all, how can a document be encrypted if you're able to open it in your browser without entering a password? You can do this because the PDF specification actually defines a default password:
This default password is what Chrome (and other readers) use to decrypt the PDF - thus allowing you to view an encrypted PDF without entering a password.
Encrypted documents are one of pdf-lib's weak points. It turns out that there are a lot more encrypted PDFs out there than you might think. But you'd never know, because many of them use this default password. Support for PDF decryption is high on my list of features/enhancements. However, I do not plan to include it in the v1.0.0 release.
If anybody would like to work with me to help add support for PDF decryption to pdf-lib, I'd greatly appreciate it! There's lots of work that can be done to make pdf-lib even better than it already is. But the pace of feature development currently depends primarily on how much time I have available to devote to it.
Hi, is there a way to decrypt the default password?
@jerp @gregbacchus @mlecoq @kevinswartz @Jonathan-Mckenzie @philipjmurphy @DanielJackson-Oslo @matthopson @ithillel-aminev @vitaly-zdanevich
Hello everybody! Today I'm excited to announce the beta release of pdf-lib v1.0.0!
You have all provided extremely valuable feedback on pdf-lib over the past year. This feedback has highlighted several architectural flaws in the design of pdf-lib (e.g. the inability to load invalid PDFs). It has also brought to my attention some critical shortcomings in pdf-lib's feature set (e.g. no methods for getting the width/height of a page).
I've responded to these issues with specific workarounds and fixes as they've been reported. But I've also been thinking about how to solve them in a more holistic way.
With this goal in mind, I've been hard at work these past few months on a complete rewrite of pdf-lib. This rewrite is now mostly complete. All that remains is to write documentation and implement a few small features.
I'd like to ask for your help in beta testing this rewrite. I've written extensive automated and manual tests, and have verified that everything works in all major PDF readers (Acrobat, Foxit, Preview) and browsers (Chrome, Firefox, Safari). But since this rewrite is so extensive, I do not want to do a full release until others have been able to test it.
I'm working on a complete changelog, but it might be a week or two until I am able to complete it. So, in the meantime, here's a list of the main changes/improvements in v1.0.0:
The README for the rewrite can be found here: https://github.com/Hopding/pdf-lib/tree/Rewrite.
You can install the beta version of v1.0.0 with npm or yarn. It's also available on the unpkg CDN.
Those of you that intend to participate in this beta test, please post a comment in this thread to let me know! If you do not plan to participate, please just ignore this (I understand that not everybody will be able to participate).
I'd like to keep all issues and discussion pertaining to the beta test centralized in this thread. However, if you need to communicate privately with me, please feel free to email me at andrew.dillon.j@gmail.com.
I appreciate your help and am looking forward to a successful release of v1.0.0!