galkahana / HummusJS

Node.js module for high performance creation, modification and parsing of PDF files and streams
http://www.pdfhummus.com

PDF Merging slow #300

Open phal0r opened 6 years ago

phal0r commented 6 years ago

Hello guys,

first of all, thanks for this module. It is really nice to use, and there is a lot of information and many examples on how to accomplish things. We are currently creating print PDFs on the fly, which means merging several single-page PDFs into one complete document. In our tests this is quite slow: every page that gets appended to the complete document takes about 400-500 ms on a Core i7. Is there a pattern to speed this up? We do not modify the single-page PDFs, so it is really just an append operation.

Thanks in advance!

galkahana commented 6 years ago

Should be mostly IO, so try a fast drive. Since these are single-page PDFs that you are merging, there's little possible sharing between them, and each one needs to be parsed. Depending on how you generate the single-page PDFs, you may gain something by batching them into multi-page PDFs and merging those together. But I'm fairly sure that the biggest help would come from improving IO rates, which means having them on a fast local drive.

phal0r commented 6 years ago

@galkahana Thanks for the answer. My current implementation uses only Buffers, so everything already resides in RAM. Looking at the system resources, the limiting factor is the CPU. I have a Core i7, which should be quite fast.
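
For reference, a Buffer-only merge of the kind described here might look roughly like the sketch below, with a timer around each append to show where the per-page cost lands. This is not the poster's actual code. `mergeBuffers` and `timingsMs` are hypothetical names, and the HummusJS calls (`PDFWStreamForBuffer`, `PDFRStreamForBuffer`, `createWriter`, `appendPDFPagesFromPDF`, `end`) are used as I understand them from the HummusJS wiki, so check them against the installed version.

```javascript
// Sketch only. `hummus` is passed in as a parameter (an assumption made here
// so the function can also be exercised with a stub); in real use you would
// call mergeBuffers(require('hummus'), buffers).
function mergeBuffers(hummus, buffers) {
  // In-memory output stream; HummusJS is assumed to expose the bytes
  // written so far on its `buffer` property.
  const out = new hummus.PDFWStreamForBuffer();
  const writer = hummus.createWriter(out);
  const timingsMs = [];

  for (const buf of buffers) {
    const start = process.hrtime.bigint();
    // Parses the source PDF and copies all of its pages into the output.
    writer.appendPDFPagesFromPDF(new hummus.PDFRStreamForBuffer(buf));
    // Elapsed time for this append, converted from nanoseconds to ms.
    timingsMs.push(Number(process.hrtime.bigint() - start) / 1e6);
  }

  writer.end(); // finalizes the output PDF
  return { pdf: out.buffer, timingsMs };
}
```

If the numbers in `timingsMs` stay in the 400-500 ms range with no disk access involved, that would support the observation that the time is spent on CPU work inside the append rather than on IO.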

I tried an internal web service that does something similar, but with Apache PDFBox, and it takes ~2-3 seconds to do the same. So I was wondering whether there might be some smart optimizations, like parsing only the container format of the PDF (I don't even know if this is possible, but my assumption is that the parsing part takes most of the CPU time) and appending it to the new PDF.

The PDFs already exist, so it's not possible to pre-batch them at some point. I could spawn another process and split the appending between them so that I can use more CPU cores, but first I wanted to know whether the process itself can be smarter somehow :)
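
The multi-process idea above needs the input list split into ordered groups first. A minimal sketch of that step, assuming the page Buffers are held in an array; `splitIntoChunks` is a hypothetical helper, not part of HummusJS:

```javascript
// Split `items` into at most `parts` contiguous chunks, preserving order,
// so that each chunk can be handed to a separate worker process.
function splitIntoChunks(items, parts) {
  // Never create more chunks than there are items, and always at least one.
  const n = Math.max(1, Math.min(parts, items.length));
  const size = Math.ceil(items.length / n);
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Example: five pages split across two workers.
const pageGroups = splitIntoChunks(['p1', 'p2', 'p3', 'p4', 'p5'], 2);
// pageGroups is [['p1', 'p2', 'p3'], ['p4', 'p5']]
```

Each group could then be merged in its own process (for example via `child_process.fork`), with the partial documents appended in order at the end. Whether this pays off is an open question: the final step still has to parse every partial document, so the gain depends on how much of the per-page cost is fixed per file rather than proportional to its content.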