boazsegev / combine_pdf

A Pure ruby library to merge PDF files, number pages and maybe more...
MIT License

lazy load of pdf document #42

asiniy closed this issue 8 years ago

asiniy commented 8 years ago

Hello! I want to use combine_pdf to parse a PDF of roughly 300 pages. But when I try to load this relatively big document via CombinePDF.load(), memory usage increases drastically and it fails with a memory allocation error: broken pipe(). It looks like I need to load just the 1st page, edit it, then load the 2nd page, edit it, and so on. Is there any way to do that?

boazsegev commented 8 years ago

Hi Alex,

I'm sorry to say that this feature isn't available at the moment...

PDF files have a complex structure with binary references to different objects within each file. It is common for one or more pages to share these resources (such as fonts), allowing the same binary font data to be used more than once.

This sharing of data between pages is done by using object references to the binary data - somewhat like C pointers.
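To make the pointer analogy concrete, here is a minimal Ruby sketch of two pages sharing one font through a reference. It is only a toy model: the `objects` table, the `:ref` marker and the `resolve` helper are made up for illustration and are not CombinePDF's internal data structures.

```ruby
# Toy model of PDF indirect objects: a table keyed by object number.
# The object numbers and keys are hypothetical, chosen for illustration.
objects = {
  1 => { type: :font, data: "binary font data" },
  2 => { type: :page, font: { ref: 1 } },
  3 => { type: :page, font: { ref: 1 } }
}

# Follow a reference marker to the object it points at.
# Anything that is not a reference marker is returned unchanged.
def resolve(objects, value)
  value.is_a?(Hash) && value.key?(:ref) ? objects.fetch(value[:ref]) : value
end

font_a = resolve(objects, objects[2][:font])
font_b = resolve(objects, objects[3][:font])

# Both pages reach the very same object, so the font data is stored once.
puts font_a.equal?(font_b)
```

Because each page only holds a small marker, the font data exists once no matter how many pages use it.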

At the moment, CombinePDF scans the whole file from "top to bottom", instead of following the binary references to each object. This helps CombinePDF avoid data duplication, since CombinePDF simply marks the references instead of jumping from the references to the actual data (although during the parsing process, some limited data duplication is unavoidable due to String allocations)...
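A rough sketch of what "top to bottom, marking references" can look like with Ruby's `StringScanner`. This is not CombinePDF's actual parser: the input is a made-up, heavily simplified PDF-like string, and `Reference` and `scan_objects` are hypothetical names used only for this example.

```ruby
require 'strscan'

# A made-up, heavily simplified PDF-like body (not a valid PDF file).
source = <<~PDF
  1 0 obj (font data) endobj
  2 0 obj 1 0 R endobj
  3 0 obj 1 0 R endobj
PDF

# Placeholder for an unresolved reference to another object.
Reference = Struct.new(:id)

# Scan the whole string once, front to back, collecting objects by number.
def scan_objects(source)
  scanner = StringScanner.new(source)
  objects = {}
  until scanner.eos?
    scanner.skip(/\s+/)
    break if scanner.eos?
    # Object header: "<number> <generation> obj"
    unless scanner.scan(/(\d+)\s+\d+\s+obj\s*/)
      raise ArgumentError, "unexpected input at offset #{scanner.pos}"
    end
    id = scanner[1].to_i
    if scanner.scan(/(\d+)\s+\d+\s+R\s*/)
      # Mark the reference; do not jump to the referenced data.
      objects[id] = Reference.new(scanner[1].to_i)
    elsif scanner.scan(/\(([^)]*)\)\s*/)
      # A literal string value is stored as-is.
      objects[id] = scanner[1]
    else
      raise ArgumentError, "unsupported value in object #{id}"
    end
    scanner.scan(/endobj/) or raise ArgumentError, "missing endobj for object #{id}"
  end
  objects
end

objects = scan_objects(source)
```

Objects 2 and 3 end up holding a `Reference` to object 1 rather than a copy of its data, which is the duplication-avoiding behaviour described above. The cost is that the scanner needs the whole string before it can start.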

... This means that I cannot add this feature without rewriting the parser completely :-/

It might have been possible to avoid loading the whole file into memory before parsing it... but this too would now require a major rewrite of the parser (which currently uses Ruby's strscan extension and would need to use some type of IO-based object instead).
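The difference between the two approaches can be sketched as follows. This is only an illustration under simplifying assumptions: `StringIO` stands in for an opened file, the data is made up, and reading line by line only works here because each toy object fits on one line (real PDF objects span lines and contain binary streams, which is part of why the rewrite would be major).

```ruby
require 'stringio'
require 'strscan'

# Made-up, simplified PDF-like data with one object per line.
data = "1 0 obj (a) endobj\n2 0 obj (b) endobj\n3 0 obj (c) endobj\n"

# Whole-buffer approach: StringScanner works on a complete String,
# so the entire file has to be in memory before scanning starts.
whole = StringScanner.new(data)
count_whole = 0
count_whole += 1 while whole.skip_until(/endobj/)

# IO-based approach: only one line is held at a time.
# StringIO stands in for something like File.open(path, 'rb').
io = StringIO.new(data)
count_stream = 0
io.each_line { |line| count_stream += line.scan(/endobj/).size }

puts count_whole == count_stream
```

Both loops find the same objects, but only the second one could run without first reading the whole document into a single String.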

I should point out that I have combined single PDF files with thousands of pages (I think the largest one I remember from early production, when I was still testing these things, was more than 5,000 pages of court proceedings, with some being text and some scanned page images... which is crazy)...

It could be that the issue is that the PDF contains a lot of "heavy" data, such as images, or that the PDF is encrypted (requiring CombinePDF to decrypt some of the data and causing duplications)... I have no idea if it's possible to "shrink" the PDF first, using a different tool...

I wish you the best of luck, I wish I had a better answer... Bo.

asiniy commented 8 years ago

Hello Bo,

Thanks for the detailed explanation. The file looks pretty small, so I can't understand why it hasn't loaded...

Alex

boazsegev commented 8 years ago

Do you experience the same issue with other files? Maybe combine a few files to get another ~300 page file and see?

If the issue is only with the file, could you send me the file so I can have a look at why it's not working?

I'm a bit busy these two weeks, but maybe I could find out what happened.

asiniy commented 8 years ago

Other files are processed well...

I sent an agile.pdf file to Boaz@2be.co.il.

Thanks!

boazsegev commented 8 years ago

Hi Alex,

I just wanted to let you know that the fix for issue #49 was apparently related to the agile.pdf, or at least it seems that way.

I can open and save a copy of agile.pdf using the new version... but I know the links in agile.pdf will break when I combine it with other files (TOC merging isn't supported yet)...

I hope this helps.

asiniy commented 8 years ago

Hello Bo!

Thanks a lot!