[RT 113516] After slurping the PDF file into memory, a copy is immediately made -- is it necessary?

PhilterPaper commented 7 years ago

Sat Apr 02 08:00:06 2016 futuramedium [...] yandex.ru - Ticket created Subject: After slurping the PDF file into memory, a copy is immediately made -- is it necessary?

The PDF::Builder->open() reads the entire file, and then passes this scalar to 'open_scalar' i.e. a copy is made. Then:

D:\>perl -MPDF::Builder -Mstrict -wE "my $f = PDF::API2->open('520_Mb_file.pdf')"
Out of memory!

, obviously. I think the 'open_scalar' could be left as is because it always was there. An 'open_scalar_ref' could be added, with minor obvious modifications, and 'open' re-written to use 'open_scalar_ref' instead of 'open_scalar'. Then the above command completes without error, and everything else seems to work OK.

Further, I think, because PDF::Builder::Basic::PDF::File->open() accepts an 'update' parameter (looks like a contract to never re-use (if 0) this internal scalar, never to 'stringify', etc.), and also because it can accept filename instead of in-memory file filehandle -- there could be some kind of flag implemented, to use a disk file instead of slurping a file and using in-memory file. Performance will suffer, but at least, for really huge files, we would be able to extract pages, resources, etc.

PhilterPaper commented 7 years ago

on Fri Aug 18 13:17:16 2017 steve [...] deefs.net - Correspondence added

I've just spent some time looking into this. The core PDF code in PDF::API2::Basic should work fine if it's passed a filehandle -- I don't think there's any code in there that needs a copy or an in-memory version of the PDF.

It should be possible to modify PDF::API2->open() to pass the filehandle to PDF::API2::Basic::PDF::File rather than slurping it into memory and calling open_scalar. The only other subs that may need to be changed are finishobjects, save, saveas, and stringify, which are all in PDF/API2.pm as well. I don't think any other file would need to be touched, so it should be a fairly straightforward patch, and I've just renamed a couple of variables and cleaned up a bit of code to make it easier to implement.

My inclination is that the default/expected behavior should be to read the file as needed, with an option to slurp it into memory for performance reasons, rather than having the current behavior be the default. # Fri Aug 18 13:17:17 2017 The RT System itself - Status changed from 'new' to 'open' # Fri Aug 18 13:18:42 2017 steve [...] deefs.net - Broken in 0.59 added # Fri Aug 18 13:18:42 2017 steve [...] deefs.net - Broken in 2.027 deleted

PhilterPaper commented 7 years ago

on Thu Aug 31 07:22:31 2017 futuramedium [...] yandex.ru - Correspondence added

Thank you for your attention to the problem. I hope I'm finally over with worrying about 32-bit OSes, so slurping is now less of an issue. Maybe for others, too. Just a note for your list of priorities :).

Please note, however, that part of what I suggested above (passing reference instead of scalar) was safe and cheap (no new tests required) way to reduce to 50% memory footprint while reading any file.

<aside> Memory and time required to read/slurp (be it 50% or original 100%) may be tiny compared to efforts to parse and build Perl structures for complex files. E.g., using brand-new i5-7500 machine:

perl -MPDF::API2 -E "PDF::API2-> open('PDF Reference 1.7.pdf'); say time - $^T" 
366

Then why care about slurping? OK, maybe new ticket should be opened to address such performance. </aside>

While, what you are suggesting may require a bit more planning and effort. E.g., now I think that benefits of no-slurping will be void if proper incremental update isn't implemented. The current "update" should be re-written, to append to original file. + It should work with multiple updates while object persists. + Issue when updating a file with XRefStream should be addressed, too.

PhilterPaper commented 6 years ago

Referenced by RT 122962 (#78),

PhilterPaper commented 3 years ago

A few days ago (December 8), Steve added code to PDF::API2 to not read in the whole file at once, but to read it as needed. As I don't quite understand his reasoning, and the code is complicated and undocumented, I think I will let this one lie until it's been tested in the field for a few months (2.039, whenever that arrives) before I put the code into PDF::Builder. Another update a few days later (December 16) makes the code behave when a PDF file is read-only. Both updates get rather tangled up with some other updates I've made, so it will be a major project to figure them out and put them in.

Anyway, it's possible that this new code will be the fix for this ticket, as it basically seems to address the problem.

PhilterPaper commented 3 years ago

Fri Feb 26 10:31:19 2021 steve [...] deefs.net - Correspondence added

PDF::API2 no longer reads the entire file into memory before working on it.

(This code has not yet been put in PDF::Builder -- I want to see if it causes problems in PDF::API2 first, and it's also quite complicated and conflicts with a number of my changes)

PhilterPaper commented 3 years ago

See #170 for possible problem with this change (large file partial read).

PhilterPaper / Perl-PDF-Builder

[RT 113516] After slurping the PDF file into memory, a copy is immediately made -- is it necessary? #34