PhilterPaper / Perl-PDF-Builder

Extended version of the popular PDF::API2 Perl-based PDF library for creating, reading, and modifying PDF documents
https://www.catskilltech.com/FreeSW/product/PDF%2DBuilder/title/PDF%3A%3ABuilder/freeSW_full
Other
6 stars 7 forks source link

cannot add outlines to a document with outlines #207

Open az143 opened 10 months ago

az143 commented 10 months ago

this looks like a reopen of #67, for pdf::builder 3.025 on perl 5.36.0.

symptom: when adding an outline to a document that already has any, the newly added ones vanish (i cannot find any visual evidence of their existence using xpdf and browser-based rendering). it gets weirder: depending on whether outlines->count is called or not it may report the expected grand total or not, and the resulting files differ in size.

to reproduce: i used sample_55.pdf and the 055_outlines example as shipped with pdf::builder, and split the outline creation into two phases: add one outline, save interim file, open that, add another two, save. final result: just one outline is visible.

the attached 055_outlines-modify-fails does not call outlines->count after reopening, and the final count reported is 2 (should be 3).

perl ./055_outlines-modify-fails
after adding one, doc reports 1 outlines
after adding two more, doc reports 2 outlines
saving as examples/055_outlines.sample_55-dudcount.pdf

the attached 055_outlines-modify-fails-but-counts-ok does call outlines->count after reopening and the final count is now 3.

perl ./055_outlines-modify-fails-but-counts-ok 
after adding one, doc reports 1 outlines
after reopening, doc reports 1 outlines
after adding two more, doc reports 3 outlines
saving as examples/055_outlines.sample_55-okcount.pdf

in both cases the final document only shows the very first outline. the file sizes differ, too. the test scripts are attached (renamed .txt to appease github), ditto resulting pdfs and the interim one.

please let me know if there is anything else you'd like me to do to pinpoint and fix this one.

055_outlines-modify-fails.txt 055_outlines-modify-fails-but-counts-ok.txt 055_outlines.sample_55-dudcount.pdf 055_outlines.sample_55-okcount.pdf 055_outlines.sample_55_oneoutline.pdf

PhilterPaper commented 10 months ago

Thank you for the report. I will take a look at it over the next day or two, and at least confirm what you've found and whether it was any sort of usage error. I see you have attached some .pl code to demonstrate the problem (I find it clearer to suffix the program with .txt for uploading, rather than replacing .pl with .txt -- I almost went and asked you to supply the .pl code).

PhilterPaper commented 10 months ago

I can confirm that it's not working correctly (i.e., a bug). After a careful reading of the code and some experimenting, the first thing I want to do is update the POD to show how to do multi-level (nested) bookmark lists, where $parent is given in the outline() call. It does appear to work; it's just undocumented. Then I want to figure out how to place a new bookmark at some other place than the end of the list ($prev value given is apparently supposed to do this), which could be very useful if you're updating a bookmark list and want to insert a new bookmark in-between two existing ones. I may have to extend the code to handle inserting a new bookmark (Outline) before the existing first one. Then I have to confirm that open and close of a bookmark does the right thing for a Count of descendants.

Once all that is working, I will have to look at what's going on in the simple update case you gave. I can already see that the Outlines root is not updated when you add more top-level Outline objects, so that's something to look at.

PhilterPaper commented 4 months ago

@vadim-160102, does this one sound related to your ssimms/pdfapi2/issues/83 (appended below:)? I have gotten stuck on this one, and hope you might have some insights!

I totally forgot about this.


use strict;
use warnings;
use feature 'say';
use utf8;
use Test::More;
use PDF::API2 2.047;

my $pdf;

########################################################

$pdf = PDF::API2-> new; $pdf-> page; $pdf-> outline-> outline for 1 .. 2; $pdf-> outline-> last-> delete; is $pdf-> outline-> count, 1, 'looks like we successfully deleted the last item...';

$pdf = PDF::API2-> from_string( $pdf-> to_string ); is $pdf-> outline-> count, 1, '... (save, re-open and check to be sure)';

########################################################

$pdf = PDF::API2-> new; $pdf-> page; $pdf-> outline-> outline-> title( 'bookmark' ); $pdf = PDF::API2-> from_string( $pdf-> to_string );

my $title = eval { $pdf-> outline-> first-> title }; is $@, '', 'we can examine bookmarks in existing file';

$pdf-> outline-> count; # fix for test #3

$title = $pdf-> outline-> first-> title; is $title, 'bookmark', 'yes we can';

########################################################

my $chinese = '中國的'; $pdf = PDF::API2-> new; $pdf-> page; $pdf-> outline-> outline-> title( $chinese ); $pdf = PDF::API2-> from_string( $pdf-> to_string );

$pdf-> outline-> count; # fix for test #3

is join( ' ', unpack 'U', $pdf-> outline-> first-> title ), join( ' ', unpack 'U', $chinese ), '...and read interesting characters';

########################################################

$pdf = PDF::API2-> new; $pdf-> page; $pdf-> outline-> outline-> title( $_ ) for qw/ a B c D e F /; $pdf = PDF::API2-> from_string( $pdf-> to_string );

$pdf-> outline-> count; # fix for test #3

my $bm = $pdf-> outline-> first; while ( $bm ) { my $next = $bm-> next; $bm-> delete if $bm-> title =~ /\p{Lu}/; $bm = $next } $pdf = PDF::API2-> from_string( $pdf-> to_string );

is $pdf-> outline-> count, 3, 'we can modify outline in existing file and save changes';

########################################################

done_testing;

The output:

ok 1 - looks like we successfully deleted the last item... not ok 2 - ... (save, re-open and check to be sure)

Failed test '... (save, re-open and check to be sure)'

at outline.pl line 19.

got: '2'

expected: '1'

not ok 3 - we can examine bookmarks in existing file

Failed test 'we can examine bookmarks in existing file'

at outline.pl line 29.

got: 'Can't locate object method "title" via package "PDF::API2::Basic::PDF::Objind" at outline.pl line 28.

'

expected: ''

ok 4 - yes we can not ok 5 - ...and read interesting characters

Failed test '...and read interesting characters'

at outline.pl line 45.

got: '254 255 78 45 87 11 118 132'

expected: '20013 22283 30340'

not ok 6 - we can modify outline in existing file and save changes

Failed test 'we can modify outline in existing file and save changes'

at outline.pl line 67.

got: '6'

expected: '3'

1..6

Looks like you failed 4 tests of 6.

2 files to patch:

--- PDF/API2_orig.pm Sat May 18 21:19:41 2024 +++ PDF/API2.pm Sat Jun 22 11:28:39 2024 @@ -833,6 +833,7 @@ bless $obj, 'PDF::API2::Outlines'; $obj->{' api'} = $self; weaken $obj->{' api'};

--- PDF/API2/Outline_orig.pm Sat May 18 21:19:41 2024 +++ PDF/API2/Outline.pm Sat Jun 22 11:27:46 2024 @@ -90,26 +90,24 @@ if ($count) { $self->{'Count'} = PDFNum($self->is_open() ? $count : -$count); }

@@ -124,6 +122,9 @@ if (defined $self->{' children'} and defined $self->{' children'}->[0]) { $self->{'First'} = $self->{' children'}->[0]; }

@@ -140,6 +141,9 @@ if (defined $self->{' children'} and defined $self->{' children'}->[-1]) { $self->{'Last'} = $self->{' children'}->[-1]; }

@@ -154,7 +158,7 @@

sub parent { my $self = shift();

@@ -167,8 +171,11 @@ =cut

sub prev {

@@ -181,8 +188,11 @@ =cut

sub next {

@@ -208,6 +218,7 @@ $self->{' api'}->{'pdf'}->new_obj($child); }

@@ -268,6 +279,7 @@ $item = $item->next(); push @{$self->{' children'}}, $item; }

@@ -286,12 +298,20 @@

 my $prev = $self->prev();
 my $next = $self->next();

@@ -403,6 +429,7 @@ $self->{'Dest'} = PDFStr($destination); }

@@ -443,6 +470,7 @@ $self->{'A'}->{'S'} = PDFName('URI'); $self->{'A'}->{'URI'} = PDFStr($uri);

@@ -465,6 +493,7 @@ $self->{'A'}->{'S'} = PDFName('Launch'); $self->{'A'}->{'F'} = PDFStr($file);

@@ -511,6 +540,7 @@

 $self->{'A'}->{'D'} = _destination(PDFNum($page_number), $location, @args);

I understand the PDF::API2::Outline::count() does not "count" descendants/children, but its use is OK for these tests. Further, it may seem that tests 1-2 are superfluous and their case is covered in later tests, but that is not exactly true. The later tests deal only with accessing outline in existing (i.e. opened with PDF::API2) files. The 1st test also produces invalid PDF syntax, i.e. broken outline tree, in freshly generated file. The same happens when deleting the first outline item, but then the "phantom" bookmark is somehow "invisible" i.e. ignored by Reader, PDF::API2, etc. The new first/last item still have their prev/next attribute set and pointing at phantom (deleted) item. (Perhaps "Outline tree" has unnecessary redundancy and is never checked in full by readers/browsers?) As to "why would anyone want to generate a fresh tree and immediately delete the first/last item" -- I don't know. Test #3 fails because objects aren't blessed into correct class. Effectively, the PDF::API2::Outline::count() will read the entire tree and re-bless (therefore test #2 did not result in fatal error); this call is injected in the above script to enable tests 4-6; and is later used in the patch, seemingly as a no-op. I'll further comment on changes if necessary. Some were discovered only later when fixing and testing for most obvious ones. Maybe I missed something.

Edit: I think I missed something indeed! (Rubber duck strikes again.) The count() can't reliably be used to re-bless because it doesn't follow items which are "closed". Perhaps dedicated internal traversing sub will be required.

Edit 2 : So, then, the sub below should be added/appended to PDF/API2/Outline.pm; and line patched in PDF/API2.pm should be a call to this sub i.e. $obj->_fix_on_open();

sub _fix_on_open {
    my $self = shift();

    if ($self->has_children()) {
        $self->_load_children() unless exists $self->{' children'};
        foreach my $child (@{$self->{' children'}}) {
            $child->_fix_on_open();
        }
    }
}
vadim-160102 commented 4 months ago

Both the public interface and its documentation are sufficient to read/write outline trees, not sure why you think otherwise. The patch should fix bugs as reported here in the ticket, it's easily verifiable and not insightful neither. But looking at "055_outlines.sample_55-okcount.pdf", how about this for an insight: the PDF::API2::open_page() call is surprisingly intrusive, and even destructive.

I think one can reasonably expect that open_page() as argument for PDF::API2::Outline::outline() would only be used to generate an indirect object reference like 12 0 R, and it is read-only operation. It is not.

Another example: suppose I want to add a page to an existing PDF document. I'd like to know the media size of pages already present there. Can't think of anything else but e.g. $pdf->open_page(1)->size() (or whatever moniker the latter method goes by now) as officially recommended way to do that. This call should be read-only, right? Wrong. The saved document with new appended page reveals that the 1st page was also stuck after original %%EOF. Not only that, but it was modified.

For grotesque exaggeration, try:

use strict;
use warnings;
use PDF::API2 2.047;

my $pdf = PDF::API2-> new( compress => 0 ); 
$pdf-> page-> graphics-> save-> restore;
for ( 1 .. 10 ) {
    $pdf = PDF::API2-> from_string( $pdf-> to_string, compress => 0 );

    $pdf-> open_page( 1 )       # no-op? nope
}
$pdf-> save( 'visual_test.pdf' );

and inspect the final file in your text editor which has "show non-printing characters" on for full (hilarious) effect. A call to open_page() will:

If "array of streams" is the syntax choice of how PDF::API2 generates new content -- fine, it's perfectly OK. What I don't understand is why to touch streams already present and inject changes I didn't ask for. E.g. such as i.e. amongst them the nounrotate no longer documented and therefore forcibly disabled.

I checked as far back as 0.4 version which was 20 years ago. Looks like the above was there from the very beginning. Perhaps too late to ask "why?". It won't be too difficult to fix to how I think it should behave. However, (a) leaving the benefit of a doubt, of course it is possible I don't get it right, there are valid reasons for everything; and (b) reaction if any arrives in a few months or years judging by past experience, and probably "no, thanks".


Uhm, OK, if that was not an insight, here's another very short and easy one. Though unrelated to PDF::API2::Outline at all. The TrimBox (same with Bleed- and Art-) is not an inheritable property. Please check the Reference. The PDF::API2::trimbox() (deprecated/renamed but OK as quick referent here) writes something to a file which is just ignored by any compliant reader. There is no such thing as "global trimbox". Looks like it also was there for 20+ years.

PhilterPaper commented 4 months ago

Both the public interface and its documentation are sufficient to read/write outline trees, not sure why you think otherwise.

Uh, I don't. I did go through and update and clarify the PDF::Builder documentation a bit, particularly with regards to creating multi-level outlines. The problem reported in this ticket concerns adding new outlines (bookmarks) to an existing PDF, which doesn't work. Your API2 ticket sounded like it might be related, and I was hoping you'd have some insights into that. It would be great if your fix also works for this ticket. Unfortunately, I have to get 3.027 out the door before I can resume work, and try your fix.

As for the "intrusive/destructive" calls you reported, I will have to look at them carefully. I may be back with more questions. Updating a PDF with new content does seem to create unnecessary updates to existing content, but I want to understand it thoroughly before changing anything.

I'm surprised to hear about the Trim- Bleed-, and Art-Box issues. I'll have to review the standard and decide what (if anything) to do.

vadim-160102 commented 4 months ago

Sorry I wasn't clear: "easily verifiable" means of course I did check it, and ticket's code runs as expected with PDF::API2 if the patch is applied.

+I think PDF::API2 was right when it removed any mention of PDF::API2::Outline::new() from its documentation; one of the positive developments. I.e. the constructor is not public and not supposed to be called by the end user.

+as kind of proactive answer to questions you haven't asked yet: the Reference is clear that consumers are free to modify Contents structure i.e. to/from array or any number of items in said array on read/write. But I read it of course that such freedom is granted if content is modified or file is simply re-saved and everything regenerated from scratch (cleanoutput()) which PDF::API2 can't do. And not that it means Contents is bound to be modified by simple fact of getting a reference to page's object and then stuck after %%EOF.

PhilterPaper commented 4 months ago

That's great news that your fix covers the problem reported at the top of this ticket, PDF::Builder # 207, as well as PDF::API2 # 83. At least, that's what it sounds like you're saying! After I get a few nagging items wrapped up for PDF::Builder, I'll try to slip in your fix and confirm that it fixes this one. That would be worth a few more days' delay. Thanks!