Delete, discard, overwrite pages

PhilterPaper commented 2 years ago

@sciurius wrote for ssimms/pdfapi2/issues/56:

While creating a PDF document I create pages sequentially. At a certain point I decide that the last couple of pages that have been written are wrong, and I want to discard these and then add new pages. Since PDF::API2 does not have a remove_page method, what would be 'the right' way to obtain this? Do I need to manually manipulate the pagestack?

I suspect that it's little more than deleting the appropriate page items in the pagestack (and of course, their children if getting rid of the root doesn't completely clean up), and keeping any current or next page pointer or counter updated. I haven't looked into it yet.

It would be good to have a formal method for that, such as $pdf->delete_last_pages($count). According to my Road Map for PDF::Builder, I also want the ability to move objects around on a page and between pages, as well as re-order pages, so any page deletion should be "compatible" with that.

sciurius commented 2 years ago

Thanks for the enhancement label. To me it feels like an omission to not be able to delete pages.

PhilterPaper commented 2 years ago

sciurius wrote:

This is what I'm trying to achieve. I create the PDF document chunk-wise. Each chunk is one or more pages. Before each chunk there may be one blank page for page alignment. Now if there is a blank page (I know that) and the chunk is two pages, I do not want the blank page. So I want to discard the last three pages, and re-process the chunk. Re-processing is necessary since the page headers and footers will be different (left/right, page numbers).

The hard way is to process each chunk to a temporary document, and then process it again to the main document, first inserting a blank page unless it is two pages.

Maybe there are other (better) ways to achieve this?

If I understand you correctly, you produce your output, and if it's an odd number of pages, you want to insert a blank page at the beginning of the chunk, but if even, insert nothing extra? It sounds like standard practice with Chapters starting on a Right-Hand page. Is that what you're trying to do? If so, why wouldn't you take care of this by adding a blank page at the end of the previous chapter (if it ends on a Right-Hand page), so that you don't have to go back and re-do so much work? I must be missing something here...

sciurius commented 2 years ago

Normally, a song (chunk) starts on the right page. If necessary, a blank page is added before it. But when the song spans two pages, it is better to start it on the left page so the complete song can be viewed at once. Unfortunately we do not know how many pages the song has until it has been processed.

PhilterPaper commented 2 years ago

Other than the headers, page numbering, etc., is there any difference in the formatting of a song whether it starts on a left page or a right page? If not, what about the following:

1. Write page one of the chunk (song) without headers or page numbers.
2. If needed, write page two in the same manner.
3. If a single page, insert a blank page before page one.
4. Edit page one and add headers and other left/right-dependent stuff.
5. Edit page two and add headers etc. as with page one.

If that's not feasible (the formatting of the song itself depends on left or right page start), can you make a first pass through the song, without yet putting ink on the paper, and determine if it will spill onto a second page? I would presume that you have the content before starting, so could you make a dry run and see if two pages are needed? I'm assuming that you are writing out the song one line at a time, rather than using a paragraph-fill method.

sciurius commented 2 years ago

That was what I initially intended to do. But editing in the page headers/footers turned out to be a nasty can of worms so I dropped this approach. A "first pass through the song, without yet putting ink on the paper" would be a production pass with a dummy PDF output document, hence almost doubling the processing time. ... while discarding pages of the current document seems sooooo attractive.

PhilterPaper commented 2 years ago

OK, into the hopper it goes. It seems like a nice project for someone to take a crack at, if someone wishes to issue a PR against PDF::Builder. Otherwise I'll try to get to it in a near-term release.

a production pass with a dummy PDF output document, hence almost doubling the processing time.

I don't buy that figure. In the prepass, you will not be doing headers and footers, nor final formatting and writing to PDF, so it should be a lot less than that (double). Plus, you can quit that pass immediately once the song is found to exceed one page. It may come down to whether a single page song is rare, or a two-page song is rare, or if both are common. I don't know how you want to handle three+ page songs. And can you save layout and formatting information from the prepass, so that the second pass goes much more quickly (not doing everything over again, so long as you do the complete song on the prepass)?
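For illustration, a minimal sketch of a prepass that quits as soon as the song overflows one page. It assumes the height of each output line is already known; `fits_on_one_page`, the line heights, and the usable page height are all hypothetical, and nothing here is a PDF::Builder call.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PDF::Builder): decide whether a song fits
# on one page by summing line heights, stopping at the first overflow.
# Returns (fits, number_of_lines_examined).
sub fits_on_one_page {
    my ($line_heights, $usable_height) = @_;
    my ($used, $seen) = (0, 0);
    for my $h (@$line_heights) {
        $seen++;
        $used += $h;
        return (0, $seen) if $used > $usable_height;   # quit the pass early
    }
    return (1, $seen);
}

my @song = (14) x 60;        # 60 lines of 14 points each (made-up numbers)
my ($fits, $seen) = fits_on_one_page(\@song, 700);
print $fits ? "one page\n" : "overflows after $seen lines\n";
```

An early exit like this only separates one-page songs from longer ones. Telling a two-page song from a three-page one would still need the pass to run to the end.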

PhilterPaper commented 2 years ago

See #95. Before I start implementing anything for page/object deletion, I need to think about a more general look at possibly caching the last several pages (before being written to output file) and being able to modify existing objects. For instance, the code discovers a widow (last line of the paragraph on the next page). Two solutions: 1) squeeze this page's leading a bit, and move the widowed line up to the end of this page, or 2) move the last line on this page down to the next page (also repositioning the widow) and open up the leading a bit on this page. All sorts of pitfalls can arise, so this needs to be considered carefully. It may take keeping the "written" objects in-memory, in a form that's easy enough to locate and move/modify them, and not actually writing them out until the oldest cached page needs to be output. Something like that.
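To make the caching idea concrete, here is a rough sketch of a cache that keeps the last few pages editable and only finalizes the oldest one when a newer page pushes it out. The `PageCache` class and its methods are hypothetical and pages are plain strings; it says nothing about how PDF::Builder stores pages internally.

```perl
use strict;
use warnings;

# Hypothetical page cache: pages stay editable until pushed out by newer ones.
package PageCache;

sub new {
    my ($class, %args) = @_;
    return bless { limit => $args{limit} // 3, pending => [], final => [] }, $class;
}

# Add a page (any scalar or reference); finalize the oldest if over the limit.
sub add {
    my ($self, $page) = @_;
    push @{ $self->{pending} }, $page;
    push @{ $self->{final} }, shift @{ $self->{pending} }
        while @{ $self->{pending} } > $self->{limit};
    return $self;
}

# Drop the last $n pending pages; pages already finalized cannot be discarded.
sub discard_last {
    my ($self, $n) = @_;
    my $have = @{ $self->{pending} };
    $n = $have if $n > $have;
    return splice @{ $self->{pending} }, $have - $n, $n;
}

# Finalize everything, e.g. just before saving.
sub flush {
    my ($self) = @_;
    push @{ $self->{final} }, splice @{ $self->{pending} };
    return @{ $self->{final} };
}

package main;

my $cache = PageCache->new(limit => 3);
$cache->add($_) for qw(p1 p2 blank s1 s2);
$cache->discard_last(3);          # drop blank, s1, s2 and re-process the song
$cache->add($_) for qw(s1 s2);
print join(' ', $cache->flush), "\n";
```

The limit bounds how far back a discard can reach, which is the trade-off of this design: a larger cache allows more rework but delays the point at which a page becomes final.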

sciurius commented 2 years ago

I'm inclined to say that deleting or overwriting pages is totally unrelated to manipulating objects on pages.

PhilterPaper commented 2 years ago

I just want to make sure that if I do something for deleting objects (or pages), that it doesn't make any further object/page manipulation (edit, move) difficult or a mess. For example, the widow/orphan type cleanup I mentioned above. One thing would be whether to tentatively write to a cache, rather than directly to the final data or even the file. If the application has a means to keep track of where stuff is, it could go back and selectively erase items, or edit them (e.g., reposition) or move them between pages. I just want to think about this carefully so that it's a unified and consistent thing, rather than a hodgepodge of hacks.

I suspect that deleting a page is more than just delete $page -- you need to delete everything that is a child of that page, the page itself, and then clean up any entries referring to that page, such as the pagestack. I'm not sure if all of that is exposed to an application (such as ChordPro), and even if it is, there are a lot of details to attend to.

Finally, this brings up the whole philosophical discussion of how much effort should an application (e.g., ChordPro) put into doing things tentatively (i.e., easy to back out), and how much PDF::Builder should assist in modifying things already written out. In the former case, would it be better to make two passes through a song, the first to do rough layout and determine the number of pages (and thus their alignment), and the second pass to actually put ink to paper?

sciurius commented 2 years ago

AFAICS, the PDF document is a tree structured object, where all references are indices in the xref list. Removing a page would require processing the page subtree and marking all objects as TBD (to be deleted). Since objects can be referred to by other pages (I assume), some kind of refcounting is required. So a first pass to count the refs, then remove the TBD objects that have a refcount of 1. Think mark-and-sweep GC. Not a hard job but it could be a nasty one.
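A toy sketch of the mark-and-sweep variant of this idea: detach the page, mark everything still reachable from the root, and sweep the rest. The object table is a plain hash with made-up ids, standing in for the xref list. Real page objects also carry back-references (such as /Parent) that this model ignores, so treat it as an outline of the technique rather than working PDF code.

```perl
use strict;
use warnings;

# Toy object table standing in for the xref list: id => list of referenced ids.
# All ids and the layout are made up for illustration.
my %objects = (
    catalog  => ['pages'],
    pages    => ['page1', 'page2'],
    page1    => ['content1', 'fontA'],
    page2    => ['content2', 'fontA', 'fontB'],
    content1 => [], content2 => [], fontA => [], fontB => [],
);

# Detach $page from $parent, then sweep everything no longer reachable
# from $root. Returns the sorted list of deleted ids.
sub delete_page {
    my ($objs, $root, $parent, $page) = @_;
    $objs->{$parent} = [ grep { $_ ne $page } @{ $objs->{$parent} } ];

    my %live;                                   # mark phase
    my @todo = ($root);
    while (defined(my $id = shift @todo)) {
        next if $live{$id}++;
        push @todo, @{ $objs->{$id} // [] };
    }

    my @dead = sort grep { !$live{$_} } keys %$objs;   # sweep phase
    delete @{$objs}{@dead};
    return @dead;
}

my @gone = delete_page(\%objects, 'catalog', 'pages', 'page2');
print "deleted: @gone\n";
```

Here fontA survives because page1 still refers to it, while fontB goes away with page2. The cost is a walk over every remaining object, which is the "nasty" part for large documents.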

So the question is, indeed, valid whether it is worth the trouble. For the particular case at hand, arranging ChordPro pages, I do have an alternative since the page headers and footers are added to all the song pages at the end of each song, when it is known how many pages the song will occupy.

But it seems somewhat logical to have a 'delete this page' operation.

Maybe that is the problem with this world: We know how to produce, but not how to neatly and cleanly delete.

PhilterPaper commented 2 years ago

Maybe that is the problem with this world: We know how to produce, but not how to neatly and cleanly delete.

Amen, Brother! (I assume that you're waxing philosophical here, and not being literal about programming.) Reduce, Reuse, Recycle (in decreasing order of preference) applies (at least, Reduce) in software, too. I'm trying to think of how to best help users (application developers) to "test write" their output so that it can be laid out first without a hard write commitment, and thus no need to delete already-written material. Of course, I can leave it to the developer to handle that themselves, and only "write" with a permanent marker, but is there anything that PDF::Builder could do to help here?

I think the entire document is built in memory, and written out in one shot (I need to check that). To allow editing (moving, changing, deleting) of objects, some way would need to be found to label parts (typically, by the application) so that the application could refer to them. The labels would be discarded during the write. I'm still mulling this over, and haven't decided on anything yet. As I said before, if I implement a page delete, I want the "experience" to be consistent with whatever other kinds of editing might be permitted, such as moving lines between pages and adjusting leading, to deal with widows and orphans. Even that is not simple if columns are not simple rectangles (line lengths changing, meaning a change in per-line content, not just moving the y coordinate up and down).

PhilterPaper commented 1 year ago

ssimms wrote (PDF::API2):

Some thoughts:

If I were writing this code (and without knowing the full context), I'd do two passes -- one to determine where blank pages are needed, and a second to do the actual generation. That may or may not be feasible in your case.
The PDF-preferred way to do this in one pass, based on my understanding of the spec, would be as you describe -- write the pages, then modify the page tree so that the unneeded pages are ignored (though not deleted). Unfortunately, PDF::API2 doesn't currently have handy tools for removing content or modifying the page stack other than adding/inserting pages.
Would you like to draft a $pdf->remove_page($page_number) patch? The potential gotcha is that pages are stored in a tree rather than something simple like an array (see PDF 1.7 section 7.7.3 Page Tree). Other than that, it's probably not all that difficult to implement.
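To show why the tree matters, here is a sketch of unlinking the Nth page from a nested page tree while keeping the Count entries of its ancestors consistent. The tree is built from plain hashes; `page`, `pages`, and this `remove_page` are illustrative helpers, not PDF::API2 or PDF::Builder methods, and the removed page's objects are only unlinked, not deleted.

```perl
use strict;
use warnings;

# Toy page tree: interior nodes have Kids and Count, leaves are pages.
# Plain hashes for illustration only; not PDF::API2 or PDF::Builder objects.
sub page  { my ($name) = @_; return { Type => 'Page', Name => $name } }
sub pages {
    my @kids  = @_;
    my $count = 0;
    $count += ($_->{Type} eq 'Page' ? 1 : $_->{Count}) for @kids;
    return { Type => 'Pages', Kids => [@kids], Count => $count };
}

# Unlink 1-based page $n from the tree under $node, decrementing Count on
# every ancestor. Returns the removed leaf, or undef if $n is out of range.
sub remove_page {
    my ($node, $n) = @_;
    return undef if $n < 1 || $n > $node->{Count};
    for my $i (0 .. $#{ $node->{Kids} }) {
        my $kid  = $node->{Kids}[$i];
        my $size = $kid->{Type} eq 'Page' ? 1 : $kid->{Count};
        if ($n > $size) { $n -= $size; next }   # target is in a later kid
        my $removed = $kid->{Type} eq 'Page'
            ? splice(@{ $node->{Kids} }, $i, 1)
            : remove_page($kid, $n);
        $node->{Count}--;
        return $removed;
    }
    return undef;
}

my $root = pages(page('p1'), pages(page('p2'), page('p3')), page('p4'));
my $gone = remove_page($root, 3);
print "removed $gone->{Name}, $root->{Count} pages remain\n";
```

This version can leave an interior node with no kids behind. Whether such empty nodes should be pruned is a separate decision.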

PhilterPaper commented 1 year ago

Johan,

Perhaps you could specify what sort of things you envision doing (what you would find handy) with deleting or moving not only pages, but parts of pages. It might be possible to mark and delete (or move, or add x,y offsets to) the "last N items written to a given text or graphics context". It might even be possible to locate and delete/move/replace/edit/insert even a single object, when counting from either the beginning or end of a text or graphics context.

Of course it can be argued that none of this would really be needed with careful planning up front, but if you feel it would be useful, what sort of things might you do, and what would the interface look like? One of the biggest problems with trying to modify part of a context is knowing what items there are -- correlating Builder-level calls to what sort of stuff (possibly multiple PDF primitives) resulted from them.

For my current work, something along these lines might be useful for dealing with widows and orphans, when you find you need to reduce or increase leading to move a line or two of text from one page to another. I'm not sure how to do that cleanly and reliably without "marking" items in the context, such as the start of each line on the page. And when there's other stuff mixed in to an already-rendered page, such as illustrations, rules, inserts/non-rectangular columns, etc. it could get very interesting. I'm open to ideas, including "trial writes" to pages, where nothing is actually written to the PDF output until you're satisfied with the page layout, and include some sort of labeling or other IDs to mark content for deletion, moving, updating coordinates, etc. This might take the form of a page cache, where the last N pages are kept in virtual form, before the oldest is considered final and is written out (to the data structure, and eventually to the file).

Needless to say, this would be a major upgrade to PDF::Builder, so I want to solicit ideas first. I've got a lot on my plate right now, and probably can't even begin to get to it until some time next Spring (at earliest). In the meantime, I'd like to hear what users might want to do with such capabilities. I don't want to add random facilities such as "delete a page" in a haphazard manner, without thinking how it might integrate to more sophisticated content-management capabilities.

P.S. How about editing a PDF document read in from file? Deleting/moving/changing content that already exists. Could this be handled the same way as freshly-produced content?

sciurius commented 1 year ago

At the API level, the page is the smallest object that can be uniquely identified. My scope is to manipulate pages. More specifically, to be able to occasionally discard an alignment page when I decide that it is not wanted.

I described the situation of a book (songbook) where each chapter (song) starts at a right page so an alignment page is inserted after a chapter (song) that occupies an odd number of pages. There is one exception: when the song takes precisely two pages, it is started on a left page so it can be viewed without having to turn the page. In a sequential production process the decision to insert a preceding alignment page must be made before the song is processed, yet we do not know whether the song occupies two pages until processing is complete. Two pass production is something that I'd very much like to avoid, since generating songs can be time consuming (songs may contain sections of lilypond and abc code).

So the easiest approach is to process a song, insert an alignment page if the pagecount is odd, process the new song and drop the preceding alignment page if the new song occupies two pages. For the production of the songbook there are other approaches, such as to insert the alignment page before (in place) the song, after (in time) the song is processed. This may be complicated in the case where the alignment pages need to have headers and footers that belong to the (logically) preceding song. Complicated, but doable.
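The alignment rule itself can be sketched as a small planning function. It assumes page 1 is a right-hand page and that a two-page song already falling on a right-hand page is left where it is; `plan_pages` and the page labels are hypothetical, and no PDF is produced.

```perl
use strict;
use warnings;

# Hypothetical planner: given the page count of each song in order, return
# the page sequence with alignment pages ('blank') placed according to the
# rule described above. Assumes page 1 is a right-hand page.
sub plan_pages {
    my @song_lengths = @_;
    my @plan;
    for my $i (0 .. $#song_lengths) {
        my $len          = $song_lengths[$i];
        my $next_is_left = (@plan + 1) % 2 == 0;
        # A song starts on a right-hand page, unless it has exactly two
        # pages, in which case a left-hand start is what we want.
        push @plan, 'blank' if $next_is_left && $len != 2;
        push @plan, map { sprintf 'song%d.%d', $i + 1, $_ } 1 .. $len;
    }
    return @plan;
}

print join(' ', plan_pages(1, 2, 3)), "\n";
```

Each decision needs only the length of the song about to be placed, so this fits the approach of inserting the alignment page before the song (in place) once the song has been processed.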

If we would have a way to neatly discard a page, it would be possible to have scratch pages, pages on which objects can be placed and then maybe copied to the main PDF. For your typesetting case, this would make it possible to format a paragraph, decide whether it fits, and then put it on (copy to) the page.

The suggestion to edit a PDF document read in from file is not viable, since reading/copying pages from an existing document loses outline information and link annotations (maybe all annotations, haven't checked). Also, I'm pretty sure that common resources like fonts will end up duplicated.

PhilterPaper commented 1 year ago

At the API level, the page is the smallest object that can be uniquely identified.

By default, yes. To identify any object (or part of an object) at a finer scale would require the addition of labels of some sort (to be discarded at the final write to file).

In a sequential production process the decision to insert a preceding alignment page must be made before the song is processed, yet we do not know whether the song occupies two pages until processing is complete.

Thus my earlier suggestion to look into a "rough draft" of the song to determine whether it should start on a left or right hand page. Maybe you could run the lilypond and abc and all that stuff in the first pass, and store the output in some internal format. Once you know left or right page start, you can finalize the output to the file (the PDF calls) and add any headers and footers that depend on which page. That's the way I would approach it, rather than writing a bunch of stuff and then having to delete it (even if it's all still in memory at that point).

it would be possible to have scratch pages

I have mooted this idea before, of having tentative writes to a cache of pages, which would be flexible enough to permit some rearrangement before actually being written (to the data structures, which later are actually written to file).

The suggestion to edit a PDF document read in from file is not viable, since reading/copying pages from an existing document loses outline information and link annotations (maybe all annotations, haven't checked). Also, I'm pretty sure that common resources like fonts will end up duplicated.

If you read in individual pages, such problems might arise, but reading in an entire document at once should preserve everything. (Yes, there is a known problem #186 about losing annotations, but that should be fixable.) Moving (reordering) pages may be as simple as rearranging the /Kids page list in /Pages. Deleting a page should also involve removing dependent resources (garbage collection), so it isn't as simple. A page might even be copied (duplicated), but I don't know how useful that would be, unless it's a blank page or it can be edited.

As it currently stands, a page is the lowest level that can be identified and potentially manipulated, whether in a freshly-created document or a read-in file. Anything else would require some sort of labeling (explicitly added by the user?), which would have to be excluded from the final write to file. Labels might be difficult to add to a file just read in. I can see labeling a paragraph or a line, for purpose of dealing with widows and orphans, so that a line could be moved from one page to another and line positioning be adjusted (change leading). Still, it might be easier to dummy write (tentatively write) the lines and make the decision there to move and adjust lines, rather than "writing" to the PDF data structure and then having to update it.

More thought needs to go into this before I start adding object manipulation to the API. Moving and (probably) deleting whole pages should be possible without too much fuss, but I'd like to have a consistent, integrated-looking interface to handle other facilities that might be added.

PhilterPaper commented 1 year ago

Deleting a page becomes more and more complicated as you've added more content to it. We would want to (preferably) delete all dependent (child) objects, but watch out for anything that might potentially be shared among pages. For instance, a font used just on this page ($pdf->xxxxfont()) ought to be deleted, but can we always tell who is using it? I'd prefer not to have to walk all the object links (of all other pages, including later ones) to find out which objects are exclusively used by this page. If you want to delete several pages, and/or delete a page but leave later-added pages (delete one in the middle), that might get complicated. I haven't had time to look deeply into the PDF structure to see what might be done, but I'm scaring myself with thoughts of potential problems.

An alternative is to merely remove a page from the /Kids list, and eat the wasted space of objects that are used only by this page, and ought to have been removed.

sciurius commented 1 year ago

Yes, I did some experiments and it quickly turned out to be much more complex than I imagined. AFAIC we can drop the question, even though the concept of scratch pages looks attractive.

PhilterPaper commented 1 year ago

I'll leave the issue open, in case I have a brilliant idea :-), but I don't see much progress being made any time soon. With the stuff I have on my plate, it may be next summer or fall before I can spend any serious time looking at this. If you come up with something in the meantime, I would consider accepting a PR, but you would need to thoroughly document what you're trying to do.

To be honest, I'm not sure scratch pages are all that good of an idea -- as I have suggested before, I think you're better off doing some preplanning and roughing out the page(s) before calling Builder routines to start writing data structures. If you're calling other, runtime-expensive, libraries (to get precise sizes and layout) you may not have much choice but to run them again, either way. If they write directly to the PDF, rather than just returning sizes, etc., that rather complicates things.

By the way, aren't abc and lilypond TeX-related code? You might be better off looking into using PDFTeX (or PDFLaTeX) directly to produce your output. A Perl front end to create the source to (La)TeX, doing some of the layout, might be useful? Anyway, something to consider, if you haven't already looked at it.

sciurius commented 1 year ago

I'll leave the issue open, in case I have a brilliant idea

That's fine.

I'm not sure scratch pages are all that good of an idea

I'll keep it in the back of my mind when I encounter possible use cases. And, of course, planning in advance is better.

aren't abc and lilypond TeX-related code?

The ABC processor generates (E)PS, nowadays SVG. Remember my question some time ago about including SVG images in the PDF?

BTW, I made a prototype SVG → PDF module that works in a lot of cases, but SVG is extremely complex and the creator of the ABC to SVG processor doesn't feel discouraged to use all kinds of fancy SVG features like CSS. It is a pity that the only widely used library, rSVG, still only supports a too small subset of SVG.

LilyPond has a dark past (mtex, MuTeX) where LaTeX was involved, but now it generates (E)PS natively.

PhilterPaper commented 1 year ago

SVG is on my list of priority projects (see #48 and especially #89). Equation support using MathJax will need SVG processing. If you know of a good library to handle SVG-to-PDF (vector, NOT rasterized), I'd appreciate hearing about it! At this point I don't know exactly how much of SVG will be needed, for a useful general-purpose SVG image supporter (in addition to what MathJax needs).

PhilterPaper commented 1 year ago

If you (and others) have tools that produce (E)PS, perhaps we should be thinking of adding native (E)PS image support? IIRC, PS is a programming language in and of itself, with vector graphics and some font capability, so I can't call PS quite a similar flavor to PDF, but if we're lucky it may be a reasonable translation to PDF primitives.

sciurius commented 1 year ago

Personally I stopped producing PostScript a long time ago; producing PDF is much easier and more flexible, and does not require special software/printers.

To include PostScript in PDF... well, good luck. The best PostScript engine is GhostScript, and it's a rasterizer.

sciurius commented 1 year ago

I've been told that the only way to process an SVG with all bells and whistles is to call Chrome.

PhilterPaper commented 1 year ago

Yeah, supporting PS may well be A Bridge Too Far. You'd have to have a full PS interpreter. The graphics primitives produced probably wouldn't be too bad to map to PDF primitives. I don't know about text primitives, especially those dealing with just the glyph outlines -- PDF can render (stroke) just the glyph outlines, so maybe that could be used to do some sort of clipping?

I get the feeling that (E)PS is mostly fading away, and fewer tools will output it, so it may not be worth putting a lot of effort into supporting it in PDF (as either vector or raster graphics).

PhilterPaper commented 1 year ago

Just a note that in an upcoming release of the new markup support (#185), I find the beginning of a line of text (length of $text->{' stream'}) and (when it is necessary to shift the position of a line due to font extent changes) I search for ' Tm', back up by two fields (x and y), and adjust them. It appears to work OK. So, it should be possible to update a section of a stream (not trivial!) and even to delete a section (so long as you recorded its beginning and ending offsets). Of course, if you've done something besides add to a text or graphics stream, such as start a new page, it's still up to you to deal with these side issues. Anyway, it might be a possibility to empty out all or part of text and graphic streams and write new content, rather than deleting an entire page.
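A rough sketch of that offset technique, using a plain string in place of the real stream. It assumes the stream is uncompressed text and that ' Tm' never appears inside a string literal after the recorded offset. `shift_tm_y` and `delete_span` are hypothetical helpers, not PDF::Builder routines.

```perl
use strict;
use warnings;

# Shift the y operand (last field before Tm) of the first Tm operator found
# at or after offset $start. Works on a copy and returns the new string.
sub shift_tm_y {
    my ($stream, $start, $dy) = @_;
    my $num = qr/[-+]?(?:\d+\.?\d*|\.\d+)/;
    pos($stream) = $start;
    if ($stream =~ /\G.*?($num)\s+($num)\s+Tm\b/gs) {
        my ($from, $len, $y) = ($-[2], $+[2] - $-[2], $2);
        substr($stream, $from, $len, $y + $dy);    # overwrite y in place
    }
    return $stream;
}

# Remove the bytes between two recorded offsets; returns the new string.
sub delete_span {
    my ($stream, $begin, $end) = @_;
    substr($stream, $begin, $end - $begin, '');
    return $stream;
}

my $stream = "BT 1 0 0 1 72 700 Tm (First) Tj ET ";
my $mark   = length $stream;          # offset recorded before the next line
$stream   .= "BT 1 0 0 1 72 686 Tm (Second) Tj ET ";
my $end    = length $stream;

my $moved   = shift_tm_y($stream, $mark, -4);     # lower the second line
my $trimmed = delete_span($stream, $mark, $end);  # or drop it entirely
print "$moved\n$trimmed\n";
```

Any edit that changes the length of a span shifts every offset recorded after it, so later marks would need adjusting, or edits would need to be applied from the end of the stream backwards.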

It probably would have been a lot more elegant to save the text writes to an intermediate form (where coordinates could be easily adjusted) before deciding that a line was final and could be written via PDF::Builder primitives to the data (and eventually the PDF file), but I already had a lot of code that just wrote out text piecemeal, and needed to move existing text due to later line height changes.

PhilterPaper commented 4 months ago

Something necessary for good text processing, particularly with column(), is the ability to adjust a page after all or most of the content has already been output. This could cover leading changes and/or moving lines between columns needed to deal with widows and orphans, breaking a line differently to avoid breaks between tags at the end of a line, etc. Anyway, such applications should be kept in mind when designing functions to delete or move existing content.

PhilterPaper / Perl-PDF-Builder

Delete, discard, overwrite pages #189