[CTS 24] Pseudo page objects

« July 07, 2017, 02:29:13 AM » by sciurius

For a complex typesetting job I would like to do the following:

1. create text and gfx objects
2. create the desired texts and graphics
3. establish the size (bounding box)
4. put the text and gfx objects onto another page

The main purpose is to decide whether the result still fits on the page, or should be moved to the next page.

My gut feeling is that this should be possible with PDF::Builder but my attempts have been unsuccessful. Any suggestions/ideas?

« July 07, 2017, 09:27:13 AM » by Phil

Well, if your markup needs are very simple (just a stream of text to be fit to a column), the paragraph and section calls may be enough. They will return any text that needs to go to the next page (but could allow a widow). Anything more complicated than that would require either

A virtual output page, where you could tentatively write text and other things to the page, and if you're happy with it (i.e., it doesn't overflow), give a command to "put ink on the page" and make the write permanent.
A test write to ask how much space something will take, and make the decision whether to write on this page or go to the next. It would be the normal processing up to the point of "writing" to the output data structure. All the processing would have to be done again for the actual write.
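
The "test write" option can be illustrated with a minimal sketch. It is written in Python for brevity rather than Perl, and every name in it (`Block`, `place_blocks`, the margin defaults) is hypothetical, not part of PDF::Builder. It assumes each block's height is already known from a measuring pass.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical layout item: a run of text or a graphic with a known height."""
    name: str
    height: float  # in points


def place_blocks(blocks, page_height, top_margin=36.0, bottom_margin=36.0):
    """Measure first, then place: returns a list of pages of (name, y_top) tuples."""
    pages = [[]]
    y = page_height - top_margin  # PDF origin is bottom-left, so y decreases down the page
    for block in blocks:
        # The "test write": would this block cross the bottom margin?
        # A block that is too tall for any page is still placed on an empty page.
        if y - block.height < bottom_margin and pages[-1]:
            pages.append([])
            y = page_height - top_margin
        pages[-1].append((block.name, y))
        y -= block.height
    return pages
```

The cost noted above still applies: the measuring pass and the real write are two separate passes over the same content.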

If a full paragraph doesn't fit, you would want to know if you can split it (without widows or orphans), which of course gets into the field of paragraph shaping. You probably would not want to rearrange text, but it might be desirable to move up an image in order to fill a hole at the bottom of a page, or move text above an image to do the same thing. Either way (with text), you're likely going to need to split a paragraph. It would be up to you to be aware of text such as "in the image above" when that text has been moved above the referenced image! Some sort of cross-reference call to output your choice of appropriate text might be nice.

It shouldn't be difficult to add a call to tell you how much space remains on the page (both in lines at current settings, and as a dimension such as points or cm). It might not be too bad to add a call to ask how much space (lines and dimension) a given new paragraph will take up (simple text from a string), and where (if anywhere) it can be split without introducing a widow or orphan. Here you're getting into more advanced typesetting that may not be appropriate for PDF::Builder, but should be thought about for companion packages.
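
As a rough sketch of what such calls might compute (Python, with hypothetical function names; no such calls exist in PDF::Builder as far as this thread says), assume a fixed leading and a rule that at least two lines must stay together on each side of a break:

```python
def lines_remaining(y_cursor, bottom_margin, leading):
    """Whole lines that still fit between the cursor and the bottom margin."""
    return max(0, int((y_cursor - bottom_margin) // leading))


def legal_splits(n_lines, min_orphan=2, min_widow=2):
    """Line counts k such that keeping k lines here and moving the rest is allowed.

    min_orphan: fewest lines allowed at the bottom of the current page.
    min_widow:  fewest lines allowed at the top of the next page.
    """
    return list(range(min_orphan, n_lines - min_widow + 1))


def choose_split(n_lines, room, min_orphan=2, min_widow=2):
    """How many lines to keep on this page; 0 means move the whole paragraph."""
    if n_lines <= room:
        return n_lines  # the whole paragraph fits
    fitting = [k for k in legal_splits(n_lines, min_orphan, min_widow) if k <= room]
    return max(fitting) if fitting else 0
```

A three-line paragraph has no legal split under these defaults, so it either fits whole or moves whole.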

« July 07, 2017, 03:10:57 PM » by sciurius

Thanks for the feedback. Unfortunately my needs are more complex than a paragraph of text.

Although both suggestions (virtual page and test write) are interesting, they involve repeating everything (on the real page and position) after the testing — precisely what I would like to avoid.

« July 07, 2017, 05:45:29 PM » by Phil

I think the virtual page method could be done without repeating everything. It would involve adding a flag to items added to objects, indicating whether this is a real write or if it's tentative. A "tentative" write, if it didn't overflow the page (or otherwise displease you), would be converted into a "real" write by changing the flag. Otherwise, you either keep adding "tentative" content, or erase it and do something else (new page, etc.). It might even be possible to change X and Y values of "tentative" content to re-position it (e.g., to stretch text baselines slightly to fill the page). That might require additional changes to the object data structure to mark what is a changeable address, and what should be kept relative to another location (e.g., when drawing you might want to change only absolute addresses, and not relative addresses).
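
A minimal sketch of the tentative-flag idea follows, in Python with entirely hypothetical names (`Item`, `VirtualPage`, and their methods are illustrations, not PDF::Builder classes). It assumes every item carries an absolute position and a height.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    content: str
    x: float
    y: float          # top of the item, in points from the page bottom
    height: float
    tentative: bool = True  # every write starts out tentative


@dataclass
class VirtualPage:
    bottom_margin: float = 36.0
    items: list = field(default_factory=list)

    def write(self, content, x, y, height):
        item = Item(content, x, y, height)
        self.items.append(item)
        return item

    def overflows(self):
        """True if any item extends below the bottom margin."""
        return any(i.y - i.height < self.bottom_margin for i in self.items)

    def commit(self):
        """Turn all tentative writes into real ones by flipping the flag."""
        for i in self.items:
            i.tentative = False

    def rollback(self):
        """Erase tentative writes and return them, e.g. to retry on a new page."""
        removed = [i for i in self.items if i.tentative]
        self.items = [i for i in self.items if not i.tentative]
        return removed

    def shift_tentative(self, dx, dy):
        """Re-position only the content that has not been committed yet."""
        for i in self.items:
            if i.tentative:
                i.x += dx
                i.y += dy
```

Nothing is redone on commit, because the items written tentatively are the same ones that become permanent.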

As an alternative, would it be useful in general to be able to "walk" all objects on the page, and move or delete items under program control? There could be "helper" functions to change text baseline spacing in a consistent manner, etc., or scale up/down some drawing (graphics). If a paragraph has overflowed, you could even chop off the bottom and move it to the next page, or call a paragraph shaper to do a little "nip and tuck" to get the paragraph to fit (replace the existing paragraph). None of this would be trivial, of course. It would probably involve keeping extra data with the page's objects, which would be purged when it's actually written to file.

If none of this works for you, perhaps you could describe some detailed examples of what you're trying to do here. I take it you're trying to output pages without a lot of (or perhaps, any) manual intervention, so trial-and-error fitting is unacceptable.

« July 09, 2017, 04:49:17 PM » by sciurius

You're thinking too much in terms of low-level operations.

I thought it would be straightforward to just create the objects, and then move the top-object to another location on another page if necessary.

I think I'll need to rethink this a bit more.

« July 10, 2017, 11:26:18 AM » by Phil

Well, I can't think of getting much lower in operations than manually moving objects between pages... but that's an interesting idea. Would the object be unique on the page (i.e., not sharing the same $text object as everything else)? Then it could be moved as one object, rather than having to first split up an object. It might be feasible if this is done early enough in the process, before a bunch of other stuff is done that creates a lot of cross links between PDF objects (targeted to a page) and complicates things.

I've been mulling over something like this for a while, and think that "writing" to a virtual page might be best, giving the program a chance to move stuff around on a page and even between pages. You might keep the last two or three pages "written" in virtual form (sort of a VM) to ease the task of adjustment, and when the next new page is started, declare the oldest page "done" and actually write it out to the file. Something like that.

A few thoughts on how this might be implemented. cc'ing @sciurius in case he wants to comment.

PDF::Builder would keep an array of pages, configurable in new() and defaulting to three. When a new page is requested, take the oldest page in the array and write it out (normal "page" operations), clearing the content from the array/cache and permitting the space to be reused. Now, what sort of content should go in the array page? The intent is to make it easily accessible for modification (e.g., adjust the leading, move some content to the following page [usually at a new Y starting point], block move content around the page) yet efficiently written out so that the proper PDF calls can be made. At save() or saveas() (any other places?) the remainder of the cache would be written out.
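
The cache behaviour described here could be sketched as below. This is Python with hypothetical names (`PageCache`, `emit`); pages are plain lists standing in for whatever structure is finally chosen, and `emit` stands in for the normal "page" output operations.

```python
from collections import deque


class PageCache:
    """Keep the newest `depth` pages editable; write out the oldest on overflow."""

    def __init__(self, emit, depth=3):
        self.emit = emit        # callback that really writes a page to the file
        self.depth = depth      # would be configurable in new(), defaulting to three
        self.pages = deque()

    def new_page(self):
        # When the cache is full, the oldest page is declared "done" and written out.
        if len(self.pages) >= self.depth:
            self.emit(self.pages.popleft())
        page = []
        self.pages.append(page)
        return page

    def flush(self):
        """Write out whatever is still cached, as save() or saveas() would need to."""
        while self.pages:
            self.emit(self.pages.popleft())
```

Pages leave the cache strictly in the order they were created, so page numbering in the output is unaffected.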

What sort of data should be kept in the array pages? It can't be the final PDF objects, as we need to be able to access each item for modification (especially movement on or between pages). If you find a widow at the top of a column, you need to go to the previous page or column and reduce the leading a bit so that the line can now fit in that page or column (and physically move the line). Images and other content may need to be moved a bit. If you have an orphan at the bottom of a page or column, you could simply order it moved to the top of the next page or column, and possibly adjust the leading (and image positions) on this page. Overall paragraph shaping gets involved here, both for non-rectangular columns and appropriate word-splitting (hyphenation), per Knuth-Plass. Moving line(s) around when columns are not necessarily of constant width could be a bit of a challenge, possibly resulting in one or more reruns of text distribution calls! Multiple text and graphics objects should be allowed, along with a way to set the order in which they will be presented and to specify which one you're writing to at any given time. Once the architecture is determined, "helper" functions would need to be written to perform common operations in a predictable manner, while still allowing custom operations. The idea of subpages or "minipages" should be explored, to ease the writing of inserts, footnotes, table cells, etc., treating them as objects on the page.

We don't want data at too high a level kept on a cache page, such as a string for an entire paragraph (versus the individual lines, each with their own Y coordinate). The user needs to see whether a paragraph or other content fits, which means decomposing it to at least the line level. Perhaps both versions could be kept, regenerating the lines whenever something changes? Some sort of higher-level meta language might be good here. The same language might even be read in from a file, to permit the generation of PDF pages from unformatted data prepared elsewhere. It could include Pango/HTML style markup. The issue of callbacks arises, as in how to handle certain events such as determining header and footer content to be filled in (and the user may not know the exact content until the rest of the page is finished, e.g., the final entry on a dictionary page).

Along with advanced paragraph shaping (Knuth-Plass) and other high-level functions (possibly use of Text::Layout, as well as HarfBuzz::Shaper, table, equation, and possibly simple picture drawing subsystems), page cache management might belong in a separate Perl package, calling PDF::Builder for lower-level functions. Could the entire page cache system be at a higher level, so that the user never has to directly call low-level Builder functions? The user could use the high-level functionality to write each page, and not have to worry about the current low-level calls. Any low-level calls would have to be "wrapped" in the meta language and not output to the PDF page directly.

It's beginning to sound like PDF::Builder should be left simply as a low-level interface, and a new package written over it to do high-level layout, and call PDF::Builder for production of the actual PDF. It's an open issue as to whether the page management itself and the conversion to PDF should be part of PDF::Builder, or part of the new package. If someone has an interest in working on this, please don't go off and do it on your own, surprising me with the finished product! I would like to have input on how you architect this new package, so that I can be assured that everything is being covered.

In addition to the eqn, tbl, and pic work-alike modules, 2D and 3D graphing modules would be a good addition. Possibly something freely usable already exists in the Perl world that could be adapted to output PDF (or at least, the higher level discussed here). Note that for tbl, there is already a PDF::Table module, but I'm not sure how adaptable that would be to the new architecture. It wants to write the PDF directly (currently text-only) rather than returning high-level content to PDF::Builder. At least it might supply some algorithms for sizing table cells, with "minipages" used for filling arbitrary content. For eqn, there might be two separate flavors, one accepting (La)TeX style input and the other troff style, although it might be best to come up with a new eqn language blending the best of the two.

Ref: « July 07, 2017, 09:27:13 AM » by Phil, the notion of modifying text "on the fly" to indicate relative placement is an interesting one, especially if objects may be floated around to best fill page space (text, images, equations, tables, etc.). It may not be safe to hard code "see Fig. 12 above" if there's a chance that the referencing text could be moved above the figure (and revised to "see Fig. 12 below").

As a more general facility, you would want to be able to internally label some object, and then select automatically from "above|below", "facing|next|previous page", and perhaps others. Of course, inserting different text strings could result in different space being taken, interfering with attempts to move text! Thought needs to be given to a general cross-reference facility ("See xxxx on page nnn|above|below|facing page|previous page|next page", as well as index, footnote/chapter note/end note, and bibliography use, etc.) and automatic numbering of objects, including entire lists (another module to support: generation of simple, ordered, unordered, definition, and other kinds of lists, including the ability to interrupt and resume a list with page breaks or full paragraphs in-between). This would require keeping document-global lists of object page and Y coordinates.
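
Given such a global list of page and Y positions, choosing the phrase could look like the following sketch. It is Python with a hypothetical function name, and it assumes 1-based page numbers, PDF coordinates (larger Y is higher on the page), and two-page spreads with even pages on the left.

```python
def relative_phrase(ref_page, ref_y, target_page, target_y):
    """Pick wording for a reference at (ref_page, ref_y) to an object at the target."""
    if ref_page == target_page:
        return "above" if target_y > ref_y else "below"
    if target_page == ref_page + 1:
        # An even (left-hand) page faces the odd page that follows it.
        return "on the facing page" if ref_page % 2 == 0 else "on the next page"
    if target_page == ref_page - 1:
        # An odd (right-hand) page faces the even page before it.
        return "on the facing page" if ref_page % 2 == 1 else "on the previous page"
    return f"on page {target_page}"
```

Because the phrases differ in length, the result would have to be re-evaluated whenever either object moves, which is the interference problem mentioned above.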

Then there's the whole issue of marking text (or other objects) to be floated up or down within range limits, to most completely fill a page, rather than leaving gaping holes where a picture didn't quite fit, and flowed to the next page.

Finally, is there any point in writing this "higher level layout" package to be independent of the output format? Other than PDF, is there any fairly widely-used format that might need such assistance? LaTeX already does a pretty good job of this, and troff isn't that widely used. MS Word (.doc, .docx) probably wouldn't work that well, as it's likely to reformat on its own (the whole point is to fix the layout in this package, and just leave the low-level rendering to someone else).

If we can reliably find all text and graphics positioning commands on an already-written page (but not yet actually written to file), it may be feasible to move around content on a page, and even between pages. I already do this a little in column() in order to place underlines, etc. in the right place. Moving content between pages is a lot harder, as it is more than just updating x and y locations; it is creating new parents (if necessary), resources such as opened fonts, and the like. It is also desirable (if not feasible) to remove no-longer-needed resources from a page (or at least, move them to another page where they're needed).

Generic support for moving content within and across pages might not be any harder than writing out virtual pages, if keeping track of where stuff is on the page isn't too bad. In column(), I keep track of the offset of the [under]lined text's start in the text stream -- perhaps a more formal way of doing that could be cobbled up. Presumably, everything after the start of the content stream subsection will be moved by some delta-x and delta-y, although either could vary (e.g., for a leading change).
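
To make the "everything after an offset moves by a delta" idea concrete, here is a deliberately simplified Python sketch. `shift_tm` is a hypothetical helper, not what column() does. It treats the content stream as plain text and only touches `Tm` operators, so it assumes no literal string in the stream happens to contain something that looks like one.

```python
import re

NUM = r"[-+]?(?:\d+\.?\d*|\.\d+)"
# Six numeric operands followed by the Tm (set text matrix) operator.
TM = re.compile(rf"({NUM})\s+({NUM})\s+({NUM})\s+({NUM})\s+({NUM})\s+({NUM})\s+Tm\b")


def fmt(value):
    """Format a number compactly, without trailing zeros."""
    return f"{value:.4f}".rstrip("0").rstrip(".")


def shift_tm(stream, start, dx, dy):
    """Add (dx, dy) to the translation of every Tm operator at or after offset `start`."""
    head, tail = stream[:start], stream[start:]

    def repl(match):
        a, b, c, d, e, f = match.groups()
        return f"{a} {b} {c} {d} {fmt(float(e) + dx)} {fmt(float(f) + dy)} Tm"

    return head + TM.sub(repl, tail)
```

Relative moves such as `Td` are left alone on purpose, since they follow whatever absolute position precedes them.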

Something to mull over.

One model for a "virtual page" approach would be that the entire document might be "written" as one page, with each object labeled and relocatable to any position on any page. You do need to be careful about page-dependent stuff, such as whether to split a table across pages, or widows and orphans, or column width changes at a page break. Parents and any needed resource objects could be created on the fly. This may be such a different model for the program that it should be an entirely new product! Incidentally, I would probably combine text and graphics into one stream, rather than continuing to wrangle separate text and graphics streams.

There are a number of operations that could conceivably be done against content already "written" to a page (but not yet to file). These include

Deleting all or part of text and/or graphics object content: that part isn't too hard, but related objects (e.g., new fonts and other resources) would have to be kept, unless there is an easy way to figure out if something is no longer referenced. Or, the user could manually delete them in some way. Dangling references will always be a threat.
Moving content within a page: anything that is absolutely positioned (e.g., by the Tm operator) could have a delta-x and delta-y added to it. This could be all or part of an already output content stream. A delta-y (or delta-x) could vary proportionately along the Y axis, so that leading could be increased or decreased. Note that if non-rectangular columns are being output, floats (images, etc.) would have to be repositioned, too. If decreasing leading (e.g., to pull in a widow), it is possible that an indented space in the side of a paragraph may now be too short to fit the image. Possibly an image's height could thus be reduced, avoiding having to reflow the paragraph to force an additional short line.
Moving or copying content between pages: this could be very tricky, as additional content such as a font or other resource might need to be copied over to the target page. We might not know that it's safe to delete it from the source page, or even what needs to be copied. To make room on the target page, existing content may have to be repositioned on a massive scale (e.g., increasing leading to push an orphan to the next page, or decreasing leading to pull in a widow from the following page). If the line length is different between pages for some reason, things could get very messy.
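
The proportional delta-y mentioned for leading changes can be worked through in a short Python sketch. Both function names are hypothetical, and the sketch assumes baselines are stored as absolute Y values with the first baseline held fixed.

```python
def leading_scale_to_fit(top, lowest, bottom_limit):
    """Scale factor that moves the baseline at `lowest` up (or down) to `bottom_limit`."""
    return (top - bottom_limit) / (top - lowest)


def rescale_leading(baselines, scale):
    """Scale each baseline's distance from the first one; the first stays where it is."""
    if not baselines:
        return []
    top = baselines[0]
    return [top - (top - y) * scale for y in baselines]
```

For example, four lines at 14 pt leading starting at Y = 700 would put a fifth (widow) line at 644. If the lowest allowed baseline is 651, the scale is (700 - 651) / (700 - 644) = 0.875, i.e. a leading of 12.25 pt, and the delta-y applied to each line grows with its distance from the top.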

Some of these things might be better handled in column() by looking at the original output, and adding markup to adjust the output for a second run. Of course, the effects of a change can cascade down all following pages, necessitating multiple reruns of the document! Fully automating such things would probably require additional data to be written to each page and object on the page, to be ignored when outputting the PDF. This could include lists of resources used by each object (including paragraphs); reference counts for resources, etc., to tell if they are no longer used; high level positioning and paragraph shapes; and so on. We would like to avoid doing a full rerun of a page, but a few partial reruns may be unavoidable. Anyway, yet more to mull over.
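
The per-object resource lists and reference counts could be kept along these lines. This is a Python sketch with hypothetical names; the resource names such as "F1" are only labels here, not real PDF resource dictionaries.

```python
from collections import Counter


class PageResources:
    """Track which resources each object on a page uses, with reference counts."""

    def __init__(self):
        self.refs = Counter()   # resource name -> number of objects using it
        self.objects = {}       # object id -> set of resource names

    def add_object(self, obj_id, resources):
        self.objects[obj_id] = set(resources)
        self.refs.update(self.objects[obj_id])

    def remove_object(self, obj_id):
        """Drop an object; return the resources nothing on the page uses any more."""
        freed = []
        for name in sorted(self.objects.pop(obj_id)):
            self.refs[name] -= 1
            if self.refs[name] == 0:
                del self.refs[name]
                freed.append(name)
        return freed
```

Moving an object to another page would then be a `remove_object` on the source page followed by an `add_object` on the target, with the freed list showing what can safely be dropped from the source.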

PhilterPaper / Perl-PDF-Builder

[CTS 24] Pseudo page objects #95