Further enhancements to markup (column() method)

The 3.025 release, containing column(), is far from the end of the job! I don't know how quickly I'll be able to get to the things on this list, but I'd like to do them all. PRs to take care of individual items would be appreciated, of course (and sponsorship of new features would be even better!). Most of these items can be implemented in any order, so don't be afraid to request priority for some things, or to help out.

Items flagged with a star * are not official HTML or CSS, but would be extensions.

Perhaps in the 3.027 release?

(I had hoped to get at least some of these in 3.026, but they didn't make it due to fixing other stuff. Hopefully 3.027.)

Tolerance of mixed case tag names (currently only lowercase HTML tags are supported, e.g., <P> would fail).
Flag unknown HTML tags and CSS properties, so at least you know they're not being handled as expected
<img>, including SVG, support
whitespace control per CSS (<pre>, <nobr>, <br>, <code>)
Handle &nbsp; like space, but force NOBREAK.
<cite>, <q>, <kbd>, <samp>, <var> font control.
interfaces to <eqn>* equation setting (via MathJax) and data plotting (via GnuPlot)
Arbitrary paragraph shapes ('path'), not just simple rectangles. These would include not only rectangles with inserts/cutouts, but arbitrary shapes including arcs and splines in addition to non-orthogonal straight lines. Looking to SVG's path tag for inspiration. This determines only the extent of the base line at a given y offset; nothing is done about characters "poking" through the edge of the column, as this would require detailed examination of every glyph.
Definition lists with <dl>, <dt>, and <dd>.
CSS list-style-position: inside (currently only outside supported).
Use of (at a minimum) hyphenate-basic facility including &shy;. See also full Knuth-Liang below.
<hr> align, <sup> and <sub>.
<center>* formatting control.
<big>*, <bigger>*, <smaller>*, <small> font sizing.
CSS _expand* to call hscale() and/or condensed/expanded type added to get_font(), with possibly synfont() usage in Font Manager.
CSS text-transform, such as uppercase and lowercase flavors.
CSS em and ex sizes relative to current font size (like %). Other absolute sizes such as in, cm, mm, and perhaps px.
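
For the unit items at the end of this list, here is a minimal Perl sketch of turning a CSS-style length into points. It is not taken from PDF::Builder: the name length_to_points() is hypothetical, a bare number is assumed to mean points, and ex is approximated as half an em (a real implementation would presumably ask the font for its x-height). px is deliberately left unhandled, since its meaning in a PDF is still an open question.

```perl
use strict;
use warnings;

# Points per absolute unit (1 in = 72 pt, 1 in = 2.54 cm).
my %PT_PER_UNIT = (
    pt => 1,
    in => 72,
    cm => 72 / 2.54,
    mm => 72 / 25.4,
);

# Convert a CSS-style length to points. $font_size is the current font
# size in points. The 0.5 em value used for ex is only an assumption.
sub length_to_points {
    my ($length, $font_size) = @_;
    my ($n, $unit) = $length =~ /^\s*([-+]?(?:\d+\.?\d*|\.\d+))\s*([a-z%]*)\s*$/i
        or return;                                    # unparseable input
    $unit = lc $unit;
    return $n                       if $unit eq '';   # bare number = points
    return $n * $PT_PER_UNIT{$unit} if exists $PT_PER_UNIT{$unit};
    return $n * $font_size          if $unit eq 'em';
    return $n * $font_size * 0.5    if $unit eq 'ex';
    return $n * $font_size / 100    if $unit eq '%';
    return;                                           # unknown unit
}

printf "%s => %.2f pt\n", $_, length_to_points($_, 12)
    for qw(18 18pt 1in 2.54cm 1.5em 2ex 150%);
```

Returning nothing for an unknown unit lets the caller decide whether to warn, which fits the "flag unknown CSS" item above.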

Possibly...

<base>, <wbr>. Note that <abbr> is dynamic in HTML (on mouseover), so this might not be feasible with PDF.
<abstract>*, <article>, <aside>, <section>, etc. as predefined page areas?

Extensions to HTML and CSS...

<sl>* simple list (like <ul>, but no markers).
<sc>* (Small caps) and CSS font-variant: small-caps preprocess: around runs of lowercase put <span style="font-size: 80%; expand: 110%"> and fold to UPPER CASE. This would be after @mytext creation, inserting a series of <span> tags.
<pc>* (Petite caps) and CSS font-variant: petite-caps* like <sc>, but 1ex font-size, expand 120%.
<dc>* (Drop caps). Besides the giant single (usually) letter, we need to indent multiple lines.
<ovl>* overline (similar to underline, for completeness) using CSS text-decoration: overline.
<k>* kern text (shift left or right) with CSS _kern*, or general positioning, e.g., to form a logo such as (La)TeX through character positioning. What to do at the HTML level? x +/- % of font size, y +/- % of font size. To do effects such as notations like <sup>4</sup><sub>2</sub>He, perhaps <hkeep align="right"><sup>4</sup><sub>2</sub></hkeep>He notation, or a more general purpose <vstack> tag? Remember to keep "He" (in this example) with it as an unsplittable unit. Possibly, use of <eqn> will do the job instead.
<vfrac>* vulgar fraction m/n, using <sup>, <sub>, and kern <vfrac num="1" denom="2">. Some of these things may overlap too much with <eqn> processing (see below) to be worth doing separately.
HTML attributes to tune (force end) of something, such as early </sc> implied after the earlier of X words or the end of a line (with a complete, unhyphenated last word). Something like <sc eol="end">In The Beginning</sc> was inspired by baseball. If a complete "Beginning" cannot be fit on the line, Small Caps would end at "The".
<endc>* force early end of column here (at this y, while still filling line), e.g., to prevent a widow. Optional conditional (e.g., less than 1" of vertical space left in column). By default, forbid hyphenation, since this is at the end of a column.
<vkeep>* material to keep together vertically, such as headings and paragraph text.
leading (line-height) as a dimension instead of a ratio, convert to ratio before storing.
HTML-like tags to set various settings (e.g., leading), without their own tags.
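
The small caps preprocessing described for <sc> could look roughly like the following Perl sketch. The function name small_caps() is hypothetical, the span style is copied from the item above, and the input is assumed to be plain text content with no tags inside it (a real version would have to skip over markup).

```perl
use strict;
use warnings;

# Wrap each run of lowercase letters in a reduced-size <span> and fold it
# to upper case. Characters that are already upper case are left alone.
sub small_caps {
    my ($text) = @_;
    $text =~ s{(\p{Ll}+)}
              {'<span style="font-size: 80%; expand: 110%">' . uc($1) . '</span>'}ge;
    return $text;
}

print small_caps('In The Beginning'), "\n";
```

Petite caps would presumably differ only in the style string that is inserted.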

Release 3.028 or later?

left/right "auto" margins? <center> and CSS text-align: center may need this. Since a given line might be a composite of several sub-lines, this might best be done by writing out the full line, and then backpatching x displacements.
Knuth-Liang hyphenation. I may write a new package based on Text::Hyphen, to permit on-the-fly change of language and easy update of language files. It would be used by any Knuth-Plass code (possibly Text::KnuthPlass) for paragraph shaping. It would need a <lang> tag to mark a section with a different language with different hyphenation rules (might integrate with bidirectional flag).
<lang>* define language of a span of text, for hyphenation or audio purposes. Possibly also as a new attribute for other tags such as <span>. Need to see what HTML and CSS already define.
<hyp>*, <nohyp>* control hyphenation in a word (and remember the rules when this word is seen again) -- basically, give a dictionary entry for splitting up some word, overruling whatever Knuth-Liang would do.
Knuth-Plass paragraph shaping (with proper hyphenation). If Text::KnuthPlass is used, this will require a major update to that package. Possibly try either a modified Text::KnuthPlass library, or implement a "semi-greedy" line-filling algorithm.
HarfBuzz::Shaper for ligatures, callout of specific glyphs (not entities for swashes, alternate presentation forms, etc.), RTL and non-Western language support. <bdi> and <bdo> support. This may require a wrapper around HarfBuzz::Shaper, unless its owner is willing to extend it.
<nolig>* and </nolig>* forbid ligatures in this range of text.
<lig gid='nnn'>* and </lig>* replace character(s) in child text by a ligature, if available. Highly font-dependent.
<alt gid='nnn'>* and </alt>* replace character(s) by alternate glyph such as a swash or alternate presentation form. Highly font-dependent.
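
To illustrate the "remember rules when this word is seen again" idea behind <hyp>, here is a small Perl sketch of an exception dictionary that overrides a general hyphenator. Everything in it is an assumption for illustration: the names remember_word() and break_points(), the hyphen-marked entry format, and the fallback callback that stands in for whatever Knuth-Liang code ends up being used.

```perl
use strict;
use warnings;

my %exceptions;    # lower-cased word => list of allowed break offsets

# Record an entry such as 'ty-pog-ra-phy': each '-' marks a permitted break.
sub remember_word {
    my ($entry) = @_;
    my @parts = split /-/, lc $entry;
    my ($offset, @breaks) = (0);
    for my $part (@parts[0 .. $#parts - 1]) {
        $offset += length $part;
        push @breaks, $offset;
    }
    $exceptions{ join '', @parts } = \@breaks;
    return;
}

# Return break offsets for a word. A stored exception wins; otherwise the
# caller-supplied fallback (if any) is asked.
sub break_points {
    my ($word, $fallback) = @_;
    my $known = $exceptions{ lc $word };
    return @$known if $known;
    return $fallback ? $fallback->($word) : ();
}

remember_word('ty-pog-ra-phy');
print join(',', break_points('Typography')), "\n";
```

An entry with no hyphens at all stores an empty list, which would serve as the <nohyp> case: the word is known and may not be split.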

This list is not set in stone, and there are no guarantees that I'll ever get to all of them. As I said, help with this would be much appreciated. Also, further features that you would find useful could be suggested.

Thanks Phil, I think if I had to pick one then this would be it:

<pre>, <nobr>, <br>, <center>* formatting control

I should probably list here what's in the current (3.025) release for Markdown (md1) and HTML (html) support, so you don't have to go searching through the code:

Current HTML/Markdown supported

<i> and <em> tags (Markdown _, *) as italic font style
<b> and <strong> tags (Markdown **) as bold font weight
<p> tag (Markdown empty line) as a paragraph
<font face="font-family" color="color" size="font-size"> as selecting face, color and size
<span> needs style= attribute with CSS to do anything useful
<ul> tag (Markdown -) unordered (bulleted) list. type to override marker supported
<ol> tag (Markdown 1.) ordered (numbered) list. start and type supported.
<li> tag list item. value to override ordered list counter, and type to override marker type supported
<a href="URL"> tag (Markdown []() ) anchor/link, web page URL or this document target #p[-x-y[-z]]
<h1> through <h6> tags (Markdown # through ######) headings
<hr width="length" size="length"> tag (Markdown ---) horizontal rule. currently no align property (left only)
<s>, <strike>, <del> tags (Markdown ~~) text line-through
<u>, <ins> tags text underline
<blockquote> tag (Markdown >) indented both sides block of smaller text

Numbered (decimal and hexadecimal) entities are supported, as well as named entities (e.g., &mdash;). Both ordered and unordered lists get a "gutter" (for the marker) marker_width points wide, so it's consistent over the call.
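
As an illustration of what handling numbered entities involves (this is a sketch, not the code column() actually uses), decimal and hexadecimal references can be decoded in a single pass in Perl. The function name decode_numeric_entities() is hypothetical. Named entities need a lookup table, for which the CPAN module HTML::Entities is one existing option.

```perl
use strict;
use warnings;

# Replace decimal (&#233;) and hexadecimal (&#xE9;) character references
# with the characters they name. One pass only, so the output of one
# replacement is never decoded a second time.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/chr(defined($1) ? hex($1) : $2)/ge;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print decode_numeric_entities('caf&#233; &#x263A; &#65;&#x42;'), "\n";
```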

Current CSS supported

Note that the default CSS applies to Markdown, unless you give a style => entry to the column() call to revise the CSS.

In HTML, you can define <style> tags, but caution: these are pulled out into a global style block (cumulative and global, as though they had all been given in the <head>), applied after the CSS property defaults are defined and then any column() global style => 'CSS list' has been applied.

CSS selectors are very primitive: a simple tag name (including body), such as ol; a class name such as .error; or an ID such as #myID. There are no hierarchies or combinations supported (e.g., nothing like p.abstract or li > p). The (decreasing) order of precedence follows a browser's: in a style= attribute, as a tag attribute (which may have a different name from the CSS's), an ID, a class, or a tag name. Comments (/* and */) are NOT currently supported in CSS.
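
The precedence order just described can be pictured with a short Perl sketch. The data layout and the name resolve_property() are invented for illustration and do not reflect how column() stores its CSS internally. Tag attributes are assumed to have already been renamed to their CSS property names.

```perl
use strict;
use warnings;

# Look up one CSS property for an element, trying sources in decreasing
# order of precedence: style= attribute, tag attribute, #id, .class, tag.
sub resolve_property {
    my ($prop, $elem, $rules) = @_;
    my @selectors = ( $elem->{tag} );
    unshift @selectors, ".$elem->{class}" if defined $elem->{class};
    unshift @selectors, "#$elem->{id}"    if defined $elem->{id};
    my @sources = ( $elem->{style}, $elem->{attr}, map { $rules->{$_} } @selectors );
    for my $source (grep { defined } @sources) {
        return $source->{$prop} if defined $source->{$prop};
    }
    return;    # not set anywhere: the caller falls back to a default
}

my %rules = (
    p        => { color => 'black', 'font-size' => '12' },
    '.error' => { color => 'red' },
    '#myID'  => { 'font-weight' => 'bold' },
);
my %elem = (
    tag => 'p', class => 'error', id => 'myID',
    style => { 'font-size' => '14' },
);
print "$_: ", resolve_property($_, \%elem, \%rules), "\n"
    for qw(color font-size font-weight);
```

Because each selector is a single key, the lookup stays a plain hash access, which is about all that is needed while hierarchies and combinations are unsupported.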

color (foreground color, in standard PDF::Builder formats)
display (inline or block)
font-family (as defined to Font Manager, e.g., Times)
font-size (n points, npt, n% of current font size. more units in future)
font-style (normal or italic)
font-weight (normal or bold)
height (n points or npt, thickness/size of horizontal rule ONLY)
list-style-position (outside or inside, currently only outside supported)
list-style-type (marker description, per standard CSS, plus "box" for unordered list)
margin-top/right/bottom/left (per standard CSS. combined margin in the future)
_marker-before (extension: text to insert before ordered list marker)
_marker-after (extension: text to insert after ordered list marker)
text-decoration (per standard CSS)
text-height (change leading, ratio of baseline-to-baseline to font size. future: set as a length or % of font size)
text-indent (paragraph etc. indentation, n points, npt, n% of font size)
width (n points or npt, width of horizontal rule ONLY)

Global Settings

There are a number of global settings either required or available for tuning the behavior of column(). In the parameter list you can set

font_size = default initial font size (points) to be used, but can be overridden by CSS or <font size>. Initially 12.
leading = default leading (text-height) ratio. Initially 1.125.
marker_width = points, set width of gutter where a list's marker goes. Initially 2 * <font size>.
para = list of indentation (text-indent) and inter-paragraph spacing (margin-top), both in points. These are the defaults for all formatting modes, unless overridden by a style => entry. Initially [ <font size>, 0 ].
color = initial text and graphics color setting, in standard PDF::Builder formats. Initially 'black'.
style = CSS declarations to be applied after CSS properties initialization and before any global <style> tags. Initially ''.
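
To make the interplay of these defaults concrete, here is a small Perl sketch that fills in the initial values listed above for anything the caller leaves out. column_settings() is a hypothetical helper, not part of PDF::Builder, and it assumes that the "<font size>" in the marker_width and para defaults means the font_size setting itself.

```perl
use strict;
use warnings;

# Fill in the documented initial values for any setting the caller omits.
sub column_settings {
    my (%opts) = @_;
    my $font_size = $opts{font_size} // 12;
    return (
        font_size    => $font_size,
        leading      => 1.125,
        marker_width => 2 * $font_size,
        para         => [ $font_size, 0 ],   # [ text-indent, margin-top ]
        color        => 'black',
        style        => '',
        %opts,                               # caller's values win
    );
}

my %s = column_settings(font_size => 10, leading => 1.2);
printf "leading %.3f, marker_width %g, para [%g, %g]\n",
    $s{leading}, $s{marker_width}, @{ $s{para} };
```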

That's all I can think of at the moment. Remember that the Markdown converter (Text::Markdown) may produce HTML that this system cannot yet handle. And of course, there's plenty of HTML and CSS that cannot now be (and may never be) handled, but you can certainly request support.

My intent is to keep this list synchronized with Docs.pm (POD) and the Content::Text POD.

Some more ideas I'm kicking around...

A 'pod' markup format for Perl POD documentation (another flavor of Markdown). This might use something like pod2html or the equivalent CPAN library to convert the POD in a Perl file to HTML, and then process it in the usual way. Problems are that links within the HTML will be to HTML files, and need to be converted to PDF file destinations, and a user may want to add navigation links (see docs/buildDoc.pl) before conversion. So, it may be better to convert your POD to HTML externally, and just treat the resulting HTML like any other.
Man-page and/or troff flavor input. The two main problems are that there does not seem to be a Perl library to convert man/troff to HTML (i.e., I would have to write my own), and I can't see going through the effort to also support eqn, tbl, grap, pic, and whatever other specialized input. Any man/troff processing would have to forgo those specialized processors. How much demand would there be for man/troff capability? There would have to be quite a bit to make it worthwhile. There are apparently some external utilities to do this conversion (to HTML), so that may be a better bet.
Currently, links are either to a web page or to a (manually entered) page/x/y/zoom within this document. I'm thinking about ways to recognize links to external PDF documents. It might be as simple as looking at a file extension (before any #id label) to see if it's ".pdf".
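
The file-extension test suggested in the last item might be sketched in Perl as follows. classify_link() and its three return values are hypothetical, and the sketch ignores query strings and any check that the fragment really has the #p[-x-y[-z]] form.

```perl
use strict;
use warnings;

# Classify an href by its shape alone: a bare fragment is a target in this
# document, a path ending in .pdf (before any #label) is an external PDF,
# and anything else is treated as a web page.
sub classify_link {
    my ($href) = @_;
    my ($target, $fragment) = split /#/, $href, 2;
    $target //= '';
    return 'internal' if $target eq '' && defined $fragment;
    return 'pdf'      if $target =~ /\.pdf$/i;
    return 'web';
}

print "$_ => ", classify_link($_), "\n"
    for '#12-100-700', 'manual.pdf#intro', 'https://example.com/a.html';
```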

Within a target document, currently links can go only to a manually specified page/x/y. It would be nice to be able to go to an id of some sort, just as an HTML link can go to an #id URL. This will require resolving an id to a page and x/y, which means at least two passes, at least if the page number needs to be part of the link text. This should be considered as part of a more general TOC/index/cross-reference/footnote system. Some Markdown flavors create a rather long and clumsy id for each heading -- I'm not sure if there's a way to specify an id= someplace in a Markdown document like you can in HTML. I'm not sure I even want to think about cross-document targets (might mean generating all documents in one go!).

PDF::Builder also supports Bookmarks/Table of Contents/Outlines (varies by Reader) which should be smoothly incorporated into this (as well as Page Labels and on-paper matching page numbering).

Markup could use some (non-standard) HTML tags to

define book, chapter, section, subsection, etc. partitioning. A chapter might need tags to skip to next page top, skip to right-hand page (usually), etc. A section might need tags to use a dropped cap and small-cap some part of the first sentence. The issue comes up whether some of these complicated actions should be a sequence of tags, or some sort of subroutine. For example, for a chapter start, a global chapter counter needs to be maintained, and if a page is completely skipped (blank), should any page number or header be written on it (things a user would like to configure). A skip to a right-hand page assumes you are doing book-style left/right pages.
define page sections such as asides, margin notes, and column insets. These could all be mini-columns, but would have to be output (or at least, sized) first in order to bend the main column(s) around them.
define footnotes as physical page areas (automatically sized), if you intend to print in book style, or perhaps some sort of link if the primary use is online.
handle left- and right-hand pages (as in a bound book), with "inside" and "outside" location of headings, chapter titles, etc. rather than hard coding "left" and "right".
adequately "keep" text together and page-break at optimal locations. This includes one or more headings being included along with the first two lines of the paragraph in orphan-prevention. It may require virtual page output or the ability to move line(s) to another column or even the next page, or at least, be able to erase output already written and push it back onto the input queue. This can be quite complicated if you start a new page and only then realize that you have a widow.

I doubt that PDF::Builder's column markup will ever be able to handle a full HTML input, à la "Prince", nor Javascript, but if it can handle a large subset of HTML, we could end up with a decent general-purpose HTML-to-PDF converter. Then, with additional extensions to HTML, we can do a decent book layout. Some actions might be simplified by allowing Perl routines to be defined, rather than defining complicated tags with many options. Also flag any unsupported HTML tag or CSS that it encounters, to alert users that something wasn't processed.

Looking through the troff manual, I get the idea for the following CSS:

_heading_prefix to set additional text to be prepended to heading string
_heading_suffix to set additional text to be appended to heading string
_suppress_nl to suppress the next newline (display block level changed to inline) and allow embedded text, such as headings in a "run in" or "let in" manner

Example:

<h5>This is a level 5 heading</h5>
<p>This is the paragraph it pertains to...</p>

would produce

This is a level 5 heading: This is the paragraph it pertains to...

This assumes that the CSS for <h5> includes _heading_suffix: ': ' and _suppress_nl, and that the heading is bold at the same font-size as the paragraph. Note that an extra space will be automatically added.

A more general _content_before and _content_after, or even CSS content with some form of before and after (similar function to the ::before and ::after pseudo-elements in CSS) might be better. This could be used with marker before and after (e.g., put parentheses around an ordered list counter) and the marker-specific _marker-before and _marker-after could be phased out (deprecated). However, this would require separate handling for the marker and list item text, so it may be better to keep explicit "marker" versions, unless there is a clean way to group "marker" CSS together and keep it separate from what's applied to the list item text. Caution: do NOT inherit any sort of _content_before and _content_after, as you want to strictly control what it applies to.

Eventually, heading text/level and any explicit id's will be collected for generating a Table of Contents, cross-references, etc.

In the same manner as specifying that the next line-end be suppressed, we could specify the suppression of any paragraph indentation. This way, a section heading (not run-in in this case) could cause the first paragraph in the section to not be indented, rather than having to do explicit markup for this.

After further perusal of the troff manual, I see a number of items that could be useful in typesetting. While troff is a very powerful system, including facilities for tables, graphs, equations, picture drawing, etc., I don't see it being used very much "in the wild". Therefore, I do not plan at this time to implement troff input (nor its close cousin, man page input, although that is somewhat more widely used). There does not seem to be any Perl module to translate troff into something else, such as HTML, although groff with HTML output (grohtml) may be a viable alternative. If some party would find troff input very useful for PDF::Builder, they are welcome to sponsor such work, or build a good troff-to-HTML translator package that could be used here (possibly with SVG output for pic, eqn, grap, etc.). Note that SVG support (as well as eqn-to-SVG) is already in plan for PDF::Builder.

That said, here are some additional features for column(), many inspired by troff, that could be implemented:

Full alignment capability for any block-level object: left, centered, right; inside and outside; left or right, with +/- offset. troff has a "display" (.DS) feature to treat an object or block as unsplittable and locatable on the page (centered, etc.). For example, you can have short, left-justified text within a block which is itself centered on the page. Perhaps some sort of <block> HTML extension, or some manner of "subcolumn" within the column() system? Implementing a <div> might be even better, with a flag to indicate whether it can be split across pages (default "no").
"inside" and "outside" +/- offset plus support for even/odd pages (binding width on right/left margin). Make sure that bidirectional text (RTL) works -- are "left" and "right" real locations or virtual (depending on LTR or RTL)? It would be preferable to stay compatible with HTML and CSS standards.
Heading, paragraph, and possibly line numbering, with various numeric formats (including fixed length decimal integers with leading 0's). One difficulty with this is that column() input may be applied in short segments, requiring the author to keep track of the starting number for each type (unless we keep global counters outside of column()).
Footnotes (and accumulated endnotes for end-of-chapter) of column or page width. Perhaps start writing a footnote as soon as it is encountered, reducing the column height as needed, and moving already-written footnotes up the page (increase y) to make room for the new one. If the footnotes stack top (column bottom) bumps into the callout line, a footnote can be split to the next column, although if there is not enough vertical space to even start the footnote, that is a difficult problem.
Sometimes there is a need to invoke code within text, such as for conversions. E.g., "Slide Mountain is elevation(4218) tall." would output "Slide Mountain is 4218 ft (1286 m) tall." The best way to do this is probably to embed Perl code within the source string, keeping all the user-supplied math, etc. outside of column(): "Slide Mountain is ".elevation(4218)." tall.", which converts and rounds/formats appropriately.
troff permits the definition of strings, as macros, but it may be cleaner just to embed or concatenate a Perl $var in the source.
Would vertical and horizontal "skips" (of fixed distance) be useful for anything? See also tabbing. We already have top and bottom margins, and some left and right margins.
When white-space is set appropriately (to honor spacing), tabs can be handled. The author will want to be able to specify the tab stops from the left edge (min x) of the column. If using constant width text, character count will work, but for proportional fonts, absolute distance measures (in, cm, mm, pc (Pica, 1/6 in), pt (Big or PostScript point, 1/72 in), pp (Printer's Point, < 1/72 in, although there is no worldwide standard size), el (Elite, 1/12 in)) and more (em, en, ex) should be usable. Of course, such units should be usable wherever distances are given. Note that tabbing is often better done with a table, if the intent is to create subcolumns of text. For tables, add an "alignment" setting so that items within a cell can be aligned on, e.g., the decimal point. See CSS tab-size, and default tabstops of every 8 characters.
Turn "fill mode" on and off. Does this need to be separate from "white-space"?
Ability to give "unpaddable" spaces, whose length cannot be changed when justifying a line. When inline equations are supported, this may not be that useful (most use of fixed-width spaces would be within equations). Also make it easy to specify unbreakable spaces, such as within a name, and perhaps various specialty spaces (thinsp, quadspace, etc.).
Allow combined leading and font size, e.g., "font-size: 14/12" (12 pt text with 14 pt leading).
Make sure that unordered (bulleted) lists can easily specify a custom marker for all items, or a single item.
Allow hanging indent on a list, where the first line is at the normal point, and following lines are further indented.
Allow "glosses" (explanatory text) between lines or in the margin. "Ruby" for CJK languages may be something similar (I think it's some sort of pronunciation guide text).
Allow paragraph alignment where adjacent columns (e.g., text and one or more translations) align their paragraph starts, leaving extra vertical gaps in one or the other column. One way to do this would be to do one paragraph at a time, in separate columns at a time, and pick the lowest y to start the next paragraph across all the columns.
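
The elevation() example in the list above can be written as ordinary Perl outside of column(), so that only the finished string is handed over as markup. The function body below is just one plausible way to do the conversion and rounding.

```perl
use strict;
use warnings;

# Format an elevation given in feet, adding the rounded metric equivalent.
sub elevation {
    my ($feet) = @_;
    my $meters = sprintf '%.0f', $feet * 0.3048;   # 1 ft = 0.3048 m exactly
    return "$feet ft ($meters m)";
}

my $source = 'Slide Mountain is ' . elevation(4218) . ' tall.';
print "$source\n";
```

Keeping the arithmetic in the caller's code means column() never needs an expression evaluator of its own.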

CSS "white-space" with settings:

normal collapse whitespace, including line-ends, into single blanks; implies fill mode
pre preserve all whitespace, honor line-ends
nowrap collapse whitespace, including line-ends, into single blanks; implies fill mode; break lines only on <br>
pre-line collapse whitespace (except line-ends) to single blanks; auto-break as necessary to avoid overflow; honor line-ends
pre-wrap like pre, but add auto-break to avoid overflow
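
The five settings above boil down to three yes/no choices, which the following Perl sketch records in a table and applies to a source string. The names %WHITE_SPACE and apply_white_space() are hypothetical, and actual wrapping to the column width is left out, as it would be a later step in line filling.

```perl
use strict;
use warnings;

# For each white-space value: collapse runs of blanks? honor line-ends in
# the source? allow automatic line breaking?
my %WHITE_SPACE = (
    normal     => { collapse => 1, keep_nl => 0, wrap => 1 },
    pre        => { collapse => 0, keep_nl => 1, wrap => 0 },
    nowrap     => { collapse => 1, keep_nl => 0, wrap => 0 },
    'pre-line' => { collapse => 1, keep_nl => 1, wrap => 1 },
    'pre-wrap' => { collapse => 0, keep_nl => 1, wrap => 1 },
);

# Return the source lines after applying the collapsing rules.
sub apply_white_space {
    my ($mode, $text) = @_;
    my $rule = $WHITE_SPACE{$mode} or die "unknown white-space value '$mode'";
    $text =~ s/\r\n?/\n/g;                        # normalize line-ends
    my @lines = $rule->{keep_nl} ? split(/\n/, $text, -1) : ($text);
    if ($rule->{collapse}) {
        for (@lines) {
            s/\s+/ /g;                            # any run (incl. \n) -> one blank
            s/^ | $//g;                           # trim the ends
        }
    }
    return @lines;
}

print "[$_]\n" for apply_white_space('pre-line', "a  b\n  c");
```

If "fill mode" turns out to be the same thing as the wrap flag here, it may not need a separate setting.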

A library to consider for further use: Text::Markup. Note that it does not natively translate many other formats, but appears to be a front-end for a lot of other translator packages. If nothing else, it is a pointer to useful other packages.

Here are the formats currently supported:

Asciidoc
BBcode (uses Parse::BBCode)
Creole (uses Text::WikiCreole)
HTML
Markdown (uses Text::Markdown)
MultiMarkdown (uses Text::MultiMarkdown)
MediaWiki (uses Text::MediawikiFormat)
Pod (uses Pod::Simple::XHTML)
reStructuredText (uses rst2html_lenient.py, requires Python)
Textile (uses Text::Textile)
Trac (uses Text::Trac)

The only major formats missing from the list are troff and the closely related nroff/man, and of course, LaTeX. As I said earlier, groff may be suitable to convert the former to HTML. I don't see any point in converting (La)TeX to HTML, as excellent PDF output already exists for this family of markup (there are also good HTML output converters, should you want to embed some LaTeX-source documentation within a larger PDF). Pod conversion makes use of Pod::Simple::XHTML, which is already recommended for use in building HTML documentation for PDF::Builder (called by docs/buildDoc.pl).

At this point, I can't see it being worthwhile to directly support any of these, or Text::Markup itself, as few would make good document inputs. You might want to embed the documentation for some program (as Pod, man, etc.), so there's certainly nothing to prevent you from running a converter (e.g., Text::Markup) externally and then dealing with the resulting HTML, which column() can import. Don't forget to do something about internal links, which may be expecting to link to HTML documents, if those target documents have also been converted to PDF.

In addition to <eqn> (display and inline), consider a form of those tailored for chemistry (<chem> and <dchem>). This would use the MathJax markup after modifying the user input source to use Roman/normal weight for text. It would need to be an improvement over explicitly using super- and sub-scripts to be worth the effort.

PhilterPaper / Perl-PDF-Builder