PhilterPaper / Perl-PDF-Builder

Extended version of the popular PDF::API2 Perl-based PDF library for creating, reading, and modifying PDF documents
https://www.catskilltech.com/FreeSW/product/PDF%2DBuilder/title/PDF%3A%3ABuilder/freeSW_full

Further enhancements to markup (column() method) #195

Open PhilterPaper opened 1 year ago

PhilterPaper commented 1 year ago

The 3.025 release, containing column(), is far from the end of the job! I don't know how quickly I'll be able to get to the things on this list, but I'd like to do them all. PRs to take care of individual items would be appreciated, of course (and sponsorship of new features would be even better!). Most of these items can be implemented in any order, so don't be afraid to request priority for some things, or to help out.

Items flagged with a star * are not official HTML or CSS, but would be extensions.

Perhaps in the 3.027 release?

(I had hoped to get at least some of these in 3.026, but they didn't make it due to fixing other stuff. Hopefully 3.027.)

Possibly...

Extensions to HTML and CSS...

Release 3.028 or later?

This list is not set in stone, and there are no guarantees that I'll ever get to all of them. As I said, help with this would be much appreciated. Suggestions for further features that you would find useful are also welcome.

abeverley commented 1 year ago

Thanks Phil, I think if I had to pick one then this would be it:

<pre>, <nobr>, <br>, <center>* formatting control

PhilterPaper commented 1 year ago

I should probably list here what's in the current (3.025) release for Markdown (md1) and HTML (html) support, so you don't have to go searching through the code:

Current HTML/Markdown supported

Numbered (decimal and hexadecimal) entities are supported, as well as named entities (e.g., &mdash;). Both lists get a "gutter" (for the marker) that is marker_width points wide, so it's consistent over the call.
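
To make the three entity forms concrete, here is a minimal sketch using Python's standard html module. It only illustrates that decimal, hexadecimal, and named spellings can all denote the same character; it is not PDF::Builder's own entity decoder, and the sample strings are mine.

```python
import html

# Three spellings of the same character (an em dash, U+2014):
# decimal numbered, hexadecimal numbered, and named.
samples = ["&#8212;", "&#x2014;", "&mdash;"]

# html.unescape() decodes all three forms to the same code point.
decoded = [html.unescape(s) for s in samples]
```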

Current CSS supported

Note that the default CSS applies to Markdown, unless you give a style => entry to the column() call to revise the CSS.

In HTML, you can define <style> tags, but be cautious: these are pulled out into a single global style block (cumulative, as though they had all been given in the <head>). That block is applied after the CSS property defaults are defined and after any column() global style => 'CSS list' has been applied.
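
The layering described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the actual PDF::Builder code: the function name layer_styles and the data layout (selector mapped to a property dictionary) are assumptions of mine, as is the reading that a layer applied later overrides an earlier one.

```python
def layer_styles(defaults, column_style, style_blocks):
    """Merge selector -> {property: value} maps; later layers win (assumed)."""
    # Start from a copy of the built-in property defaults.
    merged = {sel: dict(props) for sel, props in defaults.items()}
    # Then the column() style => entry, then each <style> block in document order.
    for layer in [column_style, *style_blocks]:
        for selector, props in layer.items():
            merged.setdefault(selector, {}).update(props)
    return merged

defaults = {"body": {"font-size": "12", "color": "black"}}
column_style = {"body": {"font-size": "10"}}
style_blocks = [
    {"p": {"color": "blue"}},
    {"body": {"color": "gray"}, "p": {"color": "red"}},
]
result = layer_styles(defaults, column_style, style_blocks)
```

Because all <style> blocks are treated as one cumulative global block, a rule given late in the document affects matching text earlier in the document too, which is the reason for the caution.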

CSS selectors are very primitive: a simple tag name (including body), such as ol; a class name such as .error; or an ID such as #myID. There are no hierarchies or combinations supported (e.g., nothing like p.abstract or li > p). The (decreasing) order of precedence follows a browser's: a style= attribute, a tag attribute (which may have a different name from the CSS property's), an ID, a class, or a tag name. Comments (/* and */) are NOT currently supported in CSS.
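
A rough sketch of that precedence order, in Python for illustration only. The names resolve, parse_inline, and ATTR_TO_CSS are hypothetical, and the align to text-align mapping is just an example of a tag attribute whose name differs from the CSS property; none of this is taken from the PDF::Builder source.

```python
# Hypothetical map from tag attribute name to CSS property name.
ATTR_TO_CSS = {"align": "text-align"}

def parse_inline(style_text):
    """Turn 'color: red; font-size: 12' into a property dictionary."""
    props = {}
    for decl in style_text.split(";"):
        if ":" in decl:
            name, value = decl.split(":", 1)
            props[name.strip()] = value.strip()
    return props

def resolve(prop, tag, attrs, stylesheet):
    """Check sources from highest to lowest precedence; first hit wins."""
    sources = [parse_inline(attrs.get("style", ""))]            # style= attribute
    sources.append({css: attrs[a] for a, css in ATTR_TO_CSS.items() if a in attrs})
    if "id" in attrs:
        sources.append(stylesheet.get("#" + attrs["id"], {}))   # #id selector
    for cls in attrs.get("class", "").split():
        sources.append(stylesheet.get("." + cls, {}))           # .class selector
    sources.append(stylesheet.get(tag, {}))                     # tag name selector
    for source in sources:
        if prop in source:
            return source[prop]
    return None

stylesheet = {
    "p": {"color": "black", "text-align": "left"},
    ".error": {"color": "red"},
    "#myID": {"color": "green"},
}
by_class = resolve("color", "p", {"class": "error"}, stylesheet)
by_id = resolve("color", "p", {"class": "error", "id": "myID"}, stylesheet)
by_style = resolve("color", "p", {"id": "myID", "style": "color: blue"}, stylesheet)
by_attr = resolve("text-align", "p", {"align": "center"}, stylesheet)
by_tag = resolve("text-align", "p", {}, stylesheet)
```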

Global Settings

There are a number of global settings either required or available for tuning the behavior of column(). In the parameter list you can set

That's all I can think of at the moment. Remember that the Markdown converter (Text::Markdown) may produce HTML that this system cannot yet handle. And of course, there's plenty of HTML and CSS that cannot now (and may never) be handled, but you can certainly request support.

My intent is to keep this list synchronized with Docs.pm (POD) and the Content::Text POD.

PhilterPaper commented 1 year ago

Some more ideas I'm kicking around...

  1. A 'pod' markup format for Perl POD documentation (another flavor of Markdown). This might use something like pod2html or the equivalent CPAN library to convert the POD in a Perl file to HTML, and then process it in the usual way. Problems are that links within the HTML will be to HTML files, and need to be converted to PDF file destinations, and a user may want to add navigation links (see docs/buildDoc.pl) before conversion. So, it may be better to convert your POD to HTML externally, and just treat the resulting HTML like any other.
  2. Man-page and/or troff flavor input. The two main problems are that there does not seem to be a Perl library to convert man/troff to HTML (i.e., I would have to write my own), and I can't see going through the effort to also support eqn, tbl, grap, pic, and whatever other specialized input. Any man/troff processing would have to forgo those specialized processors. How much demand would there be for man/troff capability? There would have to be quite a bit to make it worthwhile. There are apparently some external utilities to do this conversion (to HTML), so that may be a better bet.
  3. Currently, links are either to a web page or to a (manually entered) page/x/y/zoom within this document. I'm thinking about ways to recognize links to external PDF documents. It might be as simple as looking at a file extension (before any #id label) to see if it's ".pdf".
  4. Within a target document, currently links can go only to a manually specified page/x/y. It would be nice to be able to go to an id of some sort, just as an HTML link can go to an #id URL. This will require resolving an id to a page and x/y, which means at least two passes, at least if the page number needs to be part of the link text. This should be considered as part of a more general TOC/index/cross-reference/footnote system. Some Markdown flavors create a rather long and clumsy id for each heading -- I'm not sure if there's a way to specify an id= someplace in a Markdown document like you can in HTML. I'm not sure I even want to think about cross-document targets (might mean generating all documents in one go!).
  5. PDF::Builder also supports Bookmarks/Table of Contents/Outlines (varies by Reader) which should be smoothly incorporated into this (as well as Page Labels and on-paper matching page numbering).
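
For item 3, the extension check could be as small as the following sketch (Python, for illustration only). The function name classify_link and its return labels are hypothetical, and it deliberately does not try to recognize the existing in-document page/x/y/zoom form, since that syntax is not shown here.

```python
from urllib.parse import urlsplit

def classify_link(href):
    """Label a link target by looking at the path before any #id label."""
    parts = urlsplit(href)  # separates the path from the #fragment and ?query
    if parts.path.lower().endswith(".pdf"):
        return "external-pdf"
    if parts.scheme in ("http", "https"):
        return "web"
    return "other"

kinds = [
    classify_link("manual.pdf#chapter2"),
    classify_link("https://example.com/docs/Guide.PDF"),
    classify_link("https://example.com/page.html#top"),
    classify_link("#intro"),
]
```

Note that a web URL ending in ".pdf" is also labelled as a PDF here; whether that should be treated as a web link or as an external PDF target is a design decision still to be made.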

PhilterPaper commented 1 year ago

Markup could use some (non-standard) HTML tags to

  1. define book, chapter, section, subsection, etc. partitioning. A chapter might need tags to skip to next page top, skip to right-hand page (usually), etc. A section might need tags to use a dropped cap and small-cap some part of the first sentence. The issue comes up whether some of these complicated actions should be a sequence of tags, or some sort of subroutine. For example, for a chapter start, a global chapter counter needs to be maintained, and if a page is completely skipped (blank), should any page number or header be written on it (things a user would like to configure). A skip to a right-hand page assumes you are doing book-style left/right pages.
  2. define page sections such as asides, margin notes, and column insets. These could all be mini-columns, but would have to be output (or at least, sized) first in order to bend the main column(s) around them.
  3. define footnotes as physical page areas (automatically sized), if you intend to print in book style, or perhaps some sort of link if the primary use is online.
  4. handle left- and right-hand pages (as in a bound book), with "inside" and "outside" location of headings, chapter titles, etc. rather than hard coding "left" and "right".
  5. adequately "keep" text together and page-break at optimal locations. This includes one or more headings being included along with the first two lines of the paragraph in orphan-prevention. It may require virtual page output or the ability to move line(s) to another column or even the next page, or at least, be able to erase output already written and push it back onto the input queue. This can be quite complicated if you start a new page and only then realize that you have a widow.
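
As a minimal sketch of the "keep" decision in item 5 (Python, for illustration only): given a paragraph's line count and the room left in the current column, decide how many lines to place before breaking. The function name lines_to_place and the default minimums of two lines are assumptions, and a real implementation would also have to keep headings with the following text and deal with output already written.

```python
def lines_to_place(n_lines, room, orphans=2, widows=2):
    """Return how many lines of a paragraph go in the current column.

    orphans: minimum lines kept together at the bottom of this column.
    widows:  minimum lines carried over to the top of the next column.
    A result of 0 means the whole paragraph moves to the next column.
    """
    if n_lines <= room:
        return n_lines                      # everything fits, no break needed
    first = min(room, n_lines - widows)     # leave enough lines for the next column
    if first < orphans:
        return 0                            # too few lines would stay behind
    return first

fits = lines_to_place(6, 10)
split = lines_to_place(5, 4)
pushed = lines_to_place(3, 2)
```

With five lines and room for four, only three are placed so that two carry over; with three lines and room for two, no acceptable split exists and the paragraph is pushed to the next column.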

I doubt that PDF::Builder's column markup will ever be able to handle a full HTML input, à la "Prince", nor JavaScript, but if it can handle a large subset of HTML, we could end up with a decent general-purpose HTML-to-PDF converter. Then, with additional extensions to HTML, we can do a decent book layout. Some actions might be simplified by allowing Perl routines to be defined, rather than defining complicated tags with many options. It should also flag any unsupported HTML tag or CSS that it encounters, to alert users that something wasn't processed.

PhilterPaper commented 1 year ago

Looking through the troff manual, I get the idea for the following CSS:

  1. _heading_prefix to set additional text to be prepended to heading string
  2. _heading_suffix to set additional text to be appended to heading string
  3. _suppress_nl to suppress the next newline (display block level changed to inline) and allow embedded text, such as headings in a "run in" or "let in" manner

Example:

<h5>This is a level 5 heading</h5>
<p>This is the paragraph it pertains to...</p>

would produce

This is a level 5 heading: This is the paragraph it pertains to...

This assumes the CSS for <h5> includes both _heading_suffix: ': ' and _suppress_nl, and that the heading is bold at the same font-size. Note that an extra space will be automatically added.

A more general _content_before and _content_after, or even CSS content with some form of before and after (similar function to the ::before and ::after pseudo-elements in CSS) might be better. This could be used with marker before and after (e.g., put parentheses around an ordered list counter) and the marker-specific _marker-before and _marker-after could be phased out (deprecated). However, this would require separate handling for the marker and list item text, so it may be better to keep explicit "marker" versions, unless there is a clean way to group "marker" CSS together and keep it separate from what's applied to the list item text. Caution: do NOT inherit any sort of _content_before and _content_after, as you want to strictly control what it applies to.

Eventually, heading text/level and any explicit ids will be collected for generating a Table of Contents, cross-references, etc.
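
The collection step might look something like this sketch, written in Python with the standard html.parser module purely for illustration. The class name HeadingCollector and the record layout are hypothetical; it covers only the first pass (gathering level, id, and text), not the later resolution of each id to a page and x/y.

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect level, explicit id, and text for each <h1>..<h6>."""
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None   # heading being read, or None outside a heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._current = {"level": int(tag[1]),
                             "id": dict(attrs).get("id"),
                             "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data   # includes text inside <i>, <b>, etc.

    def handle_endtag(self, tag):
        if self._current is not None and tag in self.HEADINGS:
            self._current["text"] = self._current["text"].strip()
            self.headings.append(self._current)
            self._current = None

def collect_headings(markup):
    collector = HeadingCollector()
    collector.feed(markup)
    collector.close()
    return collector.headings

sample = '<h1 id="intro">Introduction</h1><p>Text</p><h2>Getting <i>started</i></h2>'
toc = collect_headings(sample)
```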

In the same manner as specifying that the next line-end be suppressed, we could specify the suppression of any paragraph indentation. This way, a section heading (not run-in in this case) could cause the first paragraph in the section to not be indented, rather than requiring explicit markup for this.
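
A small sketch of that rule (Python, illustration only): walk the sequence of block-level tags and give a paragraph no first-line indent when it directly follows a heading. The function name paragraph_indents and the indent value are hypothetical.

```python
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

def paragraph_indents(block_tags, indent=20.0):
    """Return the first-line indent for each <p>, in document order."""
    indents = []
    previous = None
    for tag in block_tags:
        if tag == "p":
            # The first paragraph after a heading is not indented.
            indents.append(0.0 if previous in HEADINGS else indent)
        previous = tag
    return indents

indents = paragraph_indents(["h2", "p", "p", "h3", "p"])
```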

PhilterPaper commented 1 year ago

After further perusal of the troff manual, I see a number of items that could be useful in typesetting. While troff is a very powerful system, including facilities for tables, graphs, equations, picture drawing, etc., I don't see it being used very much "in the wild". Therefore, I do not plan at this time to implement troff input (nor its close cousin, man page input, although that is somewhat more widely used). There does not seem to be any Perl module to translate troff into something else, such as HTML, although groff with HTML output (grohtml) may be a viable alternative. If some party would find troff input very useful for PDF::Builder, they are welcome to sponsor such work, or build a good troff-to-HTML translator package that could be used here (possibly with SVG output for pic, eqn, grap, etc.). Note that SVG support (as well as eqn-to-SVG) is already in plan for PDF::Builder.

That said, here are some additional features for column(), many inspired by troff, that could be implemented:

CSS "white-space" with settings:

PhilterPaper commented 1 year ago

A library to consider for further use: Text::Markup. Note that it does not natively translate many other formats, but appears to be a front-end for a lot of other translator packages. If nothing else, it is a pointer to useful other packages.

Here are the formats currently supported:

The only major formats missing from the list are troff and the closely related nroff/man, and of course, LaTeX. As I said earlier, groff may be suitable to convert the former to HTML. I don't see any point in converting (La)TeX to HTML, as excellent PDF output already exists for this family of markup (there are also good HTML output converters, should you want to embed some LaTeX-source documentation within a larger PDF). Pod conversion makes use of Pod::Simple::XHTML, which is already recommended for use in building HTML documentation for PDF::Builder (called by docs/buildDoc.pl).

At this point, I can't see it being worthwhile to directly support any of these, or Text::Markup itself, as few would make good document inputs. You might want to embed the documentation for some program (as Pod, man, etc.), so there's certainly nothing to prevent you from running a converter (e.g., Text::Markup) externally and then dealing with the resulting HTML, which column() can import. Don't forget to do something about internal links, which may be expecting to link to HTML documents, if those target documents have also been converted to PDF.
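
One way to handle those internal links before handing the HTML to column() is a small preprocessing pass, sketched here in Python for illustration. The function name rewrite_links and the file names are hypothetical, and the regular expression only handles double-quoted href attributes; what a #fragment should mean inside the target PDF is left open.

```python
import re

HREF = re.compile(r'href="([^"]*)"')

def rewrite_links(markup, converted):
    """Point links at .pdf files for HTML targets that were also converted.

    converted: set of HTML file names known to exist as PDFs.
    """
    def fix(match):
        target = match.group(1)
        path, sep, fragment = target.partition("#")
        if path in converted and path.endswith(".html"):
            path = path[: -len(".html")] + ".pdf"
        return 'href="' + path + sep + fragment + '"'
    return HREF.sub(fix, markup)

page = ('<a href="Builder.html#new">new</a> '
        '<a href="https://example.com/x.html">elsewhere</a>')
fixed = rewrite_links(page, {"Builder.html"})
```

Only targets listed in the converted set are touched, so links to web pages that remain HTML are left alone.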

PhilterPaper commented 2 weeks ago

In addition to <eqn> (display and inline), consider a form of those tailored for chemistry (<chem> and <dchem>). This would use the MathJax markup after modifying the user input source to use Roman/normal weight for text. It would need to be an improvement over explicitly using super- and sub-scripts to be worth the effort.