New contributions wanted

PhilterPaper commented 6 years ago

Looking at wkHTMLtoPDF help requests, there are some ideas for additional "contrib" utilities for PDF::Builder (contrib/):

- Add page headers and footers, or replace existing ones, with new content such as page numbering, a given date, and fixed text. It's probably too much trouble to try to extract text for the header/footer from the page content, although it might be possible based on relative text sizes (assume that larger text is a section heading). Existing page numbers might also be extracted before being overwritten (e.g., to move or reformat them). It might be better to have a human examine the page and designate where existing headers and footers are, so they can be cleanly removed before new ones are added. Chapter and section text in the replacement header/footer would probably have to be manually added per page range. Remember that TTF/OTF text will be stored by glyph ID, making it difficult (though not impossible) to extract text from existing headers and footers for reuse (if you need to read and process the text, rather than just treating it as a text blob).
- Extend the background, etc. to the bottom of the last page. This comes from a request to carry a body background color all the way to the end of the last page, even if the text content ends partway down. Ditto for background watermarks, images, etc., which may be incomplete on the existing last page and have to be grabbed from a previous page.
- Find presumed section headers (based on relative font size), and clean up orphans by moving some content to the next page. This would of course have a cascading effect on all the content that follows. It might be better to leave selection of new page breaks to a human user.
- Clean up known problems in the output of other packages such as wkHTMLtoPDF, for example improper splitting of tables (in the middle of a line of text, a thead immediately before a page break, etc.). If clear patterns can be discovered, such as a line at the bottom of one page repeated at the top of the next, this might be feasible, although it would be easy to go down a rabbit hole with something like this! Again, this is probably best marked up manually, to move the desired page break location. Don't forget that table borders (outlines) would need to be reduced/expanded.
- Reflow a document to new page sizes and margins. This requires being able to recognize paragraphs. Recognition might be from inter-paragraph vspace, indentation, or short last lines, along with manual cleanup for missed cases.
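
To make the paragraph-recognition idea in the last item a bit more concrete, here is a minimal sketch of the three heuristics (extra vertical space, indentation, short last line). It is written in Python purely as an illustration and uses no PDF library; the `Line` record, the `group_paragraphs` name, and the threshold values are all assumptions of mine, and it presumes the text lines and their positions have already been extracted from the page somehow.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class Line:
    text: str
    x_left: float   # left edge of the line, in points
    x_right: float  # right edge of the line, in points
    y: float        # baseline, in points (PDF convention: y grows upward)

def group_paragraphs(lines, gap_factor=1.4, indent_min=6.0, short_ratio=0.85):
    """Group the lines of one column (given top to bottom) into paragraphs.

    A new paragraph starts when the vertical gap is clearly larger than the
    usual leading, when the line is indented, or when the previous line
    stopped well short of the right margin. All thresholds are guesses.
    """
    if not lines:
        return []
    left_margin = min(l.x_left for l in lines)
    right_margin = max(l.x_right for l in lines)
    width = right_margin - left_margin
    gaps = [a.y - b.y for a, b in zip(lines, lines[1:])]
    leading = median(gaps) if gaps else 0.0  # typical baseline distance

    paragraphs = [[lines[0]]]
    for prev, cur in zip(lines, lines[1:]):
        big_gap = leading > 0 and (prev.y - cur.y) > gap_factor * leading
        indented = (cur.x_left - left_margin) >= indent_min
        prev_short = width > 0 and (prev.x_right - left_margin) < short_ratio * width
        if big_gap or indented or prev_short:
            paragraphs.append([cur])
        else:
            paragraphs[-1].append(cur)
    return [" ".join(l.text for l in p) for p in paragraphs]
```

Headings, lists, and centered or ragged-right text would all confuse these rules, which is where the manual cleanup comes in.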

Things like extracting pages and combining them into new documents might best be left to existing tools such as PDFtk, although something might be done with (manually) trimming unwanted leading and trailing content during extraction, and possibly reflowing what remains onto new pages.

PhilterPaper commented 2 years ago

Add headers, footers, and page numbers to existing PDF docs missing them, and/or use the same labels on the slider thumb (pageLabel call) and in outlines/bookmarks. I have a wishlist item in the RoadMap to take the same formatted page number as produced by pageLabel, print it in a header/footer, and use it in outlines (bookmarks). Don't forget to find and update page cross-references within the text (not just in headers/footers).
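
As a rough illustration of "one formatted page number, used everywhere", the sketch below produces a label string per page from a list of numbering ranges, so the same string could be handed to the header/footer code and to the outline code. It is plain Python, not PDF::Builder code, and does not call pageLabel; the function names and the `(first_page_index, style, prefix, start)` tuple layout are assumptions. The style letters follow the PDF page label styles (`D` decimal, `R`/`r` Roman, `A`/`a` letters, where letters run A to Z, then AA to ZZ, and so on).

```python
def to_roman(n):
    """Uppercase Roman numeral for a positive integer."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)

def to_letters(n):
    """A..Z for 1..26, then AA..ZZ for 27..52, then AAA.. and so on."""
    repeat, index = divmod(n - 1, 26)
    return chr(ord("A") + index) * (repeat + 1)

def format_label(number, style="D", prefix=""):
    """Format one page number in the given style, with an optional prefix."""
    if style == "D":
        body = str(number)
    elif style == "R":
        body = to_roman(number)
    elif style == "r":
        body = to_roman(number).lower()
    elif style == "A":
        body = to_letters(number)
    elif style == "a":
        body = to_letters(number).lower()
    else:
        raise ValueError(f"unknown label style: {style!r}")
    return prefix + body

def labels_for(page_count, ranges):
    """Return one label per page.

    ranges is a list of (first_page_index, style, prefix, start) tuples,
    sorted by first_page_index (0-based), with the first range starting at 0.
    """
    labels = []
    for i, (first, style, prefix, start) in enumerate(ranges):
        last = ranges[i + 1][0] if i + 1 < len(ranges) else page_count
        for page in range(first, min(last, page_count)):
            labels.append(format_label(start + page - first, style, prefix))
    return labels
```

For example, `labels_for(12, [(0, "r", "", 1), (4, "D", "", 1), (10, "D", "A-", 1)])` gives four lowercase Roman front-matter pages, six decimal body pages, and two appendix pages labeled `A-1` and `A-2`.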

PhilterPaper commented 11 months ago

This page on StackExchange Software Recommendations lists a number of requests for tools to handle various PDF tasks. If you are interested in starting a project (especially one using PDF::Builder) but don't have any ideas, something there might get you going. Also, in #199 I suggest the need for a PDF debugger or validator, which might use PDF::Builder.

Of course, you are perfectly free to offer (and support) any of these projects on their own (on GitHub, CPAN, or any other public repository). If they are reasonably small tools, you are welcome to donate them to the PDF::Builder repository for the contrib/ section.

PhilterPaper commented 2 months ago

Needless to say, proposals for bug fixes and enhancements, not just new 'contrib/' submissions, are always welcome! I don't have time to get to all the stuff here all by myself, and could use some help. Just remember to bounce proposals off me before engaging in a large amount of work, so you don't end up wasting your time. That's what happened to me when I submitted major fixes to PDF::API2, and the owner simply discarded them because he didn't feel like working on them (and without a word to me). This led to my forking PDF::API2 as PDF::Builder.

PhilterPaper commented 1 month ago

In the first post, I listed a number of things where "human intervention" would be best for determining where to change things in a PDF. I wonder if a PDF display engine could be combined with some sort of graphical interface to manually mark text to be moved, deleted, or edited. There are a number of text and/or graphics toolkits for Perl, but I don't think any of them do PDF rendering. Rather than reinvent the wheel, perhaps some sort of "overlay" interface could be dropped on top of a PDF reader? Or, the user might render the PDF as an image, and plenty of toolkits can use an image as a background. It would "just" be a matter of correlating the image to the internal PDF structure, to determine which content the user wants changed.

The idea is to be able to mark certain PDF content to be modified in some manner, by telling a PDF::Builder-based utility to read in the PDF and delete/change/edit/move specified content.
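
For the "render as an image and correlate" route, the basic coordinate arithmetic might look like the sketch below (Python, for illustration only). It assumes an unrotated page rendered at a known DPI with no cropping, and it assumes that bounding boxes for the page's text blocks (in PDF points) are already available from somewhere; `pixel_to_pdf`, `hit_test`, and the box names are hypothetical. The only fixed facts used are that PDF default user space is 72 units per inch with the origin at the lower left, while image pixels are normally counted from the upper left.

```python
def pixel_to_pdf(px, py, dpi, media_box):
    """Convert an image pixel position to PDF user-space coordinates.

    media_box is (llx, lly, urx, ury) in points. The image is assumed to
    cover exactly the media box, unrotated, with pixel (0, 0) at top left.
    """
    llx, lly, urx, ury = media_box
    scale = 72.0 / dpi              # points per pixel
    return llx + px * scale, ury - py * scale

def hit_test(point, boxes):
    """Return the names of all boxes (x0, y0, x1, y1) containing the point."""
    x, y = point
    return [name for name, (x0, y0, x1, y1) in boxes.items()
            if x0 <= x <= x1 and y0 <= y <= y1]
```

A click at pixel (144, 144) on a US Letter page rendered at 144 DPI maps to (72, 720) in points, which can then be tested against whatever boxes were collected for that page. Finding those boxes in the content stream is of course the hard part.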

PhilterPaper / Perl-PDF-Builder

New contributions wanted #88