flyingsaucerproject / flyingsaucer

XML/XHTML and CSS 2.1 renderer in pure Java

simplify API for creating PDF #277

Closed asolntsev closed 7 months ago

asolntsev commented 7 months ago

Users now need to write less code to generate a PDF from HTML.

This commit adds two new APIs:

  1. class Html2Pdf (a single method for generating PDF from HTML)
  2. renderer.createPDF(doc, os) (instead of old sequence setDocument, layout, createPDF)

The work continues...

andreasrosdal commented 7 months ago

Is it possible to add an XHTML validator/regex fixer class as part of this, to be used optionally, something like the code below? Maybe use ChatGPT to create a general XHTML cleanup, fixing, and validation class. It could also handle other common edge cases that cause XHTML validation errors.

Both Google Chrome and Firefox can handle some invalid HTML to an extent.

https://chat.openai.com/share/68dde394-7071-4657-b1ce-b898fe5c4465

import java.util.regex.*;

public class XHTMLValidator {

    public static String fixXHTML(String input) {
        // Lowercase start-tag names (attributes are left untouched)
        input = fixStartTags(input);
        // Lowercase end-tag names (anything after the name is dropped)
        input = fixEndTags(input);
        return input;
    }

    private static String fixStartTags(String input) {
        Pattern startTagPattern = Pattern.compile("<(\\w+)([^>]*)>");
        Matcher matcher = startTagPattern.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            String tag = matcher.group(1);
            String attributes = matcher.group(2);
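            // quoteReplacement keeps '$' and '\' in attribute values from being read as group references
            matcher.appendReplacement(sb, Matcher.quoteReplacement("<" + tag.toLowerCase() + attributes + ">"));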
        }
        matcher.appendTail(sb);
        return sb.toString();
    }

    private static String fixEndTags(String input) {
        Pattern endTagPattern = Pattern.compile("</(\\w+)([^>]*)>");
        Matcher matcher = endTagPattern.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            String tag = matcher.group(1);
            matcher.appendReplacement(sb, "</" + tag.toLowerCase() + ">");
        }
        matcher.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<html><BODY><P>Invalid<p>missing end tag</BODY></html>";
        String fixedHTML = XHTMLValidator.fixXHTML(html);
        // Tag names are lowercased; the unclosed <p> is not repaired.
        System.out.println("Fixed XHTML:\n" + fixedHTML);
    }
}

asolntsev commented 7 months ago

@andreasrosdal Yes, generally we could add some validation/html cleanup. But using regular expressions doesn't seem to be a good idea for this purpose: https://medium.com/thecyberfibre/stop-parsing-x-html-with-regular-expression-2cf13215b411
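
To illustrate the concern, here is a minimal sketch using only the JDK. The names `RegexFixCheck`, `lowercaseTags`, and `isWellFormed` are hypothetical, and `lowercaseTags` is a condensed stand-in for the fixer proposed above, not part of Flying Saucer. It suggests that lowercasing tag names leaves the sample input not well-formed, because the unclosed `<p>` is still there.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class RegexFixCheck {
    // Hypothetical stand-in for the proposed fixer: lowercases tag names only.
    private static final Pattern TAG = Pattern.compile("<(/?)(\\w+)([^>]*)>");

    static String lowercaseTags(String input) {
        return TAG.matcher(input).replaceAll(m -> Matcher.quoteReplacement(
                "<" + m.group(1) + m.group(2).toLowerCase() + m.group(3) + ">"));
    }

    // True if the JDK XML parser accepts the input as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String fixed = lowercaseTags("<html><BODY><P>Invalid<p>missing end tag</BODY></html>");
        System.out.println(fixed);               // <html><body><p>Invalid<p>missing end tag</body></html>
        System.out.println(isWellFormed(fixed)); // false
    }
}
```

Closing the dangling `<p>` correctly requires knowing which elements are open at each point in the document. That is what a parser tracks and what a per-tag regular expression cannot.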

andreasrosdal commented 7 months ago

How about using Jsoup to fix invalid XHTML? https://jsoup.org/

In general I agree that regular expressions should not be used to fix bad HTML. However, we could use something, and regular expressions are at least one of the possible options. Jsoup is probably better.

andreasrosdal commented 7 months ago

https://github.com/jhy/jsoup/

asolntsev commented 7 months ago

@andreasrosdal Yes, Jsoup sounds good. Could you please share some examples of such "invalid XHTML" needing a cleanup? I am trying to understand what problem we want to solve.

andreasrosdal commented 7 months ago

Here are some examples of HTML changes that had to be made to turn the HTML into valid XHTML for Flying Saucer PDF export: (four images attached)

asolntsev commented 7 months ago

@andreasrosdal Thank you for the samples. Some of these really could be fixed automatically (e.g. <link> -> <link></link>), but others are actually invalid, and FS should throw an exception in these cases (e.g. <<i class="">).
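
As a rough way to see the difference between the two kinds of samples, the sketch below feeds three fragments to the JDK XML parser and reports what it complains about. `WellFormednessProbe` and `parseError` are hypothetical names, and the attribute values are made up for illustration. This is only an assumption about how such inputs could be triaged, not how FS handles them.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormednessProbe {
    // Returns the parser's error message, or empty if the fragment is well-formed XML.
    static Optional<String> parseError(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return Optional.empty();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return Optional.of(String.valueOf(e.getMessage()));
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<head><link rel=\"stylesheet\" href=\"a.css\"></head>",        // unclosed <link>
            "<head><link rel=\"stylesheet\" href=\"a.css\"></link></head>", // closed <link>
            "<p><<i class=\"\">text</i></p>"                                // stray '<'
        };
        for (String sample : samples) {
            System.out.println(sample + " => " + parseError(sample).orElse("well-formed"));
        }
    }
}
```

The unclosed `<link>` has one obvious repair, which is to close it. The stray `<` has no unambiguous repair, which is why an exception seems the safer response there.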

andreasrosdal commented 7 months ago

I wish FS would handle these cases rather than throw an exception, in the same way that Firefox and Chrome are fault-tolerant toward invalid HTML in many cases.