Closed asolntsev closed 7 months ago
Is it possible to add an XHTML Validator Regex Fixer class as part of this, to be used optionally, something like this? Maybe use ChatGPT to create a general XHTML cleanup, fixing, and validation class. Possibly also add handling of other common edge cases that cause XHTML validation errors.
Both Google Chrome and Firefox can handle some invalid HTML to an extent.
https://chat.openai.com/share/68dde394-7071-4657-b1ce-b898fe5c4465
import java.util.regex.*;

public class XHTMLValidator {

    public static String fixXHTML(String input) {
        // Lowercase the tag names of start tags
        input = fixStartTags(input);
        // Lowercase the tag names of end tags and drop anything after the name
        input = fixEndTags(input);
        return input;
    }

    private static String fixStartTags(String input) {
        Pattern startTagPattern = Pattern.compile("<(\\w+)([^>]*)>");
        Matcher matcher = startTagPattern.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            String tag = matcher.group(1);
            String attributes = matcher.group(2);
            // quoteReplacement keeps '$' and '\' in attribute values from being
            // interpreted as group references by appendReplacement
            String replacement = "<" + tag.toLowerCase() + attributes + ">";
            matcher.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(sb);
        return sb.toString();
    }

    private static String fixEndTags(String input) {
        Pattern endTagPattern = Pattern.compile("</(\\w+)([^>]*)>");
        Matcher matcher = endTagPattern.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            String tag = matcher.group(1);
            matcher.appendReplacement(sb, Matcher.quoteReplacement("</" + tag.toLowerCase() + ">"));
        }
        matcher.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Note: this only normalizes tag-name case; the unclosed <p> elements stay unclosed
        String html = "<html><BODY><P>Invalid<p>missing end tag</BODY></html>";
        String fixedHTML = XHTMLValidator.fixXHTML(html);
        System.out.println("Fixed XHTML:\n" + fixedHTML);
    }
}
@andreasrosdal Yes, generally we could add some validation/html cleanup. But using regular expressions doesn't seem to be a good idea for this purpose: https://medium.com/thecyberfibre/stop-parsing-x-html-with-regular-expression-2cf13215b411
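One way to see the limitation concretely: lowercasing the tag names of the sample input from the snippet above yields `<html><body><p>Invalid<p>missing end tag</body></html>`, which still has two unclosed `<p>` elements. The following is a minimal sketch, using only the JDK's built-in XML parser, that checks whether a string is well-formed XML. The class and method names (`WellFormedCheck`, `isWellFormed`) are made up for illustration and are not part of Flying Saucer, and the sketch assumes the input carries no DOCTYPE that would need resolving.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not a Flying Saucer API.
class WellFormedCheck {

    // Returns true if the markup parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations surface as SAX parse exceptions
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Output of the regex-based fixer for its own sample input: still not well-formed
        System.out.println(isWellFormed("<html><body><p>Invalid<p>missing end tag</body></html>"));
        // The same content with the paragraphs closed
        System.out.println(isWellFormed("<html><body><p>Invalid</p><p>missing end tag</p></body></html>"));
    }
}
```

The first call prints `false` and the second prints `true`, so a case-normalizing regex pass alone would not make the sample acceptable to an XML parser.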
How about using Jsoup to fix invalid XHTML? https://jsoup.org/
In general I agree that regular expressions should not be used to fix bad HTML. However, we could use something, and regular expressions are at least one of the possible alternatives. Jsoup is probably better.
@andreasrosdal Yes, JSoup sounds good. Could you please share some examples of such "invalid XHTML" needing a cleanup? I am trying to understand what problem we want to solve.
Here are some examples of HTML changes that had to be made to make the HTML valid XHTML for Flying Saucer PDF export:
@andreasrosdal Thank you for the samples. Some of these could really be replaced automatically (e.g. <link> -> <link></link>), but some others are actually invalid, and FS should throw an exception in these cases (e.g. <<i class="">).
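As a rough illustration of the "replaced automatically" case, the sketch below appends an explicit end tag to unclosed HTML void elements such as `<link>`. It is only a sketch under stated assumptions: the class name `VoidElementCloser` and the list of element names are chosen here for illustration, it is not what Flying Saucer does, and because it relies on a regular expression it will misfire on attribute values that contain `>`. A real parser such as Jsoup, as suggested earlier in the thread, would be the more robust route.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Flying Saucer API.
class VoidElementCloser {

    // Matches a void-element start tag that is neither self-closed ("/>")
    // nor immediately followed by its own end tag.
    private static final Pattern UNCLOSED_VOID = Pattern.compile(
            "<(area|base|br|col|embed|hr|img|input|link|meta|source|track|wbr)\\b"
                    + "([^<>]*?)(?<!/)>(?!\\s*</\\1\\s*>)",
            Pattern.CASE_INSENSITIVE);

    // Rewrites e.g. <link href="a.css"> to <link href="a.css"></link>.
    static String closeVoidElements(String html) {
        return UNCLOSED_VOID.matcher(html).replaceAll("<$1$2></$1>");
    }
}
```

For example, `VoidElementCloser.closeVoidElements("<head><link rel=\"stylesheet\" href=\"a.css\"></head>")` returns `<head><link rel="stylesheet" href="a.css"></link></head>`, while `<br/>` and `<link></link>` pass through unchanged. Input such as `<<i class="">` is deliberately left alone, since it is invalid rather than merely unclosed.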
I wish that FS would handle these cases and not throw an exception, in the same way that Firefox and Chrome are fault-tolerant of invalid HTML in many cases.
Now the user needs to write less code to generate a PDF from HTML.
This commit adds 2 new APIs: … (setDocument, layout, createPDF). The work continues...