How skip HTML validation while generating PDF?

asolntsev commented 9 months ago

Reported by [Ezhil](mailto:ezhilre@gmail.com)

Team, I am trying to create a PDF from a page URL, but I am getting the following error:

Can't load the XML resource (using TrAX transformer). org.xml.sax.SAXParseException; lineNumber: 6; columnNumber: 14; Open quote is expected for attribute "name" associated with an element type "meta".

It looks like renderer.setDocument(urlcheck) checks whether the page at the URL has proper start and end HTML tags. Is there any way we can skip this validation?

try {
  // Define the URL
  String urlcheck = "https://en.wikipedia.org/wiki/IPhone_15";

  // Establish a URL connection
  HttpURLConnection connection = (HttpURLConnection) new URL(urlcheck).openConnection();
  connection.setRequestMethod("GET");

  // Check the response code (200 indicates success)
  int responseCode = connection.getResponseCode();
  if (responseCode == 200) {
    // Get the input stream from the connection
    InputStream urlInputStream = connection.getInputStream();

    // Create an ITextRenderer instance
    ITextRenderer renderer = new ITextRenderer();

    // Set the HTML content as the document
    renderer.setDocument(urlcheck);

    // Render to PDF
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    renderer.layout();
    renderer.createPDF(outputStream);
    renderer.finishPDF();
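  }
} catch (Exception e) {
  // Report any network, parsing, or rendering failure
  e.printStackTrace();
}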

asolntsev commented 9 months ago

Answer from Peter Brant:

I'm afraid that won't work in general. FS is a pretty complete static implementation of CSS 2.1. It does not support JavaScript or the many, many features subsequently added to CSS and HTML.

In order to limit the number of external dependencies, FS only supports XML input out of the box, but it provides the facilities to use your own parser as long as the output of that parser is a W3C DOM Document (org.w3c.dom.Document).
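To illustrate why the error in the report appears, here is a minimal sketch that uses only the JDK's built-in XML parser, not Flying Saucer itself. The class name `XmlStrictnessDemo`, the helper `parseOrNull`, and the two sample strings are made up for this example. It shows that typical "tag soup" HTML with an unquoted attribute is rejected as malformed XML, while the same markup written as well-formed XHTML parses into an `org.w3c.dom.Document`.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class; not part of Flying Saucer.
public class XmlStrictnessDemo {

    // Parses markup with the JDK XML parser.
    // Returns the DOM Document, or null if the markup is not well-formed XML.
    static Document parseOrNull(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Rethrow parse errors instead of letting the default handler print them to stderr.
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) throws SAXException { throw e; }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        try {
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (SAXParseException e) {
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // Unquoted attribute values and an unclosed <meta>: valid HTML, but not XML.
        String tagSoup =
            "<html><head><meta name=viewport content=x></head><body><p>Hi</p></body></html>";
        // Same content as well-formed XHTML: quoted attributes, self-closed <meta/>.
        String xhtml =
            "<html><head><meta name=\"viewport\" content=\"x\"/></head><body><p>Hi</p></body></html>";

        System.out.println("tag soup parsed: " + (parseOrNull(tagSoup) != null));
        System.out.println("XHTML parsed: " + (parseOrNull(xhtml) != null));
    }
}
```

So there is no validation step to switch off: the input simply has to be parseable as XML, or be converted to a DOM Document by a lenient HTML parser first. As far as I know, such a Document can then be handed to the renderer through the `setDocument` overload that takes a Document and a base URL.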

pbrant commented 5 months ago

See also the JSoup example provided in #299. A similar technique would work with https://github.com/HtmlUnit/htmlunit-neko or the validator.nu HTML5 parser.

I'm afraid the first paragraph above still applies though. Sites that use JavaScript, CSS Flexbox, CSS Grid, etc. will still be pretty broken. There is no easy fix there.

flyingsaucerproject / flyingsaucer

How skip HTML validation while generating PDF? #228