flyingsaucerproject / flyingsaucer

XML/XHTML and CSS 2.1 renderer in pure Java

Support for HTML 5 #282

Open andreasrosdal opened 3 months ago

andreasrosdal commented 3 months ago

FS should support HTML 5.

To give the flyingsaucerproject/flyingsaucer library essential HTML5 support, focus on the areas that matter most for modern web documents (ChatGPT suggestions):

  1. HTML5 Parsing: Integrate an HTML5-compliant parser to accurately handle HTML5 documents. This is crucial for recognizing new semantic elements and properly parsing the document structure.

  2. CSS3 Enhancements: Update the CSS rendering engine to support important CSS3 features such as flexbox for layout, media queries for responsive design, and transitions for visual effects. These are foundational for modern web design practices.

  3. Semantic Elements Support: Specifically target support for new semantic elements like <article>, <section>, <nav>, <header>, <footer>, and <figure>. Ensuring these elements are correctly interpreted and rendered is essential for modern web documents.

  4. Form Controls and Input Types: Enhance support for the new form elements and input types introduced in HTML5. This includes types like email, date, range, and color, which are increasingly used in web forms.

  5. JavaScript Interface: Since HTML5 relies on JavaScript for dynamic content, consider how flyingsaucer might either interface with JavaScript or provide hooks for external JavaScript interaction, especially for form validation and handling new input types.

  6. Test Suite for HTML5: Develop a targeted test suite focusing on HTML5 features to ensure compatibility and adherence to standards. Utilize parts of the W3C HTML5 Test Suite for comprehensive coverage.

  7. Documentation and Modular Approach: Update documentation to reflect the support for HTML5 and consider a modular approach for HTML5 features, allowing users to enable specific functionalities as needed. This strategy helps in managing performance implications and maintains backward compatibility.

By concentrating on these aspects, flyingsaucer can significantly improve its HTML5 support, aligning it with current web standards and enhancing its utility for modern web document rendering.
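For item 3, the HTML specification's suggested rendering gives the new sectioning elements `display: block`. A minimal Java sketch of generating such a default rule is shown below. The class `Html5DefaultStyles` and its method are hypothetical names used only for illustration, and this thread does not say whether Flying Saucer's built-in stylesheet already covers these elements.

```java
import java.util.List;

// Hypothetical helper; not part of Flying Saucer.
final class Html5DefaultStyles {
    private Html5DefaultStyles() {}

    // The semantic elements named in item 3 of the list above.
    static final List<String> BLOCK_ELEMENTS =
            List.of("article", "section", "nav", "header", "footer", "figure");

    // Builds one CSS rule that renders all of these elements as blocks.
    static String blockDisplayRule() {
        return String.join(", ", BLOCK_ELEMENTS) + " { display: block; }";
    }
}
```

A rule like this would typically be placed in a user-agent default stylesheet so that author styles can still override it.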

Integrating an HTML5-compliant parser into the flyingsaucerproject/flyingsaucer library involves several detailed steps to ensure accurate handling of HTML5 documents. These steps are crucial for recognizing new semantic elements and properly parsing the document structure:

  1. Evaluate Existing Parser: Assess the capabilities and limitations of the current parsing mechanism in flyingsaucer to understand how it handles HTML and where it falls short with HTML5 content.

  2. Select an HTML5 Parser: Choose an HTML5-compliant parser that can be integrated into flyingsaucer. Candidates in the Java ecosystem include jsoup, which is an HTML5 parser library, and HtmlUnit, which is a headless browser that ships its own HTML parser (htmlunit-neko).
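For step 1, the gap is easy to see with any strict XML parser. The sketch below uses the JDK's own `DocumentBuilder` as a stand-in; that is an assumption made for illustration, and it is not Flying Saucer's actual loading code. The class name `StrictXmlCheck` is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; stands in for a strict XML loading path.
final class StrictXmlCheck {
    private StrictXmlCheck() {}

    // Returns true only if the markup is well-formed XML.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Raised for markup that is not well-formed, such as an unclosed <br>.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this helper, `isWellFormedXml("<p>Hello<br/>world</p>")` returns `true`, while `isWellFormedXml("<p>Hello<br>world</p>")` returns `false`, even though the second form is ordinary HTML5 that browsers accept.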

https://www.w3.org/TR/2011/WD-html5-20110405/
https://html.spec.whatwg.org/

Possibly some implementation details can be copied from: https://github.com/openhtmltopdf/openhtmltopdf/

rbri commented 3 months ago

Maybe https://github.com/HtmlUnit/htmlunit-neko is of help here. This

Because my time is limited I can't provide an implementation, but I will support this if you like...

rbri commented 1 month ago

Hey folks, I did some minor experiments...

Then I started the browser and pointed it at a plain HTML page: image

I think this is not that bad compared to: image

rbri commented 1 month ago

Because neko fixes many issues in real-world documents, I was also able to open https://www.htmlunit.org/

Before: image

After: image

andreasrosdal commented 1 month ago

@rbri I would like to encourage you to open a pull request that lets Flying Saucer use the htmlunit-neko HTML parser by default, without any configuration hassle, because this HTML parser is clearly better than the current XML SAX parser. That would make it much easier to recommend Flying Saucer to the developers at the company where I work. At the moment FS is a poor fit because it only supports strict XHTML, and developers are looking for alternatives.

rbri commented 1 month ago

@andreasrosdal The PR is there ;-) I guess we need some discussion about the right way to do it (maybe a service and a different subproject?)
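One possible reading of the "service" idea is a small parser SPI discovered through `java.util.ServiceLoader`, with the strict XML path as the fallback. The sketch below only illustrates that shape. `MarkupParser`, `StrictXmlMarkupParser`, and `MarkupParsers` are hypothetical names, not types from Flying Saucer or from the PR, and the fallback uses the JDK's `DocumentBuilder` as a stand-in.

```java
import java.io.StringReader;
import java.util.ServiceLoader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical SPI: a separate subproject could provide an HTML5-tolerant implementation.
interface MarkupParser {
    Document parse(String markup);
}

// Fallback used when no provider is registered (assumed strict XML behaviour).
final class StrictXmlMarkupParser implements MarkupParser {
    @Override
    public Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup is not well-formed XML", e);
        }
    }
}

final class MarkupParsers {
    private MarkupParsers() {}

    // Picks the first provider listed under META-INF/services, else the strict default.
    static MarkupParser select() {
        return ServiceLoader.load(MarkupParser.class)
                .findFirst()
                .orElseGet(StrictXmlMarkupParser::new);
    }
}
```

With this layout the core library would keep no hard dependency on any HTML parser. Adding a provider jar to the classpath would switch parsing over without configuration, which matches the "default way without any hassle" request earlier in the thread.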