flyingsaucerproject / flyingsaucer

XML/XHTML and CSS 2.1 renderer in pure Java

JSoup HTML parser in separate module #391

Closed: andreasrosdal closed this pull request 2 months ago

pbrant commented 2 months ago

Hey Andreas, thanks for the PR. I appreciate the effort that went into it. I'm afraid it's kind of an example of "hunting mice with an elephant gun" though.

It would be less invasive to add a service interface to allow a user to swap out the DOM parser implementation used by the Swing-based mini-browser (either auto-configured by the presence of the module or explicitly swapped out through configuration). I think I may have suggested this before.
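A minimal sketch of what such a service interface could look like, in plain Java. The names `DomParser`, `XmlDomParser`, and `DomParsers` are hypothetical and are not part of Flying Saucer's API. The sketch assumes the renderer only needs an `org.w3c.dom.Document`, and it covers both options mentioned above: an explicit swap through configuration, and auto-configuration through the JDK's `ServiceLoader` when a module on the classpath provides an implementation.

```java
import java.io.StringReader;
import java.util.Iterator;
import java.util.ServiceLoader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical service interface (an assumption, not an existing Flying Saucer type).
interface DomParser {
    Document parse(String markup) throws Exception;
}

// Default implementation: strict XML/XHTML parsing with the JDK's built-in JAXP parser.
final class XmlDomParser implements DomParser {
    @Override
    public Document parse(String markup) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
    }
}

// Hypothetical lookup helper that decides which parser to use.
final class DomParsers {
    private static volatile DomParser override;

    private DomParsers() {
    }

    // Explicit swap through configuration; pass null to clear the override.
    static void use(DomParser parser) {
        override = parser;
    }

    // Resolution order: explicit override, then the first provider found by
    // ServiceLoader (registered under META-INF/services), then the XML default.
    static DomParser resolve() {
        DomParser explicit = override;
        if (explicit != null) {
            return explicit;
        }
        Iterator<DomParser> discovered = ServiceLoader.load(DomParser.class).iterator();
        return discovered.hasNext() ? discovered.next() : new XmlDomParser();
    }
}
```

With this shape, an optional HTML parser module would only need to ship one `DomParser` implementation plus a provider registration file, and callers would keep using `DomParsers.resolve().parse(markup)` unchanged.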

Supporting additional CSS properties is almost entirely orthogonal to which DOM parser is in use. It could be done with an XML, XHTML, or HTML5 parser creating the DOM.

We do have some experience with copy-and-pasted modules. The old flying-saucer-pdf-itext5 module was effectively a clone of flying-saucer-pdf with package changes and minor API updates.

To put it bluntly, it was a disaster. It had already bitrotted rather badly by the time it was deleted, as most contributed fixes only touched flying-saucer-pdf. I'm quite happy that Andrei had the courage to delete it.

It's awesome that you'd like to start experimenting with supporting more CSS properties and adding JavaScript. It is a hugely ambitious task. I'd suggest starting that effort in a separate fork to see how it goes.

asolntsev commented 1 month ago

@andreasrosdal @pbrant In fact, we already have an example showing how to use JSoup to parse HTML:

https://github.com/flyingsaucerproject/flyingsaucer/blob/main/flying-saucer-examples/src/test/java/org/xhtmlrenderer/pdf/PdfFromInvalidHtmlTest.java

But yes, we could improve it even more with a service loader mechanism...