flyingsaucerproject / flyingsaucer

XML/XHTML and CSS 2.1 renderer in pure Java

Use Jsoup to parse HTML #327

Closed andreasrosdal closed 5 months ago

andreasrosdal commented 6 months ago

Use JSOUP to parse HTML. https://jsoup.org/

Deprecate XMLResource. Add HTMLResource.

Jsoup is an HTML parser, so it may handle HTML better than the SAX parser currently in use. I consider this a proposal and a step in the right direction, although I am not yet sure I understand all the consequences. Related: #279 and #282.

Further, Jsoup is an HTML parser, and most users of Flying Saucer will expect ordinary HTML to be accepted, not just XHTML. The current parser in Flying Saucer is a SAX-based XML parser, which throws exceptions if the input is not valid XHTML.
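To illustrate the strictness described above, here is a minimal sketch that uses only the JDK's built-in XML parser as a stand-in for the XML parsing path; it is not Flying Saucer's own `XMLResource` code. The class name `StrictXmlParsingDemo` and the method `parsesAsXml` are made up for this example. It shows that well-formed XHTML is accepted, while common HTML such as an unclosed `<br>` is rejected.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: not part of Flying Saucer or Jsoup.
public class StrictXmlParsingDemo {

    /** Returns true if the JDK's XML parser accepts the markup as well-formed XML. */
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Raised for markup that is not well-formed XML.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>Hello<br/>world</p></body></html>";
        String html = "<html><body><p>Hello<br>world</body></html>";
        System.out.println(parsesAsXml(xhtml)); // prints "true"
        System.out.println(parsesAsXml(html));  // prints "false"
    }
}
```

A lenient HTML parser such as Jsoup would be expected to build a document from the second string instead of failing, which is the motivation for this proposal.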

And importantly, we need to make sure that this doesn't introduce any XSS or HTML based vulnerabilities.

andreasrosdal commented 6 months ago

@jhy Does this use of Jsoup look fine?

andreasrosdal commented 6 months ago

I think Jsoup can help bring HTML5 support to FS eventually, and generally improve the HTML parsing. Specifying parsing tolerance would be nice. We can try to find out how to do this using Jsoup.

pbrant commented 5 months ago

I agree with Andrei.

FS is fundamentally a library meant to be used as part of a larger application. We wouldn't want to force every user to include jsoup as a dependency.

A good alternative approach would be to make this a separate optional module (à la flying-saucer-log4j) that users can use if they want, but can otherwise ignore if they're happy with what they're currently doing.

andreasrosdal commented 5 months ago

Yes, I can address these comments and move the jsoup support to a separate module.

How about a separate module for htmlunit-neko as well, done in a similar way? As pointed out by @rbri in https://github.com/flyingsaucerproject/flyingsaucer/issues/282#issuecomment-2133685493, htmlunit-neko also seems to be a quite capable HTML parser. So I am thinking it could be useful to support both the jsoup and htmlunit-neko parsers, but I'm not sure how at this time.
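One possible shape for supporting several parsers is a small parser interface in the core, with each optional module contributing its own implementation. The sketch below is only an assumption about how that could look: `MarkupParser`, `StrictXmlParser`, `PARSERS`, and `parseWith` are hypothetical names, not existing Flying Saucer API, and only a JDK-based strict XML parser is registered here. A jsoup or htmlunit-neko module would register its own implementation under its own name.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: none of these types exist in Flying Saucer.
public class PluggableParserSketch {

    /** Hypothetical extension point: turns markup into a W3C DOM document. */
    interface MarkupParser {
        Document parse(String markup);
    }

    /** Strict XML implementation that needs nothing beyond the JDK. */
    static final class StrictXmlParser implements MarkupParser {
        @Override
        public Document parse(String markup) {
            try {
                return DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder()
                        .parse(new InputSource(new StringReader(markup)));
            } catch (Exception e) {
                throw new IllegalArgumentException("Markup is not well-formed XML", e);
            }
        }
    }

    /** Registry keyed by name; optional modules would add entries such as "jsoup". */
    static final Map<String, MarkupParser> PARSERS = new LinkedHashMap<>();

    static {
        PARSERS.put("xml", new StrictXmlParser());
    }

    /** Looks up a registered parser by name and uses it to build the DOM. */
    static Document parseWith(String parserName, String markup) {
        MarkupParser parser = PARSERS.get(parserName);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered: " + parserName);
        }
        return parser.parse(markup);
    }

    public static void main(String[] args) {
        Document doc = parseWith("xml", "<html><body><p>Hello</p></body></html>");
        System.out.println(doc.getDocumentElement().getTagName()); // prints "html"
    }
}
```

With this shape the core keeps no compile-time dependency on jsoup or htmlunit-neko, and users who are happy with the existing XML path are unaffected.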

rbri commented 5 months ago

> How about a separate module for htmlunit-neko as well, done in a similar way? As pointed out by @rbri in https://github.com/flyingsaucerproject/flyingsaucer/issues/282#issuecomment-2133685493, htmlunit-neko also seems to be a quite capable HTML parser. So I am thinking it could be useful to support both the jsoup and htmlunit-neko parsers, but I'm not sure how at this time.

Great idea - will try to support this