ColdHeat / pybluemonday

pybluemonday is a library for sanitizing HTML very quickly via bluemonday.
BSD 3-Clause "New" or "Revised" License

Consider switching from lxml's clean_html for enhanced security (and possibly performance) #44

Closed frenzymadness closed 1 year ago

frenzymadness commented 1 year ago

I'd like to bring to your attention that we are discussing the possibility of removing the clean_html functionality from the lxml library. Over the past years, several concerning security vulnerabilities have been discovered in it: CVE-2021-43818, CVE-2021-28957, CVE-2020-27783, CVE-2018-19787 and CVE-2014-3146.

The main problem is the design. Because lxml's clean_html functionality is based on a blocklist, it is hard to keep it up to date with every new possibility in HTML and JS.
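To illustrate the design problem, here is a minimal sketch contrasting a blocklist filter with an allowlist filter. It uses only the standard library's `html.parser`. The tag sets and the names `TagFilter`, `blocklist_clean` and `allowlist_clean` are hypothetical and chosen for illustration. This is not how lxml, bleach or nh3 are implemented, and it is not a usable sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical tag sets, for illustration only.
BLOCKED_TAGS = {"script", "iframe"}
ALLOWED_TAGS = {"p", "b", "i", "em", "strong"}


class TagFilter(HTMLParser):
    """Re-emit only the tags accepted by keep_tag, dropping all attributes."""

    def __init__(self, keep_tag):
        super().__init__(convert_charrefs=True)
        self.keep_tag = keep_tag
        self.out = []

    def handle_starttag(self, tag, attrs):
        if self.keep_tag(tag):
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if self.keep_tag(tag):
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text content is always kept, but re-escaped.
        self.out.append(escape(data))


def _filter(html, keep_tag):
    parser = TagFilter(keep_tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def blocklist_clean(html):
    # Anything not explicitly blocked passes through.
    return _filter(html, lambda tag: tag not in BLOCKED_TAGS)


def allowlist_clean(html):
    # Anything not explicitly allowed is dropped.
    return _filter(html, lambda tag: tag in ALLOWED_TAGS)


sample = "<p>hi</p><newtag>x</newtag>"
blocked_result = blocklist_clean(sample)
allowed_result = allowlist_clean(sample)
```

In this sketch, a tag the blocklist author never anticipated (`newtag`) survives `blocklist_clean` untouched, while `allowlist_clean` drops it by default. That is the maintenance burden described above: a blocklist has to be updated whenever HTML or JS gains a new capability.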

Two viable alternatives worth considering are bleach and nh3. Here's why:

bleach:

nh3:

We'll probably move the cleaning part of lxml to a separate project first, so it will still be possible to use it, but it is better to find a suitable alternative sooner rather than later.

Let me know if we can help you with this transition in any way, and have a nice day.

ColdHeat commented 1 year ago

Hello, this repo is an HTML sanitization library similar to nh3, but based on Go (via https://github.com/microcosm-cc/bluemonday) instead of Rust. You may want to consider it as another alternative to suggest. pybluemonday has more features, but its interface is not as simple as nh3's.

I wasn't aware of nh3 before this post so thank you for bringing it to my attention.

lxml is only used for benchmarks here, so we are fine with pinning to an old version and using that.

FWIW I think removing this functionality from lxml is a good idea.

frenzymadness commented 1 year ago

Thank you for the clarification and very quick response.

frenzymadness commented 5 months ago

Just an update on this. The latest version of lxml (5.2.0) no longer contains the HTML cleaner. Its code is now available as a dedicated project on GitHub and PyPI.

If you want to continue using it, you can either depend on lxml[html_clean] or on lxml_html_clean directly. lxml contains a backward-compatible import, so there is nothing you need to change other than the dependency.
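As a rough sketch of what that means for existing code: the old import path stays the same, and only the installed dependency changes. The `try`/`except` fallback below is an assumption added for illustration, so that a benchmark script could skip the lxml comparison when the cleaner is not installed. It is not something lxml requires.

```python
# Dependency, one of the two options named above:
#   pip install "lxml[html_clean]"
#   pip install lxml_html_clean

try:
    # Same import path as before the split.
    from lxml.html.clean import clean_html
except ImportError:
    # Cleaner not installed: let callers skip it instead of crashing.
    clean_html = None

if clean_html is not None:
    cleaned = clean_html("<p>Hello <b>world</b></p>")
```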