divio / aldryn-search

Haystack 2.0 search index for django CMS

Consider switching from lxml's clean_html for enhanced security (and possibly performance) #115

Open frenzymadness opened 10 months ago

frenzymadness commented 10 months ago

I'd like to bring to your attention that we are discussing the possibility of removing the clean_html functionality from the lxml library. Over the past years, several security vulnerabilities have been discovered in it: CVE-2021-43818, CVE-2021-28957, CVE-2020-27783, CVE-2018-19787 and CVE-2014-3146.

The main problem is the design. Because lxml's clean_html functionality is based on a blocklist, it is hard to keep it up to date with every new feature of HTML and JS.
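To illustrate the design point, here is a minimal sketch using only the Python standard library. It is not how lxml, bleach or nh3 work internally, and it is not a sanitizer. The tag sets, the `newtag` element and the helper names are hypothetical and chosen only for the demonstration. It shows that a tag the list author has never heard of passes a blocklist but is rejected by an allowlist.

```python
from html.parser import HTMLParser

# Hypothetical lists, for illustration only.
BLOCKED_TAGS = {"script", "iframe"}
ALLOWED_TAGS = {"p", "b", "i", "a"}

# "newtag" stands in for any element added to HTML after the lists were written.
SAMPLE = '<p>hi</p><newtag onfoo="x">y</newtag>'


class TagCollector(HTMLParser):
    """Collect the names of all start tags seen in the input."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def collect_tags(html):
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


def tags_passing_blocklist(html):
    # Everything is accepted unless it is explicitly blocked.
    return [t for t in collect_tags(html) if t not in BLOCKED_TAGS]


def tags_passing_allowlist(html):
    # Everything is rejected unless it is explicitly allowed.
    return [t for t in collect_tags(html) if t in ALLOWED_TAGS]


if __name__ == "__main__":
    print("blocklist lets through:", tags_passing_blocklist(SAMPLE))
    print("allowlist lets through:", tags_passing_allowlist(SAMPLE))
```

With the blocklist, the unknown element survives until someone adds it to the list. With the allowlist, it is dropped by default.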

Two viable alternatives worth considering are bleach and nh3. Here's why:

bleach:

nh3:

We'll probably move the cleaning part of lxml to a separate project first, so it will still be possible to use it, but it is better to find a suitable alternative sooner rather than later.

Let me know if we can help you with this transition in any way. Have a nice day.

frenzymadness commented 3 months ago

Just an update on this. The latest version of lxml (5.2.0) no longer contains the HTML cleaner. Its code is now available as a dedicated project on GitHub and PyPI.

If you want to continue using it, you can depend either on lxml[html_clean] or on lxml_html_clean directly. lxml contains a backward-compatible import, so the dependency is the only thing you need to change.
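For a project that wants to support both setups during the transition, a defensive import could look like the sketch below. The helper name `load_clean_html` is hypothetical and not part of lxml or aldryn-search. It tries the standalone `lxml_html_clean` package first, then the legacy `lxml.html.clean` path, and returns `None` if neither can be imported.

```python
import importlib

# Module paths to try, newest location first.
CANDIDATE_MODULES = ("lxml_html_clean", "lxml.html.clean")


def load_clean_html():
    """Return the first importable clean_html function, or None."""
    for name in CANDIDATE_MODULES:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Package not installed, or the cleaner is missing from this lxml.
            continue
        return getattr(module, "clean_html", None)
    return None


if __name__ == "__main__":
    clean_html = load_clean_html()
    if clean_html is None:
        print("No HTML cleaner available; install lxml[html_clean] or lxml_html_clean.")
    else:
        print(clean_html("<p>Hello<script>alert(1)</script></p>"))
```

If the dependency is simply switched to lxml[html_clean], the existing `from lxml.html.clean import ...` lines keep working and no helper like this is needed.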