adbar / htmldate

Fast and robust date extraction from web pages, with Python or on the command-line
https://htmldate.readthedocs.io
Apache License 2.0
117 stars 26 forks source link

Consider switching from lxml's clean_html for enhanced security (and possibly performance) #91

Closed frenzymadness closed 8 months ago

frenzymadness commented 10 months ago

I'd like to bring to your attention that we are discussing the possibility of removing lxml's clean_html functionality from lxml library. Over the past years, there have been several concerning security vulnerabilities discovered within the lxml library's clean_html functionality – CVE-2021-43818, CVE-2021-28957, CVE-2020-27783, CVE-2018-19787 and CVE-2014-3146.

The main problem is in the design. Because the lxml's clean_html functionality is based on a blocklist, it's hard to keep it up to date with all new possibilities in HTML and JS.

Two viable alternatives worth considering are bleach and nh3. Here's why:

bleach:

nh3:

We'll probably move the cleaning part of the lxml to a distinct project first so it will still be possible to use it but better is to find a suitable alternative sooner rather than later.

Let me know if we can help you with this transition anyhow and have a nice day.

adbar commented 10 months ago

See https://github.com/adbar/trafilatura/issues/412