buriy / python-readability

fast python port of arc90's readability tool, updated to match latest readability.js!
https://github.com/buriy/python-readability
Apache License 2.0
2.65k stars, 348 forks

Consider switching from lxml's clean_html for enhanced security (and possibly performance) #179

Open frenzymadness opened 1 year ago

frenzymadness commented 1 year ago

I'd like to bring to your attention that we are discussing the possibility of removing the clean_html functionality from the lxml library. Over the past years, several concerning security vulnerabilities have been discovered in lxml's clean_html functionality: CVE-2021-43818, CVE-2021-28957, CVE-2020-27783, CVE-2018-19787 and CVE-2014-3146.

The main problem is the design. Because lxml's clean_html functionality is based on a blocklist, it is hard to keep it up to date with every new feature of HTML and JS.
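To make the blocklist concern concrete, here is a minimal sketch using only the standard library. It is a toy, not lxml's, bleach's, or nh3's actual implementation, and it must not be used for real sanitization. The tag sets and the `futuretag` element are hypothetical, chosen only to show how an element unknown to a blocklist passes through while an allowlist drops it.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical tag sets, for illustration only.
BLOCKED_TAGS = {"script", "iframe"}   # blocklist: anything not listed passes
ALLOWED_TAGS = {"p", "b", "i", "a"}   # allowlist: anything not listed is dropped


class TagFilter(HTMLParser):
    """Re-emit markup, keeping only tags accepted by `keep_tag`."""

    def __init__(self, keep_tag):
        super().__init__(convert_charrefs=True)
        self.keep_tag = keep_tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.keep_tag(tag):
            # Attributes are dropped entirely to keep the toy small.
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if self.keep_tag(tag):
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text content is always kept, but re-escaped.
        self.parts.append(escape(data))


def filter_html(markup, keep_tag):
    parser = TagFilter(keep_tag)
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


def blocklist_filter(markup):
    return filter_html(markup, lambda tag: tag not in BLOCKED_TAGS)


def allowlist_filter(markup):
    return filter_html(markup, lambda tag: tag in ALLOWED_TAGS)


# "futuretag" stands in for an element added to HTML after the lists were written.
sample = "<p>hello</p><futuretag>x</futuretag>"

if __name__ == "__main__":
    print(blocklist_filter(sample))  # the unknown tag survives
    print(allowlist_filter(sample))  # the unknown tag is removed, its text kept
```

The blocklist has to be updated every time a new risky element or attribute appears, whereas the allowlist stays safe by default and only needs updating when something harmless is missing.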

Two viable alternatives worth considering are bleach and nh3. Here's why:

bleach:

nh3:

We'll probably move the cleaning part of lxml to a separate project first, so it will still be possible to use it, but it is better to find a suitable alternative sooner rather than later.

Let me know if we can help you with this transition in any way, and have a nice day.

frenzymadness commented 5 months ago

Just an update on this. The latest version of lxml (5.2.0) no longer contains the HTML cleaner. Its code is now available as a dedicated project on GitHub and PyPI.

If you want to continue using it, you can depend either on lxml[html_clean] or on lxml_html_clean directly. lxml contains a backward-compatible import, so the dependency is the only thing you need to change.
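For code that wants to cope with either layout, one possible approach is to try the standalone package first and fall back to the historical lxml path. This is a sketch, not what readability does; `load_cleaner` is a hypothetical helper name, and returning `None` when neither module is importable is an assumption made so the snippet runs anywhere.

```python
import importlib


def load_cleaner():
    """Return the Cleaner class from whichever module provides it, else None.

    `load_cleaner` is a hypothetical helper, not part of lxml or readability.
    """
    # Prefer the standalone project; fall back to the old lxml location,
    # which still works on lxml < 5.2.0 or when lxml_html_clean is installed.
    for module_name in ("lxml_html_clean", "lxml.html.clean"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue
        return module.Cleaner
    return None


if __name__ == "__main__":
    cleaner_cls = load_cleaner()
    if cleaner_cls is None:
        print("No HTML cleaner found; install lxml[html_clean] or lxml_html_clean.")
    else:
        print(f"Using Cleaner from {cleaner_cls.__module__}")
```

Because lxml keeps the backward-compatible import, a plain `from lxml.html.clean import Cleaner` is enough once the dependency is declared; the fallback above only matters when the installed packages are not under your control.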

buriy commented 5 months ago

Thanks, we use bleach already and we'll move to bleach then.

clintgibler commented 5 months ago

Hey! Thanks for this awesome project 🙏

Quick question, I'm trying to use readability and I'm getting this error:

from readability import Document
  File ".../my_file.py", line 3, in <module>
    from readability import Document
  File ".../python3.8/site-packages/readability/__init__.py", line 3, in <module>
    from .readability import Document
  File ".../python3.8/site-packages/readability/readability.py", line 11, in <module>
    from .cleaners import clean_attributes
  File ".../python3.8/site-packages/readability/cleaners.py", line 3, in <module>
    from lxml.html.clean import Cleaner
  File ".../python3.8/site-packages/lxml/html/clean.py", line 18, in <module>
    raise ImportError(
ImportError: lxml.html.clean module is now a separate project lxml_html_clean.
Install lxml[html_clean] or lxml_html_clean directly.

Is this my fault / something I can fix on my end?

Apologies if this is an obvious issue.

frenzymadness commented 5 months ago

Hi @clintgibler. The fix for this should be implemented in the readability project, but in the meantime you can work around the problem by following the advice in the error message: install lxml[html_clean] or lxml_html_clean (those should be equivalent for you). An alternative is to install an older version of lxml (lxml<5.2.0 should work, IIRC).
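If it is unclear whether an environment is affected, a small check along these lines may help. It is only a sketch: the function names are hypothetical, and the 5.2 cutoff is taken from the comments in this thread rather than verified against every lxml release.

```python
import re
from importlib import metadata, util
from typing import Optional


def installed_lxml_version() -> Optional[str]:
    """Return the installed lxml version string, or None if lxml is absent."""
    try:
        return metadata.version("lxml")
    except metadata.PackageNotFoundError:
        return None


def cleaner_is_missing(lxml_version: str, has_html_clean: bool) -> bool:
    """True when lxml is 5.2 or newer and lxml_html_clean is not importable."""
    # Take the first two numeric components, padding short versions with zeros.
    numbers = (re.findall(r"\d+", lxml_version) + ["0", "0"])[:2]
    major, minor = (int(n) for n in numbers)
    return (major, minor) >= (5, 2) and not has_html_clean


if __name__ == "__main__":
    version = installed_lxml_version()
    has_html_clean = util.find_spec("lxml_html_clean") is not None
    if version is None:
        print("lxml is not installed.")
    elif cleaner_is_missing(version, has_html_clean):
        print("Affected: install lxml[html_clean] or lxml_html_clean, "
              "or pin lxml<5.2.0.")
    else:
        print(f"lxml {version}: the HTML cleaner should be importable.")
```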

clintgibler commented 5 months ago

Thanks so much for the quick response @frenzymadness 🙏 . Yes, pip install lxml[html_clean] worked.