buriy / python-readability

fast python port of arc90's readability tool, updated to match latest readability.js!
https://github.com/buriy/python-readability
Apache License 2.0
2.66k stars, 348 forks

FutureWarning: Use specific 'len(elem)' or 'elem is not None' test instead. #103

Open (opened by web64 6 years ago)

web64 commented 6 years ago

Hi,

I'm getting this warning:

readability/htmls.py:117: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.

I'm running Python 3.5.2

Cheers!

noembryo commented 5 years ago

Same here. Any news on this? What do we have to correct?

clach04 commented 1 year ago

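It appears to be this statement:

doc.body or doc

The 'or' truth-tests the lxml element doc.body, which is what the warning asks callers to stop doing.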
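
A minimal sketch of the replacement the warning suggests: an explicit "is not None" check instead of a truth test. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra packages (lxml elements expose the same find() method). The helper name body_or_doc is hypothetical and not part of python-readability.

```python
import xml.etree.ElementTree as ET


def body_or_doc(doc):
    """Return the <body> child of doc, or doc itself when there is none.

    Hypothetical helper for illustration; not part of python-readability.
    """
    body = doc.find("body")
    # Explicit None check: the element is never truth-tested, so no warning
    # is raised, and an empty <body> (no children) is still selected.
    return body if body is not None else doc


with_body = ET.fromstring("<html><body></body></html>")
without_body = ET.fromstring("<div><p>text</p></div>")

print(body_or_doc(with_body).tag)     # body
print(body_or_doc(without_body).tag)  # div
```

In get_body() the equivalent change would presumably be to bind doc.body to a name first and pass "body if body is not None else doc" to tostring(). That has not been tested against the library here.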

clach04 commented 1 year ago

I was actually getting bad results, not just warnings: a string containing the repr of a byte buffer. Simple sample code did not show this, only a real web page. It is unclear whether this is related (it might warrant a new issue).

I ended up monkey-patching in a hack. I still got the warning, but at least it worked:

from lxml.etree import tostring
import readability
from readability import Document  # https://github.com/buriy/python-readability/   pip install readability-lxml

## monkey patch

def get_body(doc):
    for elem in doc.xpath(".//script | .//link | .//style"):
        elem.drop_tree()
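    # FIXME: isn't it better to use tounicode?
    print('MY DEBUG')  # debug marker: confirms the patched get_body() is running
    # Earlier attempts, kept for reference:
    #raw_html = str_(tostring(doc.body or doc))
    #raw_html = tostring(doc.body or doc)
    # Serialize to UTF-8 bytes, then decode explicitly so raw_html is a real str.
    # The 'doc.body or doc' truth test is unchanged, which is presumably why the warning remains.
    raw_html = tostring(doc.body or doc, encoding='utf-8').decode('utf-8')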
    cleaned = readability.cleaners.clean_attributes(raw_html)
    try:
        # BeautifulSoup(cleaned) #FIXME do we really need to try loading it?
        return cleaned
    except Exception:  # FIXME find the equivalent lxml error
        # logging.error("cleansing broke html content: %s\n---------\n%s" % (raw_html, cleaned))
        return raw_html

def content(self):
    """Returns document body"""
    print('MY DEBUG')  # debug marker: confirms the patched content() is in use
    return get_body(self._html(True))

Document.content = content
## monkey patch
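
The "string containing a repr of a byte buffer" symptom is what Python 3 produces when bytes are passed through str() instead of being decoded, which is one way such output can arise. The sketch below shows the difference using the standard library's xml.etree.ElementTree as a stand-in for lxml, so it runs without extra packages. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring("<p>hi</p>")

# tostring() returns bytes unless told otherwise.
as_bytes = ET.tostring(elem)

# str() on bytes gives the repr of the bytes object, not the markup itself.
as_repr = str(as_bytes)

# Asking for "unicode" yields a real str with no separate decode step.
as_text = ET.tostring(elem, encoding="unicode")

# Explicit encode-then-decode, mirroring the approach in the patch above.
decoded = ET.tostring(elem, encoding="utf-8").decode("utf-8")

print(as_repr)  # b'<p>hi</p>'
print(as_text)  # <p>hi</p>
```

lxml's tostring() also accepts encoding='unicode', so the explicit .decode('utf-8') in the patch could likely be dropped in favor of that argument.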

Mustafahubs commented 1 year ago

I was using one line to validate the response of a tag