web64 opened 6 years ago
Same here. Any news on that? What do we have to correct?
It appears to be this statement:
doc.body or doc
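The warning text itself is not quoted in this thread, so the following is only a sketch under an assumption: that the warning is the lxml FutureWarning raised when an element is truth-tested, which is what `doc.body or doc` does. The helper name `first_not_none` is hypothetical, and the stdlib `xml.etree.ElementTree` is used here only as a stand-in for lxml elements so the snippet runs without extra packages.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in; the thread itself uses lxml


def first_not_none(*candidates):
    """Return the first candidate that is not None, without truth-testing it."""
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return None


# A <body> with no child elements has len() == 0, which is why `body or doc`
# is ambiguous: a length-based truth test can fall through to doc even
# though a body element exists.
empty_body = ET.Element("body")
doc = ET.Element("html")

chosen = first_not_none(empty_body, doc)   # picks the body, even though it is empty
fallback = first_not_none(None, doc)       # falls back to doc only when body is missing
```

Applied to the statement above, the same idea would read `doc.body if doc.body is not None else doc`, which never asks the element for its truth value.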
I was actually getting bad results, not just warnings: a string containing the repr of a byte buffer. Simple sample code did not show this; it only happened with a real web page. It is unclear whether this is related (it might warrant a new issue).
I ended up monkey patching in a hack. I still got the warning, but at least it worked:
from lxml.etree import tostring
import readability
from readability import Document  # https://github.com/buriy/python-readability/ pip install readability-lxml

## monkey patch
def get_body(doc):
    for elem in doc.xpath(".//script | .//link | .//style"):
        elem.drop_tree()
    # tostring() always return utf-8 encoded string
    # FIXME: isn't better to use tounicode?
    print('MY DEBUG')
    #raw_html = str_(tostring(doc.body or doc))
    #raw_html = tostring(doc.body or doc)
    raw_html = tostring(doc.body or doc, encoding='utf-8').decode('utf-8')
    #import pdb ; pdb.set_trace()
    #raw_html = doc.body or doc
    cleaned = readability.cleaners.clean_attributes(raw_html)
    try:
        # BeautifulSoup(cleaned) #FIXME do we really need to try loading it?
        return cleaned
    except Exception:  # FIXME find the equivalent lxml error
        # logging.error("cleansing broke html content: %s\n---------\n%s" % (raw_html, cleaned))
        return raw_html

def content(self):
    """Returns document body"""
    #return get_body(self._html(True))
    print('MY DEBUG')
    return get_body(self._html(True))

Document.content = content
## monkey patch
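The patch above decodes the serialized bytes explicitly instead of wrapping them in a string conversion. A minimal sketch of why that can matter in Python 3, assuming the bad result came from calling str() on a bytes object (the thread does not confirm this); the sample markup is made up for illustration:

```python
original = "<p>caf\u00e9</p>"
raw = original.encode("utf-8")  # stands in for the bytes a serializer returns

as_repr = str(raw)              # the repr of the bytes object, starting with b'
as_text = raw.decode("utf-8")   # the actual markup as text
```

Here `as_repr` is the kind of "string containing a repr of a byte buffer" described earlier, while `as_text` round-trips back to the original markup.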
I was using one line to validate the response of a tag
Hi,
I'm getting this warning:
I'm running Python 3.5.2
Cheers!