Open lopuhin opened 7 years ago
not sure if this is the same issue, but I'm getting:
```
ERROR:scrapy.core.scraper:Spider error processing <GET http://www.magnetoinvestigators.com/contact-us> (referer: http://www.magnetoinvestigators.com)
Traceback (most recent call last):
  File "/home/johannes/.virtualenvs/broadcrawl/lib/python3.5/site-packages/html_text/html_text.py", line 77, in cleaned_selector
    tree = _cleaned_html_tree(html)
  File "/home/johannes/.virtualenvs/broadcrawl/lib/python3.5/site-packages/html_text/html_text.py", line 33, in _cleaned_html_tree
    tree = lxml.html.fromstring(html.encode('utf8'), parser=parser)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udb9e' in position 785: surrogates not allowed
```
Apparently this is a new strictness introduced by Python 3. Possibly using the surrogateescape flag in encode could help...?
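As a quick sanity check of that suggestion (a sketch, not part of the library), the failure is easy to reproduce with a lone surrogate like the `'\udb9e'` in the traceback. One caveat worth noting: on encode, `surrogateescape` only round-trips U+DC80..U+DCFF (PEP 383), so it would not help with this particular character, while `replace` or `surrogatepass` do produce bytes:

```python
# Minimal reproduction of the UnicodeEncodeError with an unpaired
# surrogate like the '\udb9e' from the traceback.
s = 'x\udb9e'

try:
    s.encode('utf-8')
except UnicodeEncodeError as e:
    print(e.reason)  # surrogates not allowed

# 'surrogateescape' only covers U+DC80..U+DCFF (PEP 383), so it
# raises for '\udb9e' as well:
try:
    s.encode('utf-8', errors='surrogateescape')
except UnicodeEncodeError:
    print('surrogateescape does not cover this character')

# 'replace' and 'surrogatepass' both succeed:
print(s.encode('utf-8', errors='replace'))        # b'x?'
print(s.encode('utf-8', errors='surrogatepass'))  # b'x\xed\xae\x9e'
```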
Also see:
Thanks for the report @codinguncut! For now you can work around this issue by parsing the document yourself and passing an lxml.html.HtmlElement into html_text.extract_text.
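A minimal sketch of that workaround (the sample bytes are a hypothetical stand-in for `response.body` in a real spider; assumes lxml is available):

```python
# Sketch of the suggested workaround: let lxml parse the raw response
# bytes itself, so the badly-decoded unicode body from Scrapy never
# needs to be re-encoded to utf-8. The bytes below are a stand-in for
# response.body.
import lxml.html

raw = b'<html><body><p>Hello, world</p></body></html>'
tree = lxml.html.fromstring(raw)  # an lxml.html.HtmlElement

# With html-text installed, pass the tree straight in:
#   import html_text
#   text = html_text.extract_text(tree)

# For illustration here, lxml's own plain-text extraction:
print(tree.text_content())  # Hello, world
```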
The issue is that Scrapy used the Content-Type header to get the encoding ('utf-7'), while the site in fact seems to return utf-8. Scrapy then decodes the body using errors='replace' (w3lib_replace to be precise, see https://github.com/scrapy/w3lib/blob/34435d085c6adb14c94cd0188c23f6dc7d4da0f7/w3lib/encoding.py#L174), and this produces output which can't be encoded back to utf-8.
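To illustrate the mechanism with a constructed example (not the site's actual bytes): Python's utf-7 decoder reads base64 sections as raw UTF-16 code units, so a lone surrogate such as `'\udb9e'` can come out of a "successful" decode, and the resulting string then fails exactly as in the traceback when html_text calls `html.encode('utf8')`:

```python
# Constructed example (not the actual site content): utf-7 base64
# sections carry raw UTF-16 code units, so decoding can emit a lone
# surrogate without raising...
body = b'+254-'.decode('utf-7')
print(repr(body))  # '\udb9e'

# ...which then blows up when re-encoded to utf-8:
try:
    body.encode('utf-8')
except UnicodeEncodeError as e:
    print(e.reason)  # surrogates not allowed
```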
I think the right place to fix it is probably w3lib. html-text can provide extra robustness by using surrogateescape, but it would be better to get a proper unicode body before passing it to html_text.
FTR, response.css / response.xpath also don't work for this website.
See discussion in https://github.com/TeamHG-Memex/html-text/pull/2#issuecomment-304737274