TeamHG-Memex / html-text

Extract text from HTML
MIT License

Handle non-breaking spaces and other special unicode characters #6

Open lopuhin opened 7 years ago

lopuhin commented 7 years ago

See discussion in https://github.com/TeamHG-Memex/html-text/pull/2#issuecomment-304737274

codinguncut commented 7 years ago

Not sure if this is the same issue, but I'm getting:

ERROR:scrapy.core.scraper:Spider error processing <GET http://www.magnetoinvestigators.com/contact-us> (referer: http://www.magnetoinvestigators.com)
Traceback (most recent call last):
  File "/home/johannes/.virtualenvs/broadcrawl/lib/python3.5/site-packages/html_text/html_text.py", line 77, in cleaned_selector
    tree = _cleaned_html_tree(html)
  File "/home/johannes/.virtualenvs/broadcrawl/lib/python3.5/site-packages/html_text/html_text.py", line 33, in _cleaned_html_tree
    tree = lxml.html.fromstring(html.encode('utf8'), parser=parser)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udb9e' in position 785: surrogates not allowed

Apparently this is new strictness introduced by Python 3. Possibly passing the surrogateescape error handler to encode could help...?
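For reference, a minimal stdlib-only sketch of how the different error handlers treat the character from the traceback above. The sample string and the helper name `try_encode` are made up for illustration. As far as I can tell, surrogateescape only round-trips code points in the range U+DC80 to U+DCFF, so it would probably not help with '\udb9e' by itself, while replace and surrogatepass do get through:

```python
from typing import Optional

# Hypothetical sample text containing the same lone surrogate as in the traceback.
text = 'contact\udb9eus'


def try_encode(errors: str) -> Optional[bytes]:
    """Encode `text` as UTF-8 with the given error handler; None if it raises."""
    try:
        return text.encode('utf-8', errors=errors)
    except UnicodeEncodeError:
        return None


handlers = ('strict', 'surrogateescape', 'replace', 'surrogatepass')
results = {name: try_encode(name) for name in handlers}

for name, encoded in results.items():
    print(name, '->', encoded)
```

Note that surrogatepass produces bytes that are not valid UTF-8, so a strict decoder downstream may still reject them; replace loses the character but yields clean output.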

Also see:

lopuhin commented 7 years ago

Thanks for the report, @codinguncut! For now you can work around this issue by parsing the document yourself and passing an lxml.html.HtmlElement into html_text.extract_text.
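A rough sketch of that workaround, assuming you have the raw response bytes at hand (for example response.body in a Scrapy callback). The function name `extract_text_from_bytes` and the default of UTF-8 are assumptions for illustration, not part of html-text. Parsing from bytes lets lxml do the decoding, so no str containing surrogates is ever encoded:

```python
def extract_text_from_bytes(body: bytes, encoding: str = 'utf-8') -> str:
    """Parse raw HTML bytes with lxml and pass the resulting tree to html_text."""
    # Third-party imports are kept local so that defining this helper
    # does not by itself require lxml or html_text to be installed.
    import lxml.html
    import html_text

    # Tell lxml which encoding to assume instead of trusting the HTTP header.
    parser = lxml.html.HTMLParser(encoding=encoding)
    tree = lxml.html.fromstring(body, parser=parser)
    # extract_text is given an lxml.html.HtmlElement rather than a string.
    return html_text.extract_text(tree)
```

In a spider this would be called as extract_text_from_bytes(response.body), bypassing the text that Scrapy decoded with the wrong encoding.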

kmike commented 7 years ago

The issue is that Scrapy used the Content-Type header to get the encoding ('utf-7'), while the site in fact seems to return utf-8. Scrapy then decodes the body using errors='replace' (w3lib_replace to be precise, see https://github.com/scrapy/w3lib/blob/34435d085c6adb14c94cd0188c23f6dc7d4da0f7/w3lib/encoding.py#L174), and this produces output which can't be encoded back to utf-8 for some reason.

I think the right place to fix it is probably w3lib. html-text can provide extra robustness by using surrogateescape, but it would be better to get a proper unicode body before passing it to html_text.
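One possible way to get a proper unicode body on the caller's side, sketched with the stdlib only. This is a heuristic and not what w3lib or Scrapy actually do; `decode_body` and its `declared` parameter are hypothetical names. The idea is to trust the declared encoding only if the decoded text can be encoded back to UTF-8, and otherwise fall back to decoding the bytes as UTF-8:

```python
def decode_body(body: bytes, declared: str) -> str:
    """Decode `body` with the declared encoding, falling back to UTF-8.

    The declared encoding is rejected if it is unknown, if decoding fails,
    or if the decoded text contains lone surrogates (detected by a strict
    UTF-8 encode, which raises on them).
    """
    try:
        text = body.decode(declared)
        text.encode('utf-8')  # raises UnicodeEncodeError on lone surrogates
        return text
    except (UnicodeError, LookupError):
        # Assumption: UTF-8 is the most likely real encoding; undecodable
        # bytes become U+FFFD instead of raising.
        return body.decode('utf-8', errors='replace')


# Example: the header claims ASCII, but the bytes are really UTF-8.
print(decode_body('héllo'.encode('utf-8'), 'ascii'))
```

The returned string is always safe to encode as UTF-8, so it can be passed to html_text without hitting the UnicodeEncodeError above.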

kmike commented 7 years ago

FTR, response.css / response.xpath also don't work for this website.