While trying libextract out, I noticed an issue when parsing some UTF-8 files (e.g., this one):
>>> from libextract.api import extract
>>> import codecs
>>> data = codecs.open('load_with_utf8.html', encoding='utf-8').read()
>>> obj = extract(data)
<type 'str'>: (<type 'exceptions.UnicodeEncodeError'>, UnicodeEncodeError('ascii', u'Error reading file \'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n\n<html xmlns="http://www.w3.org/1999/xhtml">
.... [several lines of text] ...
failed to load external entity "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n\n<', 15051, 15052, 'ordinal not in range(128)'))
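On investigation, the cause seems to be that lxml.html.parse (as called in libextract.core.parse_html) raises a UnicodeEncodeError when it is given certain HTML content as a unicode string (perhaps malformed HTML?). Judging by the "Error reading file" and "failed to load external entity" parts of the message, lxml appears to treat the string as a file name rather than as markup.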
The fix, as suggested in this StackOverflow answer, is to use lxml.etree.fromstring instead. I've added it to the parse_html function, falling back to the previous method (lxml.html.parse) if needed.
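For readers who want the shape of the change without opening the diff, here is a rough sketch of the "try the string parser first, then fall back" idea. It is not the exact code in this pull request: the names parse_with_fallback, from_string, from_source and recoverable are made up for illustration, passing etree.HTMLParser() to fromstring is an assumption, and so is the choice of which exceptions trigger the fallback.

```python
def parse_with_fallback(content, primary, fallback,
                        recoverable=(ValueError, SyntaxError)):
    """Return primary(content); on a recoverable error, return fallback(content).

    Assumption: ValueError and SyntaxError cover the failures worth retrying.
    Any other exception is left to propagate.
    """
    try:
        return primary(content)
    except recoverable:
        return fallback(content)


def parse_html(content):
    """Hypothetical stand-in for libextract.core.parse_html."""
    # lxml is imported here so the helper above stays usable without it.
    from lxml import etree, html

    def from_string(text):
        # Parse markup that is already in memory (assumed: HTML parser).
        return etree.fromstring(text, parser=etree.HTMLParser())

    def from_source(source):
        # Previous behaviour: html.parse expects a filename, URL or file object.
        return html.parse(source).getroot()

    return parse_with_fallback(content, from_string, from_source)
```

With lxml installed, parse_html(data) would be called with the same data string as in the session above. Keeping the fallback separate from the lxml calls makes the ordering easy to check with stub parsers.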
Solution:
Accept this pull request :smiley: