Added supplemental method for creating lxml HTML element

While trying out libextract, I noticed an issue when parsing some UTF-8 files (e.g., this one):

>>> from libextract.api import extract
>>> import codecs
>>> data = codecs.open('load_with_utf8.html', encoding='utf-8').read()
>>> obj = extract(data)

<type 'str'>: (<type 'exceptions.UnicodeEncodeError'>, UnicodeEncodeError('ascii', u'Error reading file \'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n\n<html xmlns="http://www.w3.org/1999/xhtml">
.... [several lines of text] ...
failed to load external entity "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n\n<', 15051, 15052, 'ordinal not in range(128)'))

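Upon investigation, this seems to happen when lxml.html.parse (as called in libextract.core.parse_html) is invoked with certain HTML content passed as a unicode string (perhaps malformed HTML?): a UnicodeEncodeError is raised. The "Error reading file" and "failed to load external entity" parts of the message suggest that lxml is treating the string as a file name rather than as markup.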
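To relate the traceback to plain Python behavior, here is a small sketch that uses only the standard library. The helper name `ascii_encode_error` is made up for illustration and is not part of libextract or lxml. It shows that encoding non-ASCII text with the ascii codec produces the same kind of tuple seen above (codec name, start and end offsets, and the reason string):

```python
def ascii_encode_error(text):
    """Return (encoding, start, end, reason) if text cannot be encoded
    as ASCII, or None if it encodes cleanly.

    Hypothetical helper, for illustration only.
    """
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        # Same fields that appear in the traceback:
        # ('ascii', <text>, start, end, 'ordinal not in range(128)')
        return exc.encoding, exc.start, exc.end, exc.reason
    return None


# The offsets point at the first character outside the ASCII range.
print(ascii_encode_error(u"caf\u00e9"))
print(ascii_encode_error(u"cafe"))
```

In the traceback above, the offsets 15051 and 15052 would then mark the first non-ASCII character in the text that was being encoded.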

The solution, as suggested in this Stack Overflow answer, is to use lxml.etree.fromstring instead, so I've added it to the parse_html function, falling back to the previous method if needed.
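A minimal sketch of the idea follows. It is not the code in this pull request: the function name `parse_html_sketch` and its single-argument signature are assumptions made for illustration, and the real parse_html in libextract.core may differ. In-memory strings go through lxml.etree.fromstring with an HTML parser, while anything else (such as an open file) still goes through lxml.html.parse:

```python
import io

from lxml import etree, html


def parse_html_sketch(source):
    """Return the root element for HTML given as a string or a file object.

    Hypothetical stand-in for libextract.core.parse_html.
    """
    if isinstance(source, (bytes, type(u""))):
        # Markup held in memory: parse it directly instead of letting
        # lxml.html.parse interpret the string as a file name or URL.
        return etree.fromstring(source, etree.HTMLParser())
    # Previous behavior: let lxml.html.parse read from the file-like object.
    return html.parse(source).getroot()


# Non-ASCII markup passed as a unicode string.
root = parse_html_sketch(u"<html><body><p>caf\u00e9</p></body></html>")
print(root.tag)

# File-like input still takes the original code path.
root = parse_html_sketch(io.BytesIO(b"<html><body><p>hi</p></body></html>"))
print(root.findtext(".//p"))
```

Dispatching on the input type rather than catching an exception keeps the two code paths explicit; a try/except fallback would be an equally reasonable design.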

Steps to reproduce the issue:

Download this HTML file.
Execute the code above.

Solution:

Accept this pull request :smiley:

Added supplemental method for creating lxml HTML element #38