datalib / libextract

Extract data from websites using basic statistical magic
MIT License

Added supplemental method for creating lxml HTML element #38

Closed: soldni closed this pull request 2 years ago

soldni commented 8 years ago

While trying libextract out, I noticed an issue when parsing some UTF-8 files (e.g., this one):

>>> from libextract.api import extract
>>> import codecs
>>> data = codecs.open('load_with_utf8.html', encoding='utf-8').read()
>>> obj = extract(data)

<type 'str'>: (<type 'exceptions.UnicodeEncodeError'>, UnicodeEncodeError('ascii', u'Error reading file \'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n\n<html xmlns="http://www.w3.org/1999/xhtml">
.... [several lines of text] ...
failed to load external entity "<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n\n<', 15051, 15052, 'ordinal not in range(128)'))

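Upon investigation, this seems to happen because lxml.html.parse (as called in libextract.core.parse_html) raises a UnicodeEncodeError when it is given some kinds of HTML content as a unicode string (perhaps malformed HTML?). The "Error reading file" and "failed to load external entity" messages in the traceback suggest the markup itself is being treated as a file name.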

The solution, as suggested in this StackOverflow answer, is to use lxml.etree.fromstring instead. So I've added it to the parse_html function, falling back to the previous method if needed.
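
For readers who want to see the shape of such a fallback, here is a minimal sketch. It is not the code from this pull request: `parse_with_fallback`, `from_string`, and `from_file_object` are hypothetical names, this `parse_html` is only a stand-in for `libextract.core.parse_html`, and wrapping the markup in `BytesIO` for the `lxml.html.parse` path is an assumption made here so the string is not mistaken for a file name.

```python
from io import BytesIO


def parse_with_fallback(markup, primary, fallback, recoverable=(ValueError,)):
    """Return primary(markup); on a recoverable error, return fallback(markup)."""
    try:
        return primary(markup)
    except recoverable:
        return fallback(markup)


def parse_html(markup):
    """Hypothetical stand-in for libextract.core.parse_html (not the merged code)."""
    # Imported lazily so parse_with_fallback stays usable without lxml installed.
    from lxml import etree, html

    def from_string(text):
        # Parses markup held in memory; it is never treated as a path or URL.
        return etree.fromstring(text, parser=etree.HTMLParser())

    def from_file_object(text):
        # lxml.html.parse, as in the earlier approach, but handed a file-like
        # object (an assumption of this sketch) instead of the raw string.
        parser = html.HTMLParser(encoding="utf-8")
        return html.parse(BytesIO(text.encode("utf-8")), parser=parser).getroot()

    # XMLSyntaxError covers parse failures; ValueError covers unicode input
    # that carries its own encoding declaration.
    return parse_with_fallback(
        markup,
        from_string,
        from_file_object,
        recoverable=(ValueError, etree.XMLSyntaxError),
    )
```

Keeping the try/except in a small helper that takes the two parsers as arguments means the fallback order can be checked with stub functions, without needing lxml or a problematic HTML file at hand.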


Steps to reproduce the issue:

  1. Download this HTML file
  2. Execute the code above

Solution:

Accept this pull request :smiley:

rodricios commented 8 years ago

Hi @lucasoldaini, thank you for the pull request :) I'm looking into the issue; you will hear back from me soon!