DocNow / diffengine

track changes to the news, where news is anything with an RSS feed
MIT License

Empty URLs/documents in feed seem to crash diffengine #43

Open ryanfb opened 6 years ago

ryanfb commented 6 years ago

Last line in log: 2017-10-18 13:18:57,133 - root - INFO - checking https://gateway.itstgate.com/WebLink2/WebLink.aspx

Traceback:

Traceback (most recent call last):
  File "/usr/local/bin/diffengine", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 483, in main
    version = entry.get_latest()
  File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 156, in get_latest
    title = doc.title()
  File "/usr/local/lib/python3.6/site-packages/readability/readability.py", line 137, in title
    return get_title(self._html(True))
  File "/usr/local/lib/python3.6/site-packages/readability/readability.py", line 108, in _html
    self.html = self._parse(self.input)
  File "/usr/local/lib/python3.6/site-packages/readability/readability.py", line 117, in _parse
    doc, self.encoding = build_doc(input)
  File "/usr/local/lib/python3.6/site-packages/readability/htmls.py", line 21, in build_doc
    doc = lxml.html.document_fromstring(decoded_page.encode('utf-8', 'replace'), parser=utf8_parser)
  File "/usr/local/lib/python3.6/site-packages/lxml/html/__init__.py", line 765, in document_fromstring
    "Document is empty")
lxml.etree.ParserError: Document is empty
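
One possible workaround, sketched under assumptions rather than taken from the diffengine source: check for a blank response body before title extraction, and treat a parse failure as "no version" instead of letting it end the run. The names `is_blank` and `title_or_none` are hypothetical, and `extract_title` stands in for whatever call wraps `doc.title()` in `get_latest`.

```python
from typing import Callable, Optional


def is_blank(body: Optional[str]) -> bool:
    """Return True when a response body has no non-whitespace content."""
    return body is None or not body.strip()


def title_or_none(body: Optional[str],
                  extract_title: Callable[[str], str]) -> Optional[str]:
    """Return extract_title(body), or None for blank or unparseable bodies.

    extract_title is a hypothetical stand-in for the readability call
    that raised in the traceback above (doc.title()).
    """
    if is_blank(body):
        # An empty document is what makes lxml raise "Document is empty",
        # so skip the parser entirely in that case.
        return None
    try:
        return extract_title(body)
    except Exception:
        # Deliberately broad for this sketch; real code would more likely
        # catch lxml.etree.ParserError specifically.
        return None
```

With something like this in place, the caller could log and skip an entry whenever `title_or_none` returns `None`, so a single empty URL in a feed would not stop the remaining entries from being checked.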