dgw / sopel-xkcdb

Migrate away from lxml #3

Open dgw opened 8 years ago

dgw commented 8 years ago

With the merging of sopel-irc/sopel#923, none of sopel's core modules require lxml any more. Migrate away, preferably to xmltodict like the core code has, so this module continues to have as few unique dependencies as possible.

dgw commented 8 years ago

xmltodict.parse() doesn't like HTML, and sadly XKCDB does not provide any official API. Scraping the HTML is the only way to get the content at the moment. There is an XML feed (RSS) of the latest quotes, but that allows neither selecting a random quote from the entire DB nor selecting a specific quote.

Unless I can talk the maintainer(s) of XKCDB into providing a proper API, this might have to be a CANTFIX. I'll at least poke through the sopel code to see how it handles HTML parsing if it's used anywhere in the core code or module set, since lxml was completely dropped (sopel-irc/sopel@21bbd98e72eef4c5454211a7adb70c1c8e640845) from the install requirements.
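
For illustration, the reason `xmltodict.parse()` chokes on HTML is that it relies on the strict expat XML parser underneath. The sketch below uses the standard library's `xml.parsers.expat` directly instead of xmltodict, so it runs without extra dependencies; `is_well_formed_xml` is a hypothetical helper name, and the sample markup is made up rather than taken from XKCDB.

```python
from xml.parsers.expat import ExpatError, ParserCreate


def is_well_formed_xml(markup):
    """Return True if expat accepts the markup as a complete XML document."""
    parser = ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(markup, True)
    except ExpatError:
        return False
    return True


# Typical HTML leaves void tags such as <br> unclosed, which is not valid XML.
print(is_well_formed_xml("<p>first line<br>second line</p>"))
# An RSS-style fragment is well-formed, which is why the feed would parse fine.
print(is_well_formed_xml("<rss><item>a quote</item></rss>"))
```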

dgw commented 8 years ago

From a quick search through the repo on GitHub, sopel doesn't appear to parse HTML anywhere. A few modules used to reference HTMLParser, but none appear to use it any more.

That said, it should be no big deal to switch from lxml's HTML parser to HTMLParser. It's just a different refactor, plus importing sys to check the Python version, because the module was reorganized after Python 2.x (HTMLParser became html.parser in Python 3).
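
A minimal sketch of what that refactor might look like, assuming the quote text lives in an element with a particular class. The class name `quote`, the `QuoteExtractor` name, and the sample markup are all hypothetical, not taken from XKCDB's actual pages or from this module's code.

```python
import sys

# The module was renamed between major versions, hence the version check.
if sys.version_info[0] >= 3:
    from html.parser import HTMLParser
else:
    from HTMLParser import HTMLParser

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class QuoteExtractor(HTMLParser):
    """Collect the text of every element carrying the target class."""

    def __init__(self, target_class="quote"):  # class name is an assumption
        HTMLParser.__init__(self)
        self.target_class = target_class
        self.depth = 0      # nesting depth inside a matching element
        self.quotes = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.quotes.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self._buffer.append(data)


extractor = QuoteExtractor()
extractor.feed('<div><span class="quote">hello <b>world</b></span></div>')
print(extractor.quotes)
```

Unlike lxml, HTMLParser offers no XPath or CSS selectors, so the depth counter above stands in for element selection. On Python 2, character and entity references arrive through separate callbacks (`handle_entityref`, `handle_charref`) that this sketch does not handle.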