ericflo / pynzb

pynzb is a unified API for parsing NZB files, with several concrete implementations included
BSD 3-Clause "New" or "Revised" License

Add Python 3 support to LXML parser #6

Closed arcresu closed 2 years ago

arcresu commented 7 years ago

The lxml etree API changed in Python 3 to take BytesIO instead of StringIO. This patch maintains the original behaviour in Python 2 but switches to BytesIO in Python 3, encoding the XML text as UTF-8 before parsing.
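A minimal sketch of the version switch described above, not the actual patch. It uses the standard library's `xml.etree.ElementTree` as a stand-in for `lxml.etree` (an assumption, so it runs without lxml installed); `SAMPLE`, `open_source` and `segment_ids` are hypothetical names made up for illustration.

```python
import sys
from io import BytesIO
import xml.etree.ElementTree as etree  # stand-in for lxml.etree (assumption)

# Made-up, simplified NZB-like document (no namespace), for illustration only.
SAMPLE = (
    u'<nzb><file subject="demo"><segments>'
    u'<segment number="1">a@b</segment>'
    u'<segment number="2">c@d</segment>'
    u'</segments></file></nzb>'
)


def open_source(xml_text):
    """Wrap XML text in a file-like object the parser accepts."""
    if sys.version_info[0] >= 3:
        # Python 3: hand the parser bytes, assuming UTF-8 is acceptable.
        return BytesIO(xml_text.encode("utf-8"))
    # Python 2: keep the original StringIO behaviour.
    from StringIO import StringIO
    return StringIO(xml_text)


def segment_ids(xml_text):
    """Collect the text of every <segment> element via iterparse."""
    ids = []
    for _event, elem in etree.iterparse(open_source(xml_text), events=("end",)):
        if elem.tag == "segment":
            ids.append(elem.text)
    return ids
```

Calling `segment_ids(SAMPLE)` walks the document incrementally and returns the two message IDs in document order.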

In combination with #5, this change is sufficient to get everything working in Python 3. If assuming UTF-8 encoding turns out to be a problem, then this could either be elaborated upon or the LXML implementation could just be disabled for Python 3 as an easy way out.

vadmium commented 7 years ago

If only Python 2.6+ has to be supported, you might be able to use io.BytesIO unconditionally.
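A rough sketch of what the unconditional variant might look like, assuming Python 2.6+ where `io.BytesIO` is available. Again `xml.etree.ElementTree` stands in for `lxml.etree`, and `file_subjects` is a hypothetical helper, not part of pynzb.

```python
from io import BytesIO
import xml.etree.ElementTree as etree  # stand-in for lxml.etree (assumption)


def file_subjects(data):
    """Return the subject attribute of each <file> element.

    Accepts either bytes or text; text is encoded as UTF-8 (an assumption).
    """
    if not isinstance(data, bytes):
        data = data.encode("utf-8")
    # No version check needed: io.BytesIO exists on Python 2.6+ and 3.x.
    return [
        elem.get("subject")
        for _event, elem in etree.iterparse(BytesIO(data))
        if elem.tag == "file"
    ]
```

The same code path then serves both major Python versions, which is the simplification being suggested.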

In Issue #3, I pointed at my own Python 3 changes. The main difference seems to be that you accept text to be encoded in UTF-8, while I accepted pre-encoded bytes. See also the test suite, which has an XML string declared to be encoded in Latin-1 (iso-8859-1), although that string seems to contain only ASCII, so UTF-8 would also work there.
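To illustrate the difference between the two approaches, here is a small sketch using the standard library parser as a stand-in for lxml (an assumption). The documents and the `poster_of` helper are made up for the example. It shows that a Latin-1 declaration is harmless for ASCII-only content, but that re-encoding text as UTF-8 while the declaration still says iso-8859-1 garbles non-ASCII characters.

```python
import xml.etree.ElementTree as etree  # stand-in for lxml.etree (assumption)

DECL = u'<?xml version="1.0" encoding="iso-8859-1"?>'
ASCII_DOC = DECL + u'<file poster="Jose"/>'
ACCENT_DOC = DECL + u'<file poster="Jos\u00e9"/>'  # "José"


def poster_of(raw_bytes):
    """Parse pre-encoded bytes and return the poster attribute."""
    return etree.fromstring(raw_bytes).get("poster")


# ASCII-only content: Latin-1 and UTF-8 bytes are identical, so both work.
ascii_ok = poster_of(ASCII_DOC.encode("iso-8859-1")) == poster_of(
    ASCII_DOC.encode("utf-8")
)

# Bytes that match the declaration decode correctly.
matching = poster_of(ACCENT_DOC.encode("iso-8859-1"))

# UTF-8 bytes under a Latin-1 declaration: each UTF-8 byte is read as
# one Latin-1 character, so the accented letter comes out as two characters.
mismatched = poster_of(ACCENT_DOC.encode("utf-8"))
```

Here `matching` is the original name, while `mismatched` is one character longer because the two UTF-8 bytes of the accented letter are interpreted separately.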

arcresu commented 7 years ago

Using BytesIO unconditionally is fine for my purposes and much simpler, but I'm not sure which Python versions @ericflo wanted to maintain compatibility with.

As for how the encoding is handled, I don't have a strong opinion, but I suspect that in real-world usage the encoding declared within the XML rarely matches the actual encoding.