Exception Due to Bad Unicode Encoding

TAXIIProject / libtaxii

A Python library for handling TAXII Messages invoking TAXII Services.

http://libtaxii.readthedocs.org/

BSD 3-Clause "New" or "Revised" License

70 stars 42 forks source link

Exception Due to Bad Unicode Encoding #200

Closed brlogan closed 9 years ago

brlogan commented 9 years ago

I pulled some data from a TAXII feed and encountered some bytes that were not proper UTF-8. This triggered a "lxml.etree.XMLSyntaxError" exception. When get_message_from_xml is run on the response_message, etree.parse ultimately chokes on the improper bytes. It would be nice if this situation was handled more gracefully.

As a workaround, I can modify libtaxii and run xml_string = xml_string.decode('utf-8', 'replace') on the XML string prior to handing it off to lxml for parsing, but I'm thinking there may be a better way to handle this.

MarkDavidson commented 9 years ago

@brlogan,

Thank you for submitting the issue. Just so we're on the same page: are you envisioning a patch to libtaxii to autodetect the input string's encoding?

-Mark

brlogan commented 9 years ago

Yes, or at least something that would attempt a few of the more common encodings. In my case, it looks like I needed Windows-1252 (ANSI). Being able to specify an encoding, and an option to ignore or replace if a proper encoding cannot be found, might also be valuable.

MarkDavidson commented 9 years ago

What would you think about this:

Updating get_message_from_xml to look like this:

def get_message_from_xml(xml_string, encoding='utf_8'):
    ...
    decoded_string = xml_string.decode(encoding, 'strict')
    etree_xml = parse_xml_string(decoded_string)
    ...

And then updating get_message_from_http_response in __init__.py to pull out the encoding and pass it on oto the get_message_from_xml function.

Let me know what you think.

If I pushed this new code to a branch, do you have the ability to test it? I have added a test case for this issue (not pushed yet), but I want to make sure any fix also fixes your specific issue. -Mark

MarkDavidson commented 9 years ago

https://github.com/TAXIIProject/libtaxii/compare/issue_200

Added tests for the issue (both TAXII 1.0 and TAXII 1.1)
Updated get_messagefrom(xml|json) in both 1.0 and 1.1
Updated libtaxii.get_message_from_http_response to parse the character encoding out of the HTTP Response (from the Content-Type header)

I don't have a quick way to test various encodings from a webserver - does anyone have a quick way to test that?

@brlogan, Do these changes look like what you were thinking?

-Mark

brlogan commented 9 years ago

@MarkDavidson - Not sure how I missed your comment from a couple weeks ago, but I'll go ahead and give this a test on Wednesday when I have access to the right system. I'll let you know if it addresses the issue. Thanks!

brlogan commented 9 years ago

@MarkDavidson - Your changes do prevent the exception from occurring, so that's a big help! Two thoughts:

It might be valuable to have a way to provide the correct encoding in case an encoding isn't specified in the headers. If a user uses get_message_from_http_response, they have no way to provide an encoding even if they know it.
With your changes, if no encoding is specified, and UTF-8 is not correct, the unknown characters are simply replaced with a placeholder. Might it be worth it to try an additional encoding? Just an idea; I'm not sure how much sense this really makes. Roughly 93% of the internet is made up of just 2 encodings: http://w3techs.com/technologies/overview/character_encoding/all You could do something like:

try:
    decoded_string = xml_string.decode('utf-8', 'strict')
except UnicodeDecodeError:
    decoded_string = xml_string.decode('latin1', 'replace')

brlogan commented 9 years ago

Oh, I almost forgot! I ran into another issue when testing your change. The version of Python I was running 2.7.3 doesn't have ssl._create_default_https_context: https://github.com/TAXIIProject/libtaxii/blob/master/libtaxii/clients.py#L422

After a little research, it looks like that may have been added in 2.7.9, but don't quote me on that. You may have to account for the micro version in your first if statement. Something like:

if ((sys.version_info.major == 2 and sys.version_info.minor == 6) or
    (sys.version_info.major == 2 and sys.version_info.minor == 7 and sys.version_info.micro < 9)):

MarkDavidson commented 9 years ago

@brlogan,

Sorry for the long delay in response. It looks to me like that is another issue, so I'll close this one and open a new issue for ssl._create_default_https_context

Thank you. -Mark

MarkDavidson commented 9 years ago

https://github.com/TAXIIProject/libtaxii/issues/203 for @brlogan's issue