Closed brlogan closed 9 years ago
@brlogan,
Thank you for submitting the issue. Just so we're on the same page: are you envisioning a patch to libtaxii to autodetect the input string's encoding?
-Mark
Yes, or at least something that would attempt a few of the more common encodings. In my case, it looks like I needed Windows-1252 (ANSI). Being able to specify an encoding, and an option to ignore or replace if a proper encoding cannot be found, might also be valuable.
What would you think about this:
Updating get_message_from_xml to look like this:
    def get_message_from_xml(xml_string, encoding='utf_8'):
        ...
        decoded_string = xml_string.decode(encoding, 'strict')
        etree_xml = parse_xml_string(decoded_string)
        ...
And then updating get_message_from_http_response in __init__.py to pull out the encoding and pass it on to the get_message_from_xml function.
Let me know what you think.
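A minimal Python 3 sketch of the idea proposed above, for illustration only. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and parse_xml_bytes is a hypothetical name, not a libtaxii function:

```python
import xml.etree.ElementTree as ET


def parse_xml_bytes(xml_bytes, encoding="utf-8", errors="strict"):
    """Decode raw bytes with the given encoding, then parse the text as XML.

    Hypothetical helper; stands in for the proposed get_message_from_xml change.
    """
    # With errors="strict", a wrong encoding fails here with
    # UnicodeDecodeError, before the XML parser ever sees the data.
    decoded = xml_bytes.decode(encoding, errors)
    return ET.fromstring(decoded)


# Windows-1252 bytes that are not valid UTF-8.
raw = "<note>café</note>".encode("cp1252")
root = parse_xml_bytes(raw, encoding="cp1252")
```

One caveat if this is done with lxml itself: lxml refuses already-decoded text that still carries an XML encoding declaration, so a real patch would probably need to account for that.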
If I pushed this new code to a branch, do you have the ability to test it? I have added a test case for this issue (not pushed yet), but I want to make sure any fix also fixes your specific issue. -Mark
https://github.com/TAXIIProject/libtaxii/compare/issue_200
libtaxii.get_message_from_http_response to parse the character encoding out of the HTTP Response (from the Content-Type header)
I don't have a quick way to test various encodings from a webserver - does anyone have a quick way to test that?
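For reference, pulling the charset out of a Content-Type header value can be sketched in Python 3 with the standard library alone. This is not the code on the issue_200 branch; charset_from_content_type and the UTF-8 default are assumptions made for the example:

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lower-cased, or the
    # fallback value when the header carries no charset parameter.
    return msg.get_content_charset(default)


encoding = charset_from_content_type("application/xml; charset=Windows-1252")
```

The returned name can be passed straight to bytes.decode, since Python accepts windows-1252 as an alias for cp1252.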
@brlogan, Do these changes look like what you were thinking?
-Mark
@MarkDavidson - Not sure how I missed your comment from a couple weeks ago, but I'll go ahead and give this a test on Wednesday when I have access to the right system. I'll let you know if it addresses the issue. Thanks!
@MarkDavidson - Your changes do prevent the exception from occurring, so that's a big help! Two thoughts:
get_message_from_http_response, they have no way to provide an encoding even if they know it.
    try:
        decoded_string = xml_string.decode('utf-8', 'strict')
    except UnicodeDecodeError:
        decoded_string = xml_string.decode('latin1', 'replace')
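The fallback idea in the snippet above could be generalized along these lines. This is a Python 3 sketch, not libtaxii code; decode_with_fallback and the choice of candidate encodings are assumptions:

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding strictly; if all fail, replace undecodable bytes.

    Returns a (text, encoding_used) tuple. Hypothetical helper.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding, "strict"), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and mark every
    # undecodable byte with U+FFFD instead of raising.
    return raw.decode(encodings[0], "replace"), encodings[0]


text, used = decode_with_fallback(b"caf\xe9")
```

Note that latin1 maps every possible byte to a character, so decoding with it never fails and the 'replace' handler never comes into play. The sketch uses cp1252 as the second candidate because it leaves a few bytes undefined, which lets the final replace step actually trigger.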
Oh, I almost forgot! I ran into another issue when testing your change. The version of Python I was running (2.7.3) doesn't have ssl._create_default_https_context:
https://github.com/TAXIIProject/libtaxii/blob/master/libtaxii/clients.py#L422
After a little research, it looks like that may have been added in 2.7.9, but don't quote me on that. You may have to account for the micro version in your first if statement. Something like:
    if ((sys.version_info.major == 2 and sys.version_info.minor == 6) or
            (sys.version_info.major == 2 and sys.version_info.minor == 7 and sys.version_info.micro < 9)):
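A more compact way to express the same kind of check is a tuple comparison, or skipping the version number entirely and testing for the attribute. The sketch below is only an illustration: lacks_default_https_context is a hypothetical name, the 2.7.9 cutoff is the unconfirmed figure mentioned above, and the comparison also matches anything older than 2.6:

```python
import ssl
import sys


def lacks_default_https_context(version_info=sys.version_info):
    """True for interpreters older than 2.7.9 (cutoff taken from the thread)."""
    # Tuples compare element by element, so this covers 2.6.x and 2.7.0-2.7.8
    # (and anything older) without spelling out major/minor/micro.
    return tuple(version_info[:3]) < (2, 7, 9)


# Feature detection avoids hard-coding a version number at all.
HAS_DEFAULT_HTTPS_CONTEXT = hasattr(ssl, "_create_default_https_context")
```

Indexing the tuple should also sidestep the named fields (major, minor, micro), which as far as I know only exist on sys.version_info from Python 2.7 onward.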
@brlogan,
Sorry for the long delay in response. It looks to me like that is another issue, so I'll close this one and open a new issue for ssl._create_default_https_context.
Thank you. -Mark
https://github.com/TAXIIProject/libtaxii/issues/203 for @brlogan's issue
I pulled some data from a TAXII feed and encountered some bytes that were not proper UTF-8. This triggered a "lxml.etree.XMLSyntaxError" exception. When get_message_from_xml is run on the response_message, etree.parse ultimately chokes on the improper bytes. It would be nice if this situation was handled more gracefully.
As a workaround, I can modify libtaxii and run
    xml_string = xml_string.decode('utf-8', 'replace')
on the XML string prior to handing it off to lxml for parsing, but I'm thinking there may be a better way to handle this.