Text in FileObject.size_in_bytes Field Causes Error

CybOXProject / python-cybox

A Python library for parsing, manipulating, and generating CybOX content.

http://cybox.readthedocs.org/

BSD 3-Clause "New" or "Revised" License

77 stars 42 forks source link

Text in FileObject.size_in_bytes Field Causes Error #278

Open brlogan opened 8 years ago

brlogan commented 8 years ago

I have come across many STIX documents that include units in the FileObject.size_in_bytes field (despite its name). For example, I have a bunch of STIX documents with size_in_bytes values like "123 bytes". Because python-cybox tries to convert this value to a long, I get an "invalid literal for long() with base 10" error. From a TAXII/STIX perspective, it's annoying to have an entire STIX package fail because of this.

I'm not sure what the "right" answer to this problem is, but even something as simple as stripping out "bytes" or "B" would be helpful. If you wanted to get fancy, you could also do conversions for things like "KB", "MB", etc.

brlogan commented 8 years ago

This issue was brought up again earlier this month on the cit-users list. While more granular exception handling would help us not lose the entire package, even just handling "bytes" in the field would be very helpful. Any thoughts on this?

Matthew Hall writes:

...when the code comes across a CybOX FileObj w/ a bogus Size_In_Bytes, the exception disrupts parsing the entire STIX Package not just the corrupted / invalid entity:
380058 bytes/FileObj:Size_In_Bytes ValueError: invalid literal for long() with base 10: '380058 bytes' File ".../venv/lib/python2.7/site-packages/cybox/common/properties.py", line 514, in _parse_value return long(value, 0) How can I perform a best-effort parse with python-stix in order to operate as properly as possible in such situations?

gtback commented 8 years ago

We could certainly strip out all non-digit characters. This would break for things like "KB", but content like that wouldn't have worked in the first place. I'm a bit hesitant to do anything more than that. One thing about the current approach is that it handles hex ("0x") numbers correctly, but would break with this naive approach. Side note: I think octal numbers would be OK, since a leading "0" (except for "0x") causes Python to interpret it as octal, even if it's not "0o".

brlogan commented 8 years ago

I have never seen hex or octal values in this field, but "bytes" seems to be pretty common. I don't think just stripping non-digit characters would be a good choice. I'd prefer to handle/convert for a few common cases (bytes, KB, MB, etc.) and continue triggering an exception for ambiguous values.

gtback commented 8 years ago

Thanks, @brlogan. One issues is that it is easiest to implement it for all UnsignedLongObjectPropertyType properties at the same time. A lot of the other fields are places where I would legitmately expect a "0x" prefix. I can see three basic solutions:

Implement custom logic for the size_in_bytes field.
If the value ends in bytes, remove the last 6 characters.
Strip all alphabetic characters (upper and lower) from the beginning and end of the string. Octal and hex values should be unaffected, since they start with 0, but it would handle all kinds of suffixes.

I'm leaning towards the third, but it could be overkill. Thoughts?

brlogan commented 8 years ago

I'm not sure that the third option is the better way to go. If we strip something like "MB" or "KB" from the end of the string and just use the numeric value as if it were bytes, then we are working with incorrect data. An error may be better in that case. Further, if you strip letters from the end, you may change the hex value. If someone is up to it, implementing some custom logic would be really nice, but with the frequency I've come across "bytes" in that field, I'd be happy with just the middle option.