CybOXProject / python-cybox

A Python library for parsing, manipulating, and generating CybOX content.
http://cybox.readthedocs.org/
BSD 3-Clause "New" or "Revised" License

Python 3: ByteRuns #293

Open dilyanpalauzov opened 7 years ago

dilyanpalauzov commented 7 years ago

What is the right way to generate byte_runs with python3?

import cybox.common
import cybox.objects.file_object

file_object = cybox.objects.file_object.File()
file_object.byte_runs = cybox.common.ByteRuns()
file_object.byte_runs.byte_run = cybox.common.ByteRun()

with open('/bin/ls', 'rb') as f:
    file_object.byte_runs.byte_run[0].byte_run_data = f.read()

with open('test', 'wb') as f:
    f.write(file_object.to_xml())

creates <cyboxCommon:Byte_Run_Data>b'\x7fELF\x02\x01\x01\....\x00\x00\x00'</cyboxCommon:Byte_Run_Data>

but I would expect CDATA, no b'...' and no \x00.

gtback commented 7 years ago

Hi, @dilyanpalauzov, thanks for the question.

The Byte_Run_Data field is not one of the more frequently used fields in CybOX (in my experience), and is not restricted (it uses xs:anyType), so I don't know how people are using it in practice. XML doesn't allow NULL bytes (\x00), either as raw bytes or via the numeric reference &#0;. The string &#0; can appear inside a CDATA block, but I'm not clear whether it is interpreted there as a NULL or as the literal characters &, #, 0, and ;. So you might need to do something like base64-encoding the data anyway.
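A minimal sketch of the base64 idea, using only the Python standard library rather than python-cybox. The helper name `encode_byte_run` is hypothetical and the element name is only borrowed from the schema; this is not how the bindings serialize the field today.

```python
import base64
import xml.etree.ElementTree as ET


def encode_byte_run(raw: bytes) -> str:
    """Return an XML fragment holding `raw` as base64 text (hypothetical helper)."""
    elem = ET.Element("Byte_Run_Data")
    # b64encode returns bytes; decode to str so no b'...' repr ends up in the XML.
    elem.text = base64.b64encode(raw).decode("ascii")
    return ET.tostring(elem, encoding="unicode")


def nul_reference_is_rejected() -> bool:
    """Check that the stdlib parser refuses a numeric reference to NUL."""
    try:
        ET.fromstring("<a>&#0;</a>")
    except ET.ParseError:
        return True
    return False


sample = b"\x7fELF\x02\x01\x01\x00"
xml_fragment = encode_byte_run(sample)
print(xml_fragment)
print(nul_reference_is_rejected())
```

Base64 output contains only ASCII letters, digits, `+`, `/`, and `=`, so the encoded text needs neither a CDATA wrapper nor any escaping.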

For types that we explicitly expect to need a CDATA wrapper (like an HTML email message), we add the wrapper in the bindings. We can certainly do that for Byte_Run_Data, too, as well as adding metadata to show how the data is encoded (for example, Base64). The b'...' wrapper should definitely not be there (it is an artifact of Python 3). Regardless of what we do to encode the data, we'll also need python-cybox to be able to decode the Byte Run XML back into raw data when parsing.
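To illustrate where the b'...' comes from and what a round trip with encoding metadata could look like, here is a standard-library sketch. The `encoding` attribute and the functions `to_xml` and `from_xml` are assumptions made up for this example; they are not part of the CybOX schema or of python-cybox.

```python
import base64
import xml.etree.ElementTree as ET

raw = b"\x7fELF\x00\x00"

# In Python 3, str() on a bytes object returns its repr, including the b'...' wrapper.
leaked = str(raw)


def to_xml(data: bytes) -> str:
    """Serialize bytes as base64 text and record the encoding (assumed attribute)."""
    elem = ET.Element("Byte_Run_Data", {"encoding": "Base64"})
    elem.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(elem, encoding="unicode")


def from_xml(fragment: str) -> bytes:
    """Parse the fragment and decode it back into the original raw bytes."""
    elem = ET.fromstring(fragment)
    if elem.get("encoding") != "Base64":
        raise ValueError("unsupported or missing encoding attribute")
    return base64.b64decode(elem.text or "")


round_tripped = from_xml(to_xml(raw))
print(leaked)
print(round_tripped == raw)
```

Recording the encoding next to the data lets a parser decide how to decode it, which matters because the field itself accepts arbitrary content.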

Have you run across Byte Run Data on CybOX "in the wild"? I would be curious what it looks like.

As a side note, if you are planning to represent the entire contents of a file, standard practice is to use the Artifact object instead. I realize that reading the contents of a file may just be an example, though, and your question is valid regardless.

dilyanpalauzov commented 7 years ago

Why is Artifact better than Byte_Run for representing a file, which is already partially described in a FileObjectType?

The FileObjectType already contains many properties for a file, like hashes, bits per pixel (for pictures) and so on, and has Byte_Runs. What is Byte_Runs supposed to be used for?

gtback commented 7 years ago

The ByteRun type can be used to represent any subset of the bytes in a larger object. I don't recall the exact history, but the Artifact Object can be used as a standalone object; it does not need to be embedded within another Object. The Artifact object has a lot more options for specifying how the binary data is encoded; this is much more expressive than the open-ended Byte_Run_Data field.

There's certainly some duplicated functionality between the two. I've always tended to use the Artifact object, and haven't seen Byte_Run_Data being used.

I'd be happy to add better support for Byte_Run_Data in python-cybox, but I would need to know how it actually gets used, to make sure we serialize it correctly to and from XML.