STIXProject / specifications

DRAFT STIX specification documents for version 1.2

STIX Difficulties: Generate Object ID from hashed Object contents #62

Open terrymacdonald opened 8 years ago

terrymacdonald commented 8 years ago

PROBLEM

A few people have mentioned that they would like to create Object IDs from a hash of the Object contents. The argument is that this would help when deduplicating CybOX Objects: if the same content is observed multiple times, the resulting Objects would share an ID, making duplicates easy to detect.
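The idea can be sketched in a few lines. This is a minimal illustration, not part of any STIX specification: the `example` namespace prefix, the SHA-256 choice, and the sorted-keys JSON canonicalization are all assumptions made for the example.

```python
import hashlib
import json

def object_id(obj: dict, namespace: str = "example") -> str:
    """Derive a deterministic ID from an object's contents.

    The object is serialized canonically (sorted keys, no optional
    whitespace) so that logically identical objects hash the same.
    """
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{namespace}:object-{digest[:32]}"

# Two logically identical objects, differing only in key order:
a = {"type": "AddressObject", "address_value": "203.0.113.5"}
b = {"address_value": "203.0.113.5", "type": "AddressObject"}
assert object_id(a) == object_id(b)
```

With content-derived IDs, deduplication reduces to an ID comparison: two Objects with the same ID are (up to hash collision) the same Object.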

POTENTIAL ANSWER

Current list consensus seems to be that this should be permitted as one way of generating the ID, but not mandated as the only way. It was posited that some lower-powered devices may not have enough processing power to generate a hash, so mandating hash-derived IDs would exclude them.

I believe this should be mandated, as it provides a quick way of detecting whether the content was inadvertently modified in transit. As the hash is not an HMAC it does not detect malicious tampering (although this change would allow that to be supported in the future).

jmgnc commented 8 years ago

Are you saying that it should be mandated that the object id be a hash of the data?

I disagree that embedded devices don't have enough power to generate a hash. If the device is sub-100MHz, it's unlikely to be speaking STIX directly, and if it is, hashing the object contents isn't that expensive.

The hardest part of this is defining the correct serialization method for hashing the data (due to whitespace issues, etc.) such that the hash does not change when the object is reformatted.

Requiring the object ID to be the hash of the contents seems to break the ability to update an object without having to regenerate every object that references it, which could create a massive cascading wave of updates.
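The cascade concern can be made concrete. In this sketch (the object shapes and field names are invented for illustration, not taken from the STIX schemas), a higher-level object references a lower-level one by its content-derived ID, so changing the leaf changes every ancestor's ID as well:

```python
import hashlib
import json

def content_id(obj: dict) -> str:
    """Content-derived ID: hash of a canonical JSON serialization."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# A chain of references: Report -> Indicator -> File (names illustrative).
leaf = {"type": "File", "name": "evil.exe"}
parent = {"type": "Indicator", "observable_ref": content_id(leaf)}
top = {"type": "Report", "indicator_ref": content_id(parent)}

# Updating the leaf changes its ID, which changes the parent's content,
# which changes the parent's ID, and so on up the chain.
new_leaf = {"type": "File", "name": "evil2.exe"}
new_parent = {"type": "Indicator", "observable_ref": content_id(new_leaf)}
assert content_id(new_parent) != content_id(parent)
```

This is exactly the Merkle-tree property: any change to a referenced object forces regeneration of the entire chain of objects above it.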

terrymacdonald commented 8 years ago

Hi John,

It only breaks updates of objects if we still allow the Incremental Update mechanism, which requires the Object IDs to stay the same while the timestamp changes. I've proposed in Issue #64 that we only allow Major Updates and stop using Incremental Updates. That would ensure every update explicitly relates itself to the previous version of the object, removing all ambiguity and allowing us to generate the ID from the content. It would also let us use the Object ID as a form of checksum, confirming that the data within an Object matches the ID it carries: no one could modify the data within an Object in transit without the mismatch being detectable.

In other words, if every shared object is immutable (because its ID is derived from its content) then we can avoid some of the problems we currently have.

Cheers, Terry MacDonald


jmgnc commented 8 years ago

There is still the question of how to specify the serialization format for the data being hashed. We'd have to normalize timestamps to UTC, or objects that differ only by time zone offset (but represent the same UTC instant) would have different hashes. And you can't just feed JSON or XML into a hash function, because both formats allow whitespace in locations that don't affect the meaning. Specifying and implementing this is a huge pain.
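The timestamp half of the problem looks like this. A minimal sketch (the choice of RFC 3339 "Z" output as the canonical form is an assumption for the example, not anything the spec defines):

```python
from datetime import datetime, timezone

def canonical_timestamp(ts: str) -> str:
    """Normalize an ISO-8601 timestamp with a zone offset to UTC,
    so equal instants serialize (and therefore hash) identically."""
    dt = datetime.fromisoformat(ts)  # offset-aware input expected
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

# The same instant written with two different zone offsets:
a = canonical_timestamp("2015-12-03T11:58:00+13:00")
b = canonical_timestamp("2015-12-02T17:58:00-05:00")
assert a == b == "2015-12-02T22:58:00Z"
```

Without a normalization step like this, two byte-wise different but semantically identical objects would hash to different IDs, defeating the deduplication goal.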

For example, look at the XML signing tools (XML-DSig and its canonicalization requirements) to see how painful it is to get hashing of structured documents right.