Stix Difficulties: Deduplication is difficult

PROBLEM

Recognizing that two different Objects (with different Object IDs but both containing the same internal data) are related is currently outside the scope of STIX. This lack of guidance on how to detect duplication makes it difficult for implementers to actually build systems.

In most of the current Relationship fields (e.g. RelatedIncidentType, RelatedTTPType; all except RelatedObjectType) there isn’t a defined relationship vocabulary, meaning that there is no standard way for a producer to explicitly identify that one Object is a duplicate of another. Even RelatedObjectType, which does have a default vocab, doesn’t include an option for describing duplicate objects.

Using the following two objects as an example:

<cybox:Object id="myorg:Object-15be6630-c2df-4bf9-8750-3f45ca9e19cf">
  <cybox:Properties xsi:type="AddressObj:AddressObjectType" category="ipv4-addr">
   <AddressObj:Address_Value>192.168.0.5</AddressObj:Address_Value>
  </cybox:Properties>
</cybox:Object>
<cybox:Object id="yourorg:Object-e2e89241-d858-4a29-b1ec-8155c3cd3278">
  <cybox:Properties xsi:type="AddressObj:AddressObjectType" category="ipv4-addr">
   <AddressObj:Address_Value>192.168.0.5</AddressObj:Address_Value>
  </cybox:Properties>
</cybox:Object>

The two Objects come from different Organizations but contain the same Address_Value of 192.168.0.5. They are effectively exact duplicates of each other, so they should be linked in some way. Because they were discovered by different Organizations, we should still be able to retain the fact that they were seen independently.

Another related but slightly different scenario is where one Object is a superset of the other, i.e. where one Object contains only part of the other Object’s data. In this document I’ll refer to this as partial duplication. As an example, we have the following two CybOX Objects:

<cybox:Object id="myorg:Object-15be6630-c2df-4bf9-8750-3f45ca9e19cf">
  <cybox:Properties xsi:type="AddressObj:AddressObjectType" category="ipv4-addr">
   <AddressObj:Address_Value>192.168.0.5</AddressObj:Address_Value>
  </cybox:Properties>
</cybox:Object>
<cybox:Object id="yourorg:Object-6e9d1bd0-e6ed-4ccd-bb8f-0ef0995b00a3">
  <cybox:Properties xsi:type="AddressObj:AddressObjectType" category="ipv4-addr">
   <AddressObj:Address_Value>10.10.1.2##COMMA##192.168.0.5</AddressObj:Address_Value>
  </cybox:Properties>
</cybox:Object>

The ‘yourorg’ Object contains a list of IP addresses, while the ‘myorg’ Object has just one. Only one of the IPv4 addresses within the ‘yourorg’ Object duplicates the address in the ‘myorg’ Object. The question is: how would an implementer show this relationship, and is it something that we need to be able to reflect within STIX somehow?

POSSIBLE ANSWER

There are a few ways of making de-duplication easier.

A potential solution to the exact duplication problem is to mandate, for all content production, that every STIX and CybOX Object has an Object ID generated from a combination of the producer namespace and a hash of the contents. We could then easily determine whether objects have the same contents, as we would just need to compare the ‘hash’ part of the Object ID. But this causes its own problem: it breaks Incremental STIX Object Versioning (see “4. There are too many ways to update an Object (Versioning)” below). This is not too much of a problem, as we have recommended removing the Incremental Update mechanism and using only the Major Update mechanism for changes to existing objects (which is basically a full reissue of a new Object with an explicit relationship to the old Object).
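As a rough illustration of the idea, here is a minimal Python sketch of a content-derived ID. The ID layout (namespace, then 'Object-', then a SHA-256 hex digest), the JSON-based canonical form, and the function names content_hash_id and hash_part are all my own assumptions for illustration; none of this is defined by STIX or CybOX.

```python
import hashlib
import json


def content_hash_id(namespace, object_type, properties):
    """Return '<namespace>:Object-<sha256 of contents>' (hypothetical ID layout)."""
    # Serialise with sorted keys so that property order does not change the hash.
    canonical = json.dumps(
        {"type": object_type, "properties": properties},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{namespace}:Object-{digest}"


def hash_part(object_id):
    """Return the content-hash portion of an ID built by content_hash_id."""
    # A hex digest contains no hyphens, so the text after the last '-' is the hash.
    return object_id.rsplit("-", 1)[1]


props = {"category": "ipv4-addr", "Address_Value": "192.168.0.5"}
mine = content_hash_id("myorg", "AddressObjectType", props)
yours = content_hash_id("yourorg", "AddressObjectType", props)

# The full IDs differ (different producers), but the hash parts are equal.
same_contents = hash_part(mine) == hash_part(yours)
```

A real scheme would also need an agreed canonical form for the XML content (whitespace, attribute order, namespace prefixes), which this sketch sidesteps by hashing a plain dictionary.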

By detecting the same hash in both Objects’ IDs, we can then issue a Relationship Object containing a ‘duplicate_of’ relationship (or something similar) to explicitly note that the two objects refer to the same thing, e.g.:

(MyOrg Object) --[duplicate_of]-> (YourOrg Object)
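A minimal sketch of that detection step in Python, assuming every Object ID ends in a content hash after its last hyphen. The ID layout, the example IDs, the function names, and the tuple form of the relationship are illustrative assumptions only, not anything STIX defines.

```python
from collections import defaultdict
from itertools import combinations


def split_id(object_id):
    """Split '<namespace>:Object-<hash>' into (namespace, hash). Layout is assumed."""
    namespace, local = object_id.split(":", 1)
    return namespace, local.rsplit("-", 1)[1]


def find_duplicates(object_ids):
    """Return (source, 'duplicate_of', target) tuples for IDs sharing a hash."""
    by_hash = defaultdict(list)
    for object_id in object_ids:
        _, content_hash = split_id(object_id)
        by_hash[content_hash].append(object_id)

    relationships = []
    for ids in by_hash.values():
        # Pair up every two Objects that carry the same content hash.
        for source, target in combinations(sorted(ids), 2):
            relationships.append((source, "duplicate_of", target))
    return relationships


# Made-up IDs with short stand-in hashes, for illustration only.
ids = ["myorg:Object-aaa111", "yourorg:Object-aaa111", "yourorg:Object-bbb222"]
relationships = find_duplicates(ids)
```

Both original Objects stay in place, so the fact that each Organization saw the address independently is retained; only the relationship is new.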

Detecting partial duplication is trickier. The use of the shorthand ‘list’ property of Observables (often used for IP address/domain name lists) causes issues in this respect. This effectively creates a situation where one object with a single property matches one item in a list within the property of the other object. It’s a 1:N problem.

One way to rectify this is to deprecate the list shorthand and require a separate Object for each individual item. An object that currently holds a list of 5 IPv4 addresses would in the future be generated as 5 Objects, each describing a single address. This has implications for storage, handling, bandwidth, etc., but it would remove the partial duplication problem (at least at the CybOX Object level).
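To make the cost and the effect concrete, here is a small Python sketch that expands a list-valued Address_Value into one record per address. The dictionary representation, the random UUID-based IDs, and the function name expand_address_list are assumptions for illustration. The delimiter is taken from the example earlier in this document and matched case-insensitively.

```python
import re
import uuid

# Delimiter as it appears in the example above; matched case-insensitively.
LIST_DELIMITER = re.compile(re.escape("##COMMA##"), flags=re.IGNORECASE)


def expand_address_list(namespace, category, address_value):
    """Turn one list-valued address property into one record per address."""
    values = [v.strip() for v in LIST_DELIMITER.split(address_value) if v.strip()]
    return [
        {
            # Hypothetical ID scheme: a fresh random UUID for each new Object.
            "id": f"{namespace}:Object-{uuid.uuid4()}",
            "category": category,
            "Address_Value": value,
        }
        for value in values
    ]


objects = expand_address_list("yourorg", "ipv4-addr", "10.10.1.2##COMMA##192.168.0.5")
addresses = [obj["Address_Value"] for obj in objects]
```

After expansion, every Object holds exactly one value, so comparing two Objects becomes a straightforward equality check.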

Another way to rectify this is for each implementer to relate them within their system, independently of STIX. Each implementer would separate out the list of IP addresses into individual objects within their solution (outside of STIX), and would relate them in a way like this:

(MyOrg Object) ---> (192.168.0.1) ---> (YourOrg Object)
                                              |
                    (192.168.0.1) <-----------|

Note: The two IP address objects are not STIX objects, but are additional objects added within the database implementation by the implementer.

You can see from this diagram that the yourorg object has multiple objects within it (i.e. a list) and that the implementer has pulled those out into individual items to track. This method has its own problems, as there is no way for a discovered relationship to be shared.
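An implementer-side index of this kind could look something like the following Python sketch. The in-memory dictionary stands in for whatever database the implementer uses, the function name build_value_index is hypothetical, and the sample data is taken from the partial duplication example earlier in this document.

```python
from collections import defaultdict

LIST_DELIMITER = "##COMMA##"  # delimiter as it appears in the example above


def build_value_index(objects):
    """Map each individual address to the set of Object IDs that mention it."""
    index = defaultdict(set)
    for object_id, address_value in objects.items():
        for value in address_value.split(LIST_DELIMITER):
            value = value.strip()
            if value:
                index[value].add(object_id)
    return index


# Object ID -> raw Address_Value, as received from each producer.
received = {
    "myorg:Object-15be6630-c2df-4bf9-8750-3f45ca9e19cf": "192.168.0.5",
    "yourorg:Object-6e9d1bd0-e6ed-4ccd-bb8f-0ef0995b00a3": "10.10.1.2##COMMA##192.168.0.5",
}

index = build_value_index(received)

# Addresses seen in more than one Object are the (partial) duplicates.
shared = {value: ids for value, ids in index.items() if len(ids) > 1}
```

The index lives entirely inside the implementer's system, which is exactly why the relationships it uncovers cannot be shared with anyone else through STIX.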

Another way to deal with this is to create a ‘partial duplicate’ or ‘contains’ style relationship type, and to simply associate the two objects with a relationship object:

(MyOrg Object) --[partial_duplicate_of]-> (YourOrg Object)
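As a sketch of how a consumer might decide which relationship to issue, the Python below compares the sets of addresses in two Objects. The relationship names follow the suggestions in this document; the function names, the tuple form of the result, and the decision to return nothing for Objects that merely overlap are my own assumptions.

```python
LIST_DELIMITER = "##COMMA##"  # delimiter as it appears in the example above


def values_of(address_value):
    """Return the set of individual addresses in a (possibly list-valued) property."""
    return {v.strip() for v in address_value.split(LIST_DELIMITER) if v.strip()}


def classify(a_id, a_value, b_id, b_value):
    """Return a (source, relationship, target) tuple, or None if neither applies."""
    a, b = values_of(a_value), values_of(b_value)
    if not a or not b:
        return None
    if a == b:
        return (a_id, "duplicate_of", b_id)
    if a < b:  # every address in a also appears in b, and b has more
        return (a_id, "partial_duplicate_of", b_id)
    if b < a:
        return (b_id, "partial_duplicate_of", a_id)
    return None  # overlapping or disjoint sets are left unclassified here


relationship = classify(
    "myorg:Object-15be6630-c2df-4bf9-8750-3f45ca9e19cf",
    "192.168.0.5",
    "yourorg:Object-6e9d1bd0-e6ed-4ccd-bb8f-0ef0995b00a3",
    "10.10.1.2##COMMA##192.168.0.5",
)
```

Objects whose lists overlap without one containing the other would need a further relationship type, or a decision that such overlap is not worth recording.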
