Debating use-cases/scenarios and performance implications of `threat.enrichments[].indicator` vs `threat.indicator`* n

This is a verbatim copy of an external conversation, migrated here for transparency.

@rylnd writes: Hey @djptek, I was hoping we could continue the conversation from RFC Threat Integration Stage 3 - merged here:

> my concern is that the indicators (in my use case) are really separate events

I'm struggling to wrap my head around this; some examples might be helpful.

> and encapsulating them all in a single event using an array:
>
> - reduces potential for aggregations across similar events
> - prevents aggregation across events e.g. where you might wish to compare threat.enrichments[].indicator.type against threat.indicator.type

I don't disagree that the nested threat.enrichments[] makes aggregations more difficult, but without specific examples of what you're trying to do it's hard to say what the correct approach/structure should be.

In my mind, an event with threat.enrichments means: "this event matched multiple indicators; the details of those indicators, and how they were matched are as follows." An event with threat.indicator is simply an indicator. An enriched event references one or more indicators, so aggregating across both types of documents implies to me that you're trying to do some additional enrichment/joining?

Regarding your question:

> Did you consider the possibility of denormalising multiple indicators into separate events (with duplicate parent metadata) as an alternative to adding the threat.enrichments[] array?

If I'm understanding correctly, you're proposing that an event matching two indicators would actually be two events, one for each indicator? Since our documents represent userland events (and not e.g. a "matching" itself), creating one per indicator seems like an unnecessary duplication and a departure from the reality of the system (at least as I am envisioning it).

Since an event matches N indicators, threat.enrichments[] represents that relationship. We had at one point discussed collecting a separate index of "matches", but the burden of having to "join" to that index whenever alerts were retrieved was dropped in favor of the nested document structure. I think we still have opportunity to pursue this implementation, though, if that may help.

@djptek writes:

> creating one per indicator seems like an unnecessary duplication

Denormalising data prior to storage in Elasticsearch is best practice in the majority of cases where you might want to run an aggregation. There is a lot of compression going on to mitigate the impact of duplication, and any additional storage cost ought to be more than offset by faster aggregations.

> examples

for example, if I wanted to aggregate on threat.indicator.type or threat.indicator.ip and some of my events used this field while others used threat.enrichments.indicator.type or threat.enrichments.indicator.ip those fields aren't directly comparable.
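As a rough illustration of why the two field paths are not directly comparable, here is a sketch of the two request bodies such an aggregation might need, written as Python dictionaries. It assumes `threat.enrichments` is mapped as `nested` (as the ECS field reference describes it), which is why the second body needs a `nested` wrapper. The aggregation names (`indicator_ips`, `enrichments`) and the bucket size are made up for illustration.

```python
import json

# Indicator documents: a plain terms aggregation on the top-level field.
indicator_agg = {
    "size": 0,
    "aggs": {
        "indicator_ips": {
            "terms": {"field": "threat.indicator.ip", "size": 10}
        }
    },
}

# Enriched events: ASSUMING `threat.enrichments` is mapped as `nested`,
# the same terms aggregation has to sit inside a nested aggregation.
enrichment_agg = {
    "size": 0,
    "aggs": {
        "enrichments": {
            "nested": {"path": "threat.enrichments"},
            "aggs": {
                "indicator_ips": {
                    "terms": {
                        "field": "threat.enrichments.indicator.ip",
                        "size": 10,
                    }
                }
            },
        }
    },
}

# Print both bodies so they can be pasted into a search request.
print(json.dumps(indicator_agg, indent=2))
print(json.dumps(enrichment_agg, indent=2))
```

The buckets come back at different depths in the two responses, so combining them would need a client-side merge rather than a single aggregation.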

I'm aware that related.ip exists, and I've written some Painless to copy the values there. However, I do see that as unnecessary duplication rather than denormalisation: by using a new field I'm effectively bypassing all the magic of columnar keyword compression that's going on in Elasticsearch.

@rylnd writes: Understood about the denormalization best practice. However, from the security solution perspective, having an alert with two threat.enrichments[] is NOT equivalent to having two alerts, each with single indicators.

An alert represents a rule detecting something in the source data that merits investigation. Rules do not generate duplicate alerts, so per-indicator alerts would break this paradigm and many workflows.

> for example, if I wanted to aggregate on threat.indicator.type or threat.indicator.ip and some of my events used this field while others used threat.enrichments.indicator.type or threat.enrichments.indicator.ip those fields aren't directly comparable.

I'm still unclear on what this aggregation would represent, since these are two types of documents; threat.indicator documents represent indicators, while documents with threat.enrichments represent any event that's been enriched with indicators.

@djptek writes: Thanks Ryland

> However, from the security solution perspective, having an alert with two threat.enrichments[] is NOT equivalent to having two alerts, each with single indicators.

If the data for a unique alert with two threat.enrichments[] were to be denormalized, this doesn't equate to two alerts. There would be two Elasticsearch documents, each with a unique ID and each sharing a common Alert ID. So there is still only one alert.

In isolation, each document represents a unique indicator.

Aggregated on the basis of their common Alert ID, the set of documents represents the alert + indicators. My goal in suggesting this is to ensure that the system delivers best-in-class performance both at ingest and query/aggregation time.
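A minimal sketch of that regrouping, in plain Python rather than an Elasticsearch query. The field name `alert_id`, the `_id` values and the sample documents are hypothetical, chosen only to show that several documents sharing one alert ID still describe a single alert.

```python
from collections import defaultdict

# Hypothetical denormalised documents: one per indicator, each with its own
# document ID and a shared alert ID.
docs = [
    {"_id": "doc-1", "alert_id": "A", "threat": {"indicator": {"ip": "1.128.0.0"}}},
    {"_id": "doc-2", "alert_id": "A", "threat": {"indicator": {"ip": "1.128.0.1"}}},
    {"_id": "doc-3", "alert_id": "B", "threat": {"indicator": {"ip": "1.128.0.0"}}},
]


def group_by_alert(documents):
    """Rebuild the per-alert view: alert ID -> list of its indicators."""
    grouped = defaultdict(list)
    for doc in documents:
        grouped[doc["alert_id"]].append(doc["threat"]["indicator"])
    return dict(grouped)


alerts = group_by_alert(docs)
print(len(docs), "documents,", len(alerts), "alerts")
```

With the sample data above, three documents collapse back into two alerts, one of which carries two indicators.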

Is it OK with you if I copy this Slack conversation wholesale into a new GitHub issue referencing the original? :elasticheart:

@rylnd

> Is it OK with you if I copy this Slack conversation wholesale into a new GitHub issue referencing the original?

Of course, that sounds fantastic!

@ebeahan @kgeller @jamiehynds @epixa

@djptek could you perhaps add a description of the use case, problem, and proposed solution?

Hi @djptek, adding my 5 cents to the discussion from the Threat Intelligence capabilities point of view. From the conversation, it is not very clear what precisely the proposed solution is, but I want to highlight the vital detail mentioned by @rylnd:

> an event with threat.enrichments means: "this event matched multiple indicators; the details of those indicators, and how they were matched are as follows." An event with threat.indicator is simply an indicator.

In the Slack conversation, you wrote two things that made me wonder if there is some conceptual misunderstanding about the difference between an IoC and an event (e.g. an alert) enriched with IoC(s):

> In isolation, each document represents a unique indicator.

and

> for example, if I wanted to aggregate on threat.indicator.type or threat.indicator.ip and some of my events used this field while others used threat.enrichments.indicator.type or threat.enrichments.indicator.ip those fields aren't directly comparable.

An alert enriched with an indicator is not an indicator itself. I would be interested to see what your use case is, but mixing IoCs and events enriched with IoCs in one aggregation isn't something that I've observed before.

Also, mind that the discussion might be relevant to the Threat Intelligence Indicators view our team is building. For our data view we are using IoC documents (the ones where the data is in threat.indicator.*), but we don't need the events enriched with IoCs (threat.enrichments[]) to show up there. Of course, we also use event.type: indicator and event.category: threat to get only indicators for our data view, so if such denormalization is necessary, it should still be possible to distinguish indicators from other events by event.type. But as I mentioned, first of all it would be great to understand the use case.
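As a sketch only, the filter described here could be expressed along the following lines. The query body is a guess at one way to write it (two `term` clauses in a `bool` filter), not the actual query used by the data view, and `is_indicator` is a hypothetical local helper that applies the same check to plain dictionaries.

```python
# One possible request body that keeps only indicator documents.
indicator_filter = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"event.category": "threat"}},
                {"term": {"event.type": "indicator"}},
            ]
        }
    }
}


def as_list(value):
    """Normalise a field that may hold a single value or a list of values."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def is_indicator(doc):
    """Local equivalent of the filter above for an in-memory document."""
    event = doc.get("event", {})
    return ("threat" in as_list(event.get("category"))
            and "indicator" in as_list(event.get("type")))


indicator_doc = {
    "event": {"category": "threat", "type": "indicator"},
    "threat": {"indicator": {"ip": "1.1.1.1"}},
}
# A denormalised alert document carries threat.indicator too, but is still
# distinguishable because its event fields differ.
alert_doc = {
    "event": {"id": "1", "kind": "alert"},
    "threat": {"indicator": {"ip": "1.1.1.1"}},
}

print(is_indicator(indicator_doc), is_indicator(alert_doc))
```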

Hi @maxcold please forgive my ignorance, what is "IoC"?

Also, while I do have a use case, which I'd be happy to share, that's kind of going off-topic. I think we can move forward more effectively by keeping this as abstract as possible.

My goal here is not to discuss what constitutes an alert or an indicator; these are concepts which you have already defined. I'm just looking at how to map this data to Elasticsearch documents optimally for concurrent ingest, search and aggregation, to ensure the best end-user experience.

Hey, @djptek sorry for not introducing the acronym. IoC = Indicator of Compromise

My thinking was that it is important to get clarity on the concepts and the relations between them before going into the details of data modeling. Already in the issue title, I see something that contradicts the reality I observe: threat.enrichments[].indicator vs threat.indicator* n. For me there is no "vs" here; threat.indicator* n and threat.enrichments[].indicator are separate entities. Let me explain how I understood your proposal so you can check if I understood it correctly:

Let's say we have two Indicators of Compromise ingested by two integrations:

{
  event: {
    category: 'threat',
    type: 'indicator'
  },
  threat: {
    indicator: {
      ip: '1.1.1.1'
    },
    feed: {
      name: '[Filebeat] AbuseCH MalwareBazaar'
    }
  }
}

and

{
  event: {
    category: 'threat',
    type: 'indicator'
  },
  threat: {
    indicator: {
      ip: '1.1.1.1'
    },
    feed: {
      name: '[Filebeat] Anomali'
    }
  }
}

We also have an Indicator Match Rule on destination.ip matching threat.indicator.ip.

When there is a source event with destination.ip: 1.1.1.1, your proposal is to create two documents sharing one alert_id:

{
  alert_id: '123',
  threat: {
    indicator: {
      ip: '1.1.1.1'
    },
    feed: {
      name: '[Filebeat] AbuseCH MalwareBazaar'
    }
  }
}

and

{
  alert_id: '123',
  threat: {
    indicator: {
      ip: '1.1.1.1'
    },
    feed: {
      name:  '[Filebeat] Anomali'
    }
  }
}

instead of one alert with two indicators in the enrichments:

{
  threat: {
    enrichments: [
    {
      indicator: {ip: '1.1.1.1'},
      feed: {name:  '[Filebeat] AbuseCH MalwareBazaar'}
    },
    {
      indicator: {ip: '1.1.1.1'},
      feed: {name:  '[Filebeat] Anomali'}
    },
    ]
  }
}

Did I understand you correctly?
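The difference between the two shapes in this example can be sketched as a small transformation. This is illustrative only: `denormalise` is a hypothetical helper, and the alert ID is passed in separately because the enrichments example above does not carry one.

```python
def denormalise(alert_id, alert):
    """Return one document per enrichment, each carrying the shared alert ID."""
    docs = []
    for enrichment in alert.get("threat", {}).get("enrichments", []):
        docs.append({
            "alert_id": alert_id,
            "threat": {
                # Lift the nested indicator and feed up to the threat level.
                "indicator": enrichment["indicator"],
                "feed": enrichment["feed"],
            },
        })
    return docs


# The single-alert shape from the example above.
alert = {
    "threat": {
        "enrichments": [
            {"indicator": {"ip": "1.1.1.1"},
             "feed": {"name": "[Filebeat] AbuseCH MalwareBazaar"}},
            {"indicator": {"ip": "1.1.1.1"},
             "feed": {"name": "[Filebeat] Anomali"}},
        ]
    }
}

flat_docs = denormalise("123", alert)
for doc in flat_docs:
    print(doc)
```

Each resulting document matches one of the two `alert_id: '123'` documents shown earlier in the example.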

Hi @maxcold thanks for trying to map this out, you have understood my intention correctly with one exception: that the alerts were already generated externally in a 3rd party system, sorry if that wasn't clear before.

I'm working with the ECS format (Elastic Common Schema) to import 3rd party data into the Elastic Stack for analysis. The data describes Alerts which were already created by an external system.

ECS is an open schema. I am:

- following the field descriptions in the ECS documentation
- importing 3rd party alerts according to those descriptions
- trying to reconcile this with ensuring the Elastic Stack is performant

What I am not trying to do in any way is to suggest how Elastic Security should do things. However, I believe what I am doing is a good analogue for how an external user might interpret the field descriptions in ECS to work with 3rd party threats/alerts/indicators etc. My concern at this point is that:

- applying the ECS documentation for threat.enrichments as written today to external data formats led me to encapsulate arrays of non-denormalised objects within single Elasticsearch documents
- when I subsequently started considering dashboards/aggregations I decided this would work better denormalised

Disclaimer: I am not endorsing the 3rd party data formats. They are simply facts. I am trying to use ECS to reconcile these with Elastic Stack in the most optimal manner.

I'll give some examples below. One thing I'd like to keep in mind from the very beginning is the distinction between an Elasticsearch document and an Event which I can see you already understood by your examples :+1:

Without denormalising, the incoming alerts, mapped to the most "apparent" ECS threat fields, might use three distinct formats, shown simplified below:

// single alert has multiple sessions, each session has multiple activities each with corresponding IP + other properties
// these represent suspicious locations, or sessions
{
  "event": {
    "id": "1"
  },
  "threat": {
    "enrichments": [
      {
        "indicator": {
          "ip": "1.128.0.0"
        }
      },
      {
        "indicator": {
          "ip": "1.128.0.1"
        }
      }
    ]
  }
}

//  single alert has an array of IPs, representing a suspicious upload
POST ex/_doc
{
  "event": {
    "id": "2"
  },
  "threat": {
    "indicator": {
      "ip": "1.128.0.0"
    }
  }
}

//  single alert has an array of IPs, representing suspicious downloads. Initially, I put these in related.ip, where I probably should have created an array of enrichments, but leaving it here as it illustrates a worst case scenario of IPs in 3 different hierarchical levels
POST ex/_doc
{
  "event": {
    "id": "3"
  },
  "related.ip": [
    "1.128.0.0",
    "1.128.0.1"
  ]
}

After refactoring and denormalising, these might be expressed as:

{"event":{"id":"1"},"threat":{"indicator":{"ip":"1.128.0.0"}}}

{"event":{"id":"1"},"threat":{"indicator":{"ip":"1.128.0.1"}}}

{"event":{"id":"2"},"threat":{"indicator":{"ip":"1.128.0.0"}}}

{"event":{"id":"3"},"threat":{"indicator":{"ip":"1.128.0.0"}}}

{"event":{"id":"3"},"threat":{"indicator":{"ip":"1.128.0.1"}}}

Other properties for documents representing events 1 and 3 will be duplicated - that's OK, there's a lot of compression going on and in Elasticsearch it's by design, as opposed to a materialised join in an RDBMS where this would be a bad thing.
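To make the benefit concrete, here is a small sketch that runs a terms-style count over the five denormalised documents above using plain Python. It is a local stand-in for an aggregation, not an Elasticsearch request, and `ip_of` is a hypothetical helper. The point is that a single field path now covers every document.

```python
from collections import Counter, defaultdict

# The five denormalised documents from above.
docs = [
    {"event": {"id": "1"}, "threat": {"indicator": {"ip": "1.128.0.0"}}},
    {"event": {"id": "1"}, "threat": {"indicator": {"ip": "1.128.0.1"}}},
    {"event": {"id": "2"}, "threat": {"indicator": {"ip": "1.128.0.0"}}},
    {"event": {"id": "3"}, "threat": {"indicator": {"ip": "1.128.0.0"}}},
    {"event": {"id": "3"}, "threat": {"indicator": {"ip": "1.128.0.1"}}},
]


def ip_of(doc):
    """Every document exposes the IP at the same path."""
    return doc["threat"]["indicator"]["ip"]


# Documents per IP (roughly what a terms aggregation would count).
ip_counts = Counter(ip_of(doc) for doc in docs)

# Distinct events per IP (roughly a cardinality of event.id per bucket).
events_per_ip = defaultdict(set)
for doc in docs:
    events_per_ip[ip_of(doc)].add(doc["event"]["id"])

for ip, count in sorted(ip_counts.items()):
    print(ip, count, sorted(events_per_ip[ip]))
```

For this sample, `1.128.0.0` appears in three documents across events 1, 2 and 3, and `1.128.0.1` in two documents across events 1 and 3.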

I'm happy with this mapping: it feels clean and should be performant. However, I'm no longer using the enrichments. Is that a good thing, or not?

Thanks for sharing more context! I'm probably not the best person to judge whether it's a good model or not, as I haven't worked on the alerts and how they are enriched with indicators; I only have context about the indicators themselves. In the Security Solution, "alerts enriched with indicators" and indicators themselves are two separate entities. I will still share my thoughts, hoping they are relevant to the discussion. First of all, if we follow the current reality of the Security Solution, your example would look more like:

{"event":{category: 'threat', type: 'indicator'},"threat":{"indicator":{"ip":"1.128.0.0"}}}
{"event":{category: 'threat', type: 'indicator'},"threat":{"indicator":{"ip":"1.128.0.1"}}}
{"event":{"id":"1", <category, type, etc. of alert>},"threat":{"indicator":{"ip":"1.128.0.0"}}}
{"event":{"id":"1", <category, type, etc.  of alert>},"threat":{"indicator":{"ip":"1.128.0.1"}}}
{"event":{"id":"2", <category, type, etc.  of alert>},"threat":{"indicator":{"ip":"1.128.0.0"}}}
{"event":{"id":"3", <category, type, etc. of alert>},"threat":{"indicator":{"ip":"1.128.0.0"}}}
{"event":{"id":"3", <category, type, etc. of alert>},"threat":{"indicator":{"ip":"1.128.0.1"}}}

meaning that there are two indicators ingested from some source, and then alerts created by an indicator match rule are enriched with these indicators if a match is found in the source events. One thing is for sure: I'd guess this data model would require a complete redo of everything related to Alerts.

Then the question is: what is the goal of this particular discussion? If it is to find a solution to your problem at hand with the 3rd party integration, maybe it is a good idea to model things the same way they are modeled currently in the Security Solution. That is, if the 3rd party has alerts enriched with indicators, create indicators in addition to the alerts enriched with these indicators. Then we arrive at the concerns you have regarding the dashboards and aggregations, and here is where I would really like to learn what dashboards and aggregations you are building as they might be very relevant to our team's scope. As mentioned, we are working on the Threat Intelligence capabilities around Indicators of Compromise and want to learn more about everything related to it, and your use case seems to be very new to us (building dashboards around indicators and alerts enriched with indicators).

If the goal of the discussion is to propose and discuss the future state, then I think @rylnd is the right person to ask for feedback on the proposed model, as he has much more context about alerts and enrichments.

Thanks @maxcold

> model things the same way they are modeled currently in the Security Solution

This is what I initially did using Painless. This was only partially successful: while the code worked fine and I could map the data to ECS, the resultant mappings were mutually incompatible, so I realised they were unlikely to benefit from any already-existing dashboards/visualisations, and I was concerned about the impact on Elasticsearch of potential queries/aggs. It's readily apparent that behind the scenes, the target API is joining 3 tables with 1:n, n:n*n' relationships in one feed, and 2 tables with 1:n in the second feed, and then there are two more feeds. I can't influence the upstream design decisions that were made there; all I can do is be mindful of their impact on the stack, and the pattern this suggests is to denormalise, so I did that.

> what dashboards and aggregations you are building as they might be very relevant to our team's scope

My interpretation of the data may be naive. I'd be happy to share it if you have the time.

> your use case seems to be very new to us (building dashboards around indicators and alerts enriched with indicators)

I have a data feed with alerts. When I map the data within those alerts to ECS, it naturally falls in the threat fields. Then I visualise the data. There is no intention to do anything groundbreaking here.

@djptek happy to chat about your use case a bit more. If having a call works for you, feel free to schedule something!

elastic / ecs

Debating use-cases/scenarios and performance implications of `threat.enrichments[].indicator` vs `threat.indicator`* n #2046