[Enhancement] [M365 Defender] Break m365_defender.incident into two datasets for incidents and alerts & evidence

jvalente-salemstate commented 5 months ago

This is a pretty lengthy one, but since it's a substantial change in how the integration works, I wanted to explain why the change is necessary and provide as much information as possible for implementing it.

Summary

The current method of ingesting and processing alerts can create an excessive number of duplicate documents.
The current processing of evidence removes much of its value and usability.
These can be mitigated by creating a new stream/dataset for alerts and processing evidence as distinct documents, related to the alert.

This should decrease log volume (significantly for some incidents), make all three types of information much easier to work with, and provide a lot more value when using evidence for correlation, threat detection, and threat hunting.

Alternatively, the current methods may work if it is possible to use the timestamp from the cursor to drop alerts with a last update timestamp prior to that cursor's value. The evidence from the remaining alerts, after being split, would need to be split as well. I'm less sure, but removing json.lastUpdateDateTime from the fingerprint may also work, as it's that timestamp causing the duplication.
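
A minimal sketch of the cursor-based alternative, assuming each alert carries its own lastUpdateDateTime and the cursor is an ISO 8601 UTC string. The function names and sample data are hypothetical, and the sample timestamps have no fractional seconds; real Graph timestamps may need a more tolerant parser.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2024-02-02T11:26:00Z."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def alerts_updated_since(incident: dict, cursor: str) -> list[dict]:
    """Keep only alerts whose lastUpdateDateTime is at or after the cursor."""
    cutoff = parse_ts(cursor)
    return [
        alert
        for alert in incident.get("alerts", [])
        if parse_ts(alert["lastUpdateDateTime"]) >= cutoff
    ]


# Hypothetical incident: only alert a2 changed after the previous pull.
incident = {
    "id": "29",
    "lastUpdateDateTime": "2024-02-02T11:26:00Z",
    "alerts": [
        {"id": "a1", "lastUpdateDateTime": "2024-02-02T11:20:00Z"},
        {"id": "a2", "lastUpdateDateTime": "2024-02-02T11:26:00Z"},
    ],
}
kept = alerts_updated_since(incident, cursor="2024-02-02T11:25:00Z")
```

With this filter only a2 would go on to be split and indexed; a1 is dropped because it has not changed since the cursor.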

Problem

The m365_defender.incident data stream can be excessively noisy when handling incidents with more than a few alerts. Information in alert evidence is parsed in a way that is not useful for analysis, and it produces large documents on top of the already high number of documents.

For example, earlier this week we had an incident generated by M365 Defender with 95 alerts. With how incidents are being processed, this generated 16,654 events as any combination of alert properties changed. This was simply an informational incident; if it had been one where statuses, comments, and such were actively being worked on (vs. resolving all at once), the count could be several times to an order of magnitude higher. A lot of these problems were touched upon, both directly and indirectly, in #8231.

Why this is an issue

When pulling in incidents, the MS Graph API is called with ?$expand=alerts. This returns a collection of incident objects, each of which includes a collection of objects for every alert in the incident. Within each alert is yet another collection of evidence objects associated with that alert.

What the lastUpdateDateTime represents

The lastUpdateDateTime for an incident changes any time one of its alerts (including properties of any alert evidence) is updated, statuses change, alerts are moved, incidents are merged, and so forth. Overall changes aren't always instant, so if an incident is closed at 11:23 and all except one alert are closed between 11:23 and 11:25, the incident is updated as expected and included when the next API call runs, at 11:25 for example. If the last alert is updated at 11:26, the incident's lastUpdateDateTime is also 11:26 and the incident is included in the next pull as well.

Alerts split into events

The call to MS Graph includes alerts that have not been updated since the last call. These are split into individual documents, each representing one alert. An incident with a single alert will just create one document. In my case, any minor change in even one out of 95 alerts resulted in 95 documents. If a new alert is added to the incident, the next pull has 96 instances of the same incident, and if that alert has its automated investigation status change, another 96 documents will be created. That's about 190 more documents than really needed if no other alert has been updated.
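
The arithmetic behind that "about 190" figure, as a small sketch. It assumes one document per alert is emitted every time the incident is returned by a pull; the helper name is hypothetical.

```python
def documents_emitted(alert_counts_per_pull: list[int]) -> int:
    """Each pull that returns the incident emits one document per alert in it."""
    return sum(alert_counts_per_pull)


# Two pulls from the example above: a 96th alert is added, then that alert's
# automated investigation status changes. The incident holds 96 alerts each time.
pulls = [96, 96]
emitted = documents_emitted(pulls)  # documents actually indexed
needed = len(pulls)                 # only one alert really changed per pull
excess = emitted - needed           # documents that carry no new information
```

Here emitted is 192 and excess is 190, matching the estimate in the paragraph above.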

Below is a snippet of the stream's httpjson.yml.hbs file:

request.url: {{request_url}}/v1.0/security/incidents

request.transforms:
  - set:
      target: url.params.$top
      value: {{batch_size}}
  - set:
      target: url.params.$skip
      value: 0
  - set:
      target: url.params.$filter
      value: 'lastUpdateDateTime ge [[.cursor.last_update_time]]'
      default: 'lastUpdateDateTime ge [[formatDate (now (parseDuration "-{{initial_interval}}"))]]'
  - set:
      target: url.params.$orderby
      value: 'lastUpdateDateTime asc'
  - set:
      target: url.params.$expand
      value: 'alerts'

  split:
    target: body.alerts
    keep_parent: true
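
To illustrate what the nested split does to a response, here is a simplified sketch of my reading of keep_parent: true. It is not the httpjson implementation, and the function name and sample data are hypothetical.

```python
def split_alerts(incident: dict) -> list[dict]:
    """Emit one document per alert, copying the parent incident fields into each."""
    parent = {k: v for k, v in incident.items() if k != "alerts"}
    # The "alerts" key ends up holding a single alert object instead of a list.
    return [{**parent, "alerts": alert} for alert in incident.get("alerts", [])]


incident = {
    "id": "29",
    "lastUpdateDateTime": "2024-02-02T11:26:00Z",
    "alerts": [{"id": "a1"}, {"id": "a2"}, {"id": "a3"}],
}
docs = split_alerts(incident)
```

Every document repeats the incident fields, so an update to any one alert re-emits all of them with the new incident lastUpdateDateTime.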

Alert Evidence

Per the MS docs, evidence is a resource type representing one of several kinds of entity. Unlike alerts, which are split out into documents, these have the dot_expander processor applied to them. The ingest pipeline iterates on these to append each property to a list for that field.

This provides details at the field level but strips them of any context about their overall entity. Because some entities share fields with others, and some fields are renamed, it is difficult to associate the values with their entity. For fields that are lists (roles, threats, tags) this is even more difficult, because these 3 objects may return a list with more than 3 roles or threats. Here's an example of a reported email message with mailbox, message, and user evidence.

field	value
m365_defender.incident.alert.evidence.created_datetime	[2024-02-02 @ 16:45:49.823, 2024-02-02 @ 16:45:49.823, 2024-02-02 @ 16:45:49.823]
m365_defender.incident.alert.evidence.remediation_status	[none,none,none]
m365_defender.incident.alert.evidence.attachments_count	0
m365_defender.incident.alert.evidence.display_name	Laura Ipsim
m365_defender.incident.alert.evidence.primary_address	lipsim@qwerttyiop.qaz
m365_defender.incident.alert.evidence.user_account.azure_ad_user_id	[sid1, sid2]
email.subject	This is definitely not a phish

Here's the JSON representation of those same three objects to illustrate the difference.

{
  "@odata.type": "#microsoft.graph.security.mailboxEvidence",
  "createdDateTime": "String (timestamp)",
  "verdict": "String",
  "remediationStatus": "String",
  "remediationStatusDetails": "String",
  "roles": [
    "String"
  ],
  "tags": [
    "String"
  ],
  "primaryAddress": "String",
  "displayName": "String",
  "userAccount": {
    "@odata.type": "#microsoft.graph.security.userAccount",
    "accountName": "String",
    "azureAdUserId": "String",
    "displayName": "String",
    "domainName": "String",
    "userPrincipalName": "String",
    "userSid": "String"  
  }
},
{
  "@odata.type": "#microsoft.graph.security.userEvidence",
  "createdDateTime": "String (timestamp)",
  "verdict": "String",
  "remediationStatus": "String",
  "remediationStatusDetails": "String",
  "roles": [
    "String"
  ],
  "tags": [
    "String"
  ],
  "userAccount": {
    "@odata.type": "#microsoft.graph.security.userAccount",
    "accountName": "String",
    "azureAdUserId": "String",
    "displayName": "String",
    "domainName": "String",
    "userPrincipalName": "String",
    "userSid": "String"  
  }
},
{
  "@odata.type": "#microsoft.graph.security.analyzedMessageEvidence",
  "createdDateTime": "String (timestamp)",
  "verdict": "String",
  "remediationStatus": "String",
  "remediationStatusDetails": "String",
  "roles": [
    "String"
  ],
  "tags": [
    "String"
  ],
  "networkMessageId": "String",
  "internetMessageId": "String",
  "subject": "String",
  "language": "String",
  "senderIp": "String",
  "recipientEmailAddress": "String",
  "antiSpamDirection": "String",
  "deliveryAction": "String",
  "deliveryLocation": "String",
  "urn": "String",
  "threats": [
    "String"
  ],
  "threatDetectionMethods": [
    "String"
  ],
  "urls": [
    "String"
  ],
  "urlCount": "Integer",
  "attachmentsCount": "Integer",
  "receivedDateTime": "String (timestamp)",
  "p1Sender": {
    "@odata.type": "#microsoft.graph.security.emailSender",
    "emailAddress": "String",
    "displayName": "String",
    "domainName": "String"
  },
  "p2Sender": {
    "@odata.type": "#microsoft.graph.security.emailSender",
    "emailAddress": "String",
    "displayName": "String",
    "domainName": "String"
  }
}

There isn't any feasible way to recreate the original objects. They're also adding a large number of fields to events, which are already duplicated, sometimes to extreme degrees.
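
A small sketch of why the flattened form can't be reversed. It assumes a simplified flattening that appends every property to one list per field name, which is the spirit of the pipeline rather than its exact logic; the evidence objects and role values are illustrative only.

```python
def flatten_evidence(evidence: list[dict]) -> dict:
    """Collect every property into one list per field name."""
    flat: dict[str, list] = {}
    for item in evidence:
        for key, value in item.items():
            values = value if isinstance(value, list) else [value]
            flat.setdefault(key, []).extend(values)
    return flat


mailbox = {"type": "mailbox", "roles": ["destination"], "displayName": "Laura Ipsim"}
user = {"type": "user", "roles": ["compromised", "attacked"]}

# The same roles, but attached to different entities.
mailbox_alt = {"type": "mailbox", "roles": ["destination", "compromised"], "displayName": "Laura Ipsim"}
user_alt = {"type": "user", "roles": ["attacked"]}

flat_a = flatten_evidence([mailbox, user])
flat_b = flatten_evidence([mailbox_alt, user_alt])
same = flat_a == flat_b
```

Both inputs flatten to identical output, so there is no way to tell afterwards which entity held which role.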

For reference, here's default.yml:

  - fingerprint:
      fields:
        - json.id
        - json.lastUpdateDateTime
        - json.incidentWebUrl
        - json.createdDateTime
        - json.alerts.id
        - json.alerts.lastUpdateDateTime
      target_field: _id
      ignore_missing: true

  - foreach:
      field: json.alerts.evidence
      if: ctx.json?.alerts?.evidence instanceof List
      processor:
        dot_expander:
          field: '@odata.type'
          path: _ingest._value
          ignore_failure: true
          override: true
  - foreach:
      field: json.alerts.evidence
      if: ctx.json?.alerts?.evidence instanceof List
      processor:
        rename:
          field: _ingest._value.@odata.type
          target_field: _ingest._value.odata_type
          ignore_missing: true

  - foreach:
      field: json.alerts.evidence
      if: ctx.json?.alerts?.evidence instanceof List
      ignore_failure: true
      processor:
        append:
          field: email.to.address
          value: '{{{_ingest._value.recipient_email_address}}}'
          allow_duplicates: false
          ignore_failure: true
  - foreach:
      field: json.alerts.evidence
      if: ctx.json?.alerts?.evidence instanceof List
      ignore_failure: true
      processor:
        convert:
          field: _ingest._value.attachmentsCount
          target_field: _ingest._value.attachments_count
          type: long
          ignore_missing: true
          on_failure:
            - remove:
                field: _ingest._value.attachmentsCount
            - append:
                field: error.message
                value: '{{{_ingest.on_failure_message}}}'
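
To show how the fingerprint fields drive the duplication, here is a sketch that uses SHA-1 over the selected values as a stand-in for the ingest fingerprint processor (not its exact algorithm). The json. prefix is dropped from the field names for brevity, and the sample documents are hypothetical.

```python
import hashlib

FINGERPRINT_FIELDS = [
    "id", "lastUpdateDateTime", "incidentWebUrl", "createdDateTime",
    "alerts.id", "alerts.lastUpdateDateTime",
]


def get_path(doc: dict, path: str):
    """Resolve a dotted path such as 'alerts.id'; return None if it is missing."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def fingerprint(doc: dict, fields: list[str]) -> str:
    """Hash the selected field values into a document ID."""
    digest = hashlib.sha1()
    for field in fields:
        value = get_path(doc, field)
        if value is not None:  # mirrors ignore_missing: true
            digest.update(f"{field}={value}|".encode("utf-8"))
    return digest.hexdigest()


alert = {"id": "a1", "lastUpdateDateTime": "2024-02-02T11:20:00Z"}
first_pull = {"id": "29", "lastUpdateDateTime": "2024-02-02T11:25:00Z",
              "createdDateTime": "2024-02-02T11:00:00Z", "alerts": alert}
# A different alert changed, so only the incident timestamp moved.
second_pull = {**first_pull, "lastUpdateDateTime": "2024-02-02T11:26:00Z"}

ids_differ_with_ts = (fingerprint(first_pull, FINGERPRINT_FIELDS)
                      != fingerprint(second_pull, FINGERPRINT_FIELDS))
reduced = [f for f in FINGERPRINT_FIELDS if f != "lastUpdateDateTime"]
ids_differ_without_ts = (fingerprint(first_pull, reduced)
                         != fingerprint(second_pull, reduced))
```

With the incident timestamp in the fingerprint, the unchanged alert a1 gets a new _id on every pull and is indexed again. Without it, the _id stays stable and the second copy would overwrite the first rather than add a document.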

Proposed Solution

Do not expand alerts with the list incidents API (or if done, use it to capture m365_defender.incident.alert.provider_alert_id, as a list, for correlation). Instead, make a separate API call for alerts via the /security/alerts_v2 endpoint. Place these into a dataset, such as m365_defender.alert or a similar name.

This will still pull updated alerts with their evidence without the inclusion of non-updated alerts. Evidence could be split into separate documents. It may be necessary to include a field to mark the related alert for correlation.
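
A rough sketch of splitting evidence into its own documents, assuming an alert object shaped like the alerts_v2 response with an evidence list. The alert_id and evidence_type field names are hypothetical placeholders for whatever correlation fields the dataset would use.

```python
def evidence_documents(alert: dict) -> list[dict]:
    """Emit one document per evidence object, tagged with its parent alert."""
    docs = []
    for item in alert.get("evidence", []):
        docs.append({
            # Hypothetical correlation field pointing back at the alert.
            "alert_id": alert["id"],
            # "#microsoft.graph.security.mailboxEvidence" -> "mailboxEvidence"
            "evidence_type": item.get("@odata.type", "").rsplit(".", 1)[-1],
            "evidence": {k: v for k, v in item.items() if k != "@odata.type"},
        })
    return docs


alert = {
    "id": "a1",
    "evidence": [
        {"@odata.type": "#microsoft.graph.security.mailboxEvidence",
         "primaryAddress": "lipsim@qwerttyiop.qaz"},
        {"@odata.type": "#microsoft.graph.security.userEvidence",
         "userAccount": {"accountName": "lipsim"}},
    ],
}
docs = evidence_documents(alert)
```

Each evidence object keeps its own fields together, so the entity context survives and the alert can still be found through the correlation field.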

The new dataset should also be optional, as these alerts and evidence are already captured in the m365_defender.events dataset if the AlertInfo and AlertEvidence tables are exported to the event hub. I'd still leave the option for those not using an event hub or who simply prefer using the API.

Additionally, a majority of the yml.hbs file is handling evidence fields and building out lists for each property. Having evidence as its own documents should simplify that and keep it to just renaming fields.

Issues to consider

This requires an additional permission for the MS Graph API, SecurityAlert.Read.All.
This doesn't completely resolve duplication. Incidents and alerts are still duplicated but only once each time any property or nested property is updated.

This level of duplication is not really a problem imho. It becomes a log of when an incident or alert was updated, and getting the current status should be accomplished in Kibana or via query.
The mapping of ECS fields needs to be considered. event.kind: alert is straightforward, but evidence and incidents may need some thought. event.kind: enrichment seems like it'd be good for evidence.
The dashboards will need to be updated to reference the correct dataset. Some of the aggregations referencing evidence may need to be reworked.

elasticmachine commented 5 months ago

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

piyush-elastic commented 5 months ago

Hi @jvalente-salemstate, thank you so much for your feedback. I agree that fetching both incidents and alerts makes the single data stream noisy when handling incidents with more than a few alerts. I am happy to inform you that in the latest release (2.7.0) we have added support for a separate /alerts dataset, but didn't change anything in the existing data stream considering the impact on existing users. Please feel free to share more feedback on this.

jvalente-salemstate commented 4 months ago

I've updated and monitored for a few days. The change in alert volume ranges from none for an incident with a single alert (1 event in each datastream) up to a 98% decrease for some larger incidents. Over the last 4 or 5 days, it looks like it'd cut about 2/3 of the volume from the excess alert duplication.

I would maybe include an option for when Incidents are enabled to enable/disable collecting alerts via that data stream. This should be on by default to not impact existing users while allowing folks using the alerts datastream to not have the duplicated alerts.

Interestingly, I didn't add SecurityAlert.Read.All and it seems to work fine with just SecurityIncident.Read.All. Logically this makes sense, because the alerts can be read anyhow via the incidents endpoint. It's odd Microsoft doesn't state this permission will work for /alerts_v2 though, or at least I can't find any documentation that says it should.

Having the alert evidence separate would still be helpful, but I get that the PR was made even before the issue was filed.

cpascale43 commented 1 month ago

Hiya @piyush-elastic, passing along a recent feature request that I think might be similar - FYI @Leaf-Lin in case you want to share any extra context.

Feature requests:

M365 alerts can be “duplicated”. This is an issue with the REST API itself, which represents each record as a historical timeline instead of an object.

We believe a “latest” transform grouping by the alert ID would fix this largely. The general issue we see is AutoIR (Microsoft “response actions”) closing cases a second or so after they are created. If we had the latest data, we could simply exclude resolved alerts from generating alerts within the security app.
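
The “latest” idea can be sketched in plain Python as keeping the most recent record per alert ID and then excluding resolved alerts. This only illustrates the grouping logic and is not an Elasticsearch transform definition; the records and function name are hypothetical.

```python
def latest_per_alert(records: list[dict]) -> dict[str, dict]:
    """Keep only the most recently updated record for each alert ID."""
    latest: dict[str, dict] = {}
    for record in records:
        current = latest.get(record["id"])
        # ISO 8601 UTC timestamps in the same format compare correctly as strings.
        if current is None or record["lastUpdateDateTime"] > current["lastUpdateDateTime"]:
            latest[record["id"]] = record
    return latest


records = [
    {"id": "a1", "status": "new", "lastUpdateDateTime": "2024-02-02T11:20:00Z"},
    # Closed by an automated response action a second after creation.
    {"id": "a1", "status": "resolved", "lastUpdateDateTime": "2024-02-02T11:20:01Z"},
    {"id": "a2", "status": "new", "lastUpdateDateTime": "2024-02-02T11:21:00Z"},
]
latest = latest_per_alert(records)
open_alerts = sorted(a["id"] for a in latest.values() if a["status"] != "resolved")
```

Only a2 remains in open_alerts, since the latest state of a1 is resolved.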

This would still leave the issue where the Elastic case could de-sync from the M365 case, but compared with the major issue noted above, this is a minor one.

M365 data could improve the use of ECS fields, for example, host.name is rarely populated, even if multiple related.hosts are identified.
