elastic / integrations

Elastic Integrations
https://www.elastic.co/integrations

Integration:panw_cortex_XDR Fetching by creation_time results in missing events #10518

Closed: agmic closed this issue 2 weeks ago

agmic commented 1 month ago

The Cortex XDR integration uses "creation_time" for sorting and retrieving alerts from Cortex XDR, but should use "server_creation_time" instead.

"creation_time" is the time the alert is created on the original host and is related to the time of the original event. "server_creation_time" is the time that the alert is seen/ingested by Cortex XDR.

We have observed alerts missing from ingestion by the Elastic integration because of this discrepancy, for example when a host is offline at the time an alert is raised on the endpoint.

API docs: https://cortex-panw.stoplight.io/docs/cortex-xdr/branches/main/813e387002342-get-alerts-multi-events-v1
There is some discussion of the topic here: https://live.paloaltonetworks.com/t5/cortex-xdr-discussions/details-regarding-xdr-query-fields-server-creation-time-and/td-p/545596

Relevant code ("creation_time" should be changed to "server_creation_time"): https://github.com/elastic/integrations/blob/36eec7d998a8d6c3b01f81d293b14e3902eed329/packages/panw_cortex_xdr/data_stream/alerts/agent/stream/httpjson.yml.hbs#L44C1-L64C21

The cursor setting may also need to be changed from "detection_timestamp" to "local_insert_ts", which reflects the server_creation_time value.
https://github.com/elastic/integrations/blob/74a1d9076d20e687f7a982c9c37e47542bf713db/packages/panw_cortex_xdr/data_stream/alerts/agent/stream/httpjson.yml.hbs#L82

With regard to the timestamp of the event, we think the creation_time should be maintained here (event_timestamp from the alert), so no change to the pipeline is required.

I don't think this is a breaking change, but there is the risk of ingesting a few duplicate events after updating due to the cursor switch.
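The duplicate risk mentioned above can be illustrated with a small model. This is only a sketch: the timestamps are made-up epoch milliseconds, the `alert_id` field name is an assumption, and the stored cursor is treated as equal to the last event's creation_time. It shows that a cursor value saved under the old scheme, once compared against server_creation_time, can match alerts that were already ingested.

```python
# Illustrative model only: timestamps are made-up epoch milliseconds and
# the alert_id field name is an assumption made for this sketch.
already_ingested = [
    {"alert_id": "A", "creation_time": 900, "server_creation_time": 1100},
    {"alert_id": "B", "creation_time": 1000, "server_creation_time": 1005},
]

# Cursor stored before the upgrade: derived from the creation time of the
# last event that was ingested.
stored_cursor = already_ingested[-1]["creation_time"]

# First poll after the upgrade: the same stored value is now compared
# against server_creation_time with the "gte" operator.
refetched = [
    a["alert_id"]
    for a in already_ingested
    if a["server_creation_time"] >= stored_cursor
]

print(refetched)  # ['A', 'B']: both alerts match again and would be re-ingested
```

Under these assumptions the effect is limited to the first poll after the switch, because the cursor is then rewritten from local_insert_ts.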

elasticmachine commented 1 month ago

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

efd6 commented 3 weeks ago

@agmic The API documentation says that server_creation_time is an allowed value for filters.[:].field, but not for sort.field (the only values that are listed there are severity and creation_time). I think that at least making the filter change would improve the situation, but I can see conditions where we would be selecting the wrong element to get the filter value from if the sort order does not match.

Making the filter and cursor change would give us this:

diff --git a/packages/panw_cortex_xdr/data_stream/alerts/agent/stream/httpjson.yml.hbs b/packages/panw_cortex_xdr/data_stream/alerts/agent/stream/httpjson.yml.hbs
index 1027cd4b68..6a89531894 100644
--- a/packages/panw_cortex_xdr/data_stream/alerts/agent/stream/httpjson.yml.hbs
+++ b/packages/panw_cortex_xdr/data_stream/alerts/agent/stream/httpjson.yml.hbs
@@ -51,13 +51,13 @@ request.transforms:
     target: body.request_data.filters
     value: |-
       {
-        "field": "creation_time",
+        "field": "server_creation_time",
         "operator": "gte",
         "value": [[ .cursor.next_ts ]]
       }
     default: |-
       {
-        "field": "creation_time",
+        "field": "server_creation_time",
         "operator": "gte",
         "value": [[ mul (add (now (parseDuration "-{{initial_interval}}")).Unix) 1000 ]]
       }
@@ -81,7 +81,7 @@ response.pagination:
       fail_on_template_error: true
 cursor:
   next_ts:
-    value: "[[.last_event.detection_timestamp]]"
+    value: "[[.last_event.local_insert_ts]]"

 tags:
 {{#if preserve_original_event}}
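For readers less familiar with the httpjson template syntax, the request that the patched template produces can be approximated in Python. This is a rough sketch, not the integration's code: `build_request_data` and `initial_cursor_ms` are hypothetical helpers, the `"keyword": "asc"` sort direction is an assumption, and only the `request_data.filters` and `sort.field` names come from the discussion above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def initial_cursor_ms(now: datetime, initial_interval: timedelta) -> int:
    """Epoch milliseconds for (now - initial_interval), whole-second precision.

    Mirrors the template's default value: take the Unix time in seconds,
    then multiply by 1000.
    """
    return int((now - initial_interval).timestamp()) * 1000


def build_request_data(cursor_ms: Optional[int], now: datetime,
                       initial_interval: timedelta) -> dict:
    """Build a hypothetical request body for the alerts endpoint."""
    if cursor_ms is None:
        # No stored cursor yet: fall back to the initial interval.
        cursor_ms = initial_cursor_ms(now, initial_interval)
    return {
        "request_data": {
            "filters": [
                {
                    "field": "server_creation_time",
                    "operator": "gte",
                    "value": cursor_ms,
                }
            ],
            # The thread notes that sort.field only lists severity and
            # creation_time; the "asc" keyword is an assumption.
            "sort": {"field": "creation_time", "keyword": "asc"},
        }
    }


now = datetime(2024, 1, 2, tzinfo=timezone.utc)
first = build_request_data(None, now, timedelta(hours=24))
later = build_request_data(1704153600000, now, timedelta(hours=24))
print(first["request_data"]["filters"][0]["value"])  # 1704067200000
```

The filter and the sort deliberately use different fields here, which is the mismatch the surrounding comment is concerned about.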

Do you have additional information from elsewhere that shows that the server_creation_time value may be used in sort.field?

I'm not entirely sure why using the server creation time would prevent loss of events. Can you clarify this?
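The sort-order concern raised earlier in this comment can be made concrete with a toy page of results. The sketch below assumes that detection_timestamp tracks creation_time and local_insert_ts tracks server_creation_time, as the thread suggests; the `alert_id` field and all values are invented for illustration.

```python
# Hypothetical page of alerts returned by one request (epoch milliseconds).
page = [
    {"alert_id": "X", "detection_timestamp": 100, "local_insert_ts": 300},
    {"alert_id": "Y", "detection_timestamp": 200, "local_insert_ts": 210},
]

# Emulate the API ordering: results sorted by creation time, ascending.
page.sort(key=lambda a: a["detection_timestamp"])

# What a "last event" cursor would read after the change.
cursor_from_last = page[-1]["local_insert_ts"]

# The newest server-side insert time actually present in the page.
newest_insert = max(a["local_insert_ts"] for a in page)

print(cursor_from_last, newest_insert)  # 210 300
```

In this particular case the cursor lags behind the newest insert time, so a "gte" filter would re-fetch X rather than skip anything. Whether other orderings could cause alerts to be skipped is not established by this sketch.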

agmic commented 3 weeks ago

@efd6 No, sorry I don't see anywhere that it can be used to sort. Using it as the filter should do the trick, as the main issue is ensuring the event is fetched which should be accomplished by your edits.

Regarding the loss of events: AIUI, creation_time is based on the timestamp of the original event on the endpoint, and server_creation_time is when it hits the XDR platform. So if there is no connectivity from the endpoint to the platform for a period of time, the creation_time will fall outside of our cursor. I have one example in my environment where server_creation_time is three days after creation_time. Polling based on server_creation_time should bring this event in, whereas creation_time would miss it.
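The scenario described here can be played out in a simplified simulation. It is a sketch under stated assumptions, not a reproduction of the integration: `poll` and `run` are hypothetical helpers, timestamps are small made-up integers, and the cursor is taken from the last event of each poll on the same field that is used for filtering.

```python
def poll(platform, field, cursor):
    """Return alerts whose `field` is >= cursor, sorted by creation_time."""
    hits = [a for a in platform if a[field] >= cursor]
    return sorted(hits, key=lambda a: a["creation_time"])


def run(field):
    """Run two polls, filtering and advancing the cursor on `field`."""
    # A is already on the platform when the first poll happens.
    a = {"id": "A", "creation_time": 90, "server_creation_time": 91}
    # B was raised while its host was offline and is uploaded much later.
    b = {"id": "B", "creation_time": 50, "server_creation_time": 150}
    c = {"id": "C", "creation_time": 160, "server_creation_time": 161}

    seen = set()

    first = poll([a], field, 0)
    seen.update(x["id"] for x in first)
    cursor = first[-1][field]  # cursor taken from the last event returned

    # By the second poll, B and C have reached the platform.
    second = poll([a, b, c], field, cursor)
    seen.update(x["id"] for x in second)
    return seen


by_creation = run("creation_time")
by_server = run("server_creation_time")
print(sorted(by_creation))  # ['A', 'C']
print(sorted(by_server))    # ['A', 'B', 'C']
```

In this model B is never returned when filtering on creation_time, because its creation time (50) is already behind the cursor (90) by the time it reaches the platform. A is returned twice in both runs, a side effect of the inclusive "gte" comparison.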

efd6 commented 3 weeks ago

Thanks.

This is the bit that I am confused by:

"So if there is no connectivity from the endpoint to the platform for a period of time, the creation_time will fall outside of our cursor."

If we are updating our cursor on the basis of creation_time (indirectly via detection_timestamp) then surely we will not be advancing the cursor past events that we have not collected. At least not unless there is out-of-order propagation of events, but this would also be true for the case where we are using server_creation_time and local_insert_ts. Notwithstanding that, you say it works and so does the panw community discussion, so I'll send the change.

agmic commented 3 weeks ago

Maybe the confusion is that with the API, we are fetching from the XDR platform and not the individual device? Since creation_time is set based on when the event occurs on a device, if the device is unable to upload the event to the XDR platform in a timely manner (the upload time being when server_creation_time is set), the cursor risks being advanced beyond an event that was reported late. Since server_creation_time is set by the platform we are polling, AIUI that risk is gone.

Again, this is based on my understanding of the flow and what we have seen with the events. The Palo Alto documentation is not very explicit here.

efd6 commented 3 weeks ago

Thanks. Yeah, I tried all sorts of (IMO) sane manipulations that could possibly end up with data loss, and I could not think of one that would not also have the same effect when using the server's timestamp. But if we are happy with this, I'm OK with it.