elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Add event.original setting to data stream #100320

Open ruflin opened 9 months ago

ruflin commented 9 months ago

event.original is an ECS field that is useful in many scenarios, especially in a security context. Currently, many integrations add it as part of their ingest pipeline. Fleet also offers an option to opt into the field, but support for it has to be built into each integration. For more details, see https://github.com/elastic/integrations/issues/4733

There are several problems with the current approach:

Instead of repeating the same logic in many places, I propose adding a data stream setting that controls whether the field is added, something like:

"data_stream": {
  "event.original": true
}

This means the data stream, not the integration, decides whether event.original is captured. Many integrations can be used for observability or security. If the use case is security, event.original can be turned on for all datasets without modifying any integrations.

When data is routed, this would also ensure that event.original contains the data as it was before routing, provided event.original: true is set on the data stream that triggers the routing.

Expected behaviour

The behaviour of the setting would be as follows:

Change in integrations

It seems that integrations which currently add event.original manually (1, 2) rename message to event.original and then do all the processing on event.original. I propose changing this so that all processing stays on message, since integrations would then always have to assume that event.original might not be present.
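
For illustration, the rename pattern described above might look roughly like the following ingest pipeline fragment. This is a minimal sketch, not copied from any specific integration: it assumes the raw event arrives in message, and the condition shown is an assumption that can differ between integrations.

```json
{
  "description": "Illustrative sketch only, not taken from a specific integration",
  "processors": [
    {
      "rename": {
        "description": "Assumed pattern: move the raw message into event.original unless it is already set",
        "field": "message",
        "target_field": "event.original",
        "ignore_missing": true,
        "if": "ctx.event?.original == null"
      }
    }
  ]
}
```

With this pattern, later processors have to read from event.original, which is the dependency the proposal would remove by keeping processing on message.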

Questions

Links

ruflin commented 9 months ago

I had a good conversation with @P1llus about this issue. Currently, many integrations use event.original rather than the message field as the source for all processing. To ensure that event.original does not stick around after processing and use up lots of storage, the final pipeline has a remove processor that checks for the tag preserve_original_event and removes event.original if the tag is not set.
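
As a rough sketch of the cleanup step described above, a remove processor conditioned on the preserve_original_event tag could look like this. The exact condition and options are assumptions for illustration and may not match what any given integration or final pipeline ships.

```json
{
  "description": "Illustrative sketch only, exact condition and options are assumptions",
  "processors": [
    {
      "remove": {
        "description": "Drop event.original unless the preserve_original_event tag is present",
        "field": "event.original",
        "ignore_missing": true,
        "if": "ctx?.tags == null || !(ctx.tags.contains('preserve_original_event'))"
      }
    }
  ]
}
```

The null check on tags keeps the condition from failing for documents that carry no tags at all, in which case event.original is removed.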

The above could mean that two config options are needed:

I question whether event.original should be used as the source for processing instead of message, but it seems to be the current default in many integrations. As @P1llus mentioned, it also has advantages: it can be used for reindexing if needed, and the same pipeline still works.

@P1llus In scenarios where the original event is not in message, how do integrations handle this at the moment? Do they pick the field that holds it and put it into event.original?

elasticsearchmachine commented 9 months ago

Pinging @elastic/es-data-management (Team:Data Management)