elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Add setting for event.ingested to data stream #100324

Open ruflin opened 9 months ago

ruflin commented 9 months ago

Fleet currently installs a final pipeline with a script processor that adds the event.ingested field. The script does some rounding to make the field more storage efficient. event.ingested is used either for debugging or to detect ingestion drift.
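As an illustrative sketch (not the exact Fleet pipeline, which also rounds the timestamp via a script processor), a final pipeline that stamps event.ingested could look like:

```json
{
  "description": "Illustrative final pipeline that stamps event.ingested",
  "processors": [
    {
      "set": {
        "field": "event.ingested",
        "value": "{{{_ingest.timestamp}}}"
      }
    }
  ]
}
```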

The final pipeline does a few other things and is only available for datasets that have an integration. Any other dataset created under the data stream naming scheme does not get the same feature unless a user manually adds the final pipeline.

I'm proposing to make event.ingested a feature of a data stream instead. The feature could be enabled either through a setting or through a tag on the data. The setting could look as follows:

"data_stream": {
  "event.ingested": true
}
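Assuming the setting lived alongside the existing data_stream object in an index template (the setting name and placement are part of this proposal, not an existing API), enabling it for a data stream might look like:

```json
PUT _index_template/logs-custom
{
  "index_patterns": ["logs-custom-*"],
  "data_stream": {
    "event.ingested": true
  }
}
```

Here `logs-custom` is a hypothetical template name used for illustration.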

If it is set to true, the last thing a data stream does before persisting a document to disk would be to add the current timestamp, potentially with rounding. If the data flows through routing pipelines, only the last data stream that has this enabled will add the field before ingestion.

For debugging a single agent, it should also be possible to trigger this feature from the data itself. For example, if an event comes in with _record_event_ingested set, the data stream would add the field automatically even if the setting above is not enabled.
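A document carrying such a per-document trigger (using the _record_event_ingested flag named above; the index name and fields are illustrative) could look like:

```json
POST logs-custom-default/_doc
{
  "@timestamp": "2024-07-01T12:00:00Z",
  "message": "sample event for agent debugging",
  "_record_event_ingested": true
}
```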

As part of this feature, it is important to also consider the default mapping for event.ingested. More discussions can be found here: https://github.com/elastic/integrations/issues/4894
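A possible default mapping is sketched below; the format choice reflects the rounding to whole seconds mentioned above, but the linked issue discusses the actual trade-offs and nothing here is decided:

```json
{
  "mappings": {
    "properties": {
      "event": {
        "properties": {
          "ingested": {
            "type": "date",
            "format": "date_time_no_millis"
          }
        }
      }
    }
  }
}
```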


elasticsearchmachine commented 9 months ago

Pinging @elastic/es-data-management (Team:Data Management)

joegallo commented 8 months ago

https://www.elastic.co/guide/en/elasticsearch/reference/5.5/breaking_50_mapping_changes.html#_literal__timestamp_literal_and_literal__ttl_literal adding a link to this for my own notes

dakrone commented 8 months ago

@ruflin can you explain a little bit more of the backstory for this? We used to have something like this in the _timestamp meta field, which was deprecated and removed because the same thing could be done with an ingest pipeline. This sounds a little bit like going back to the same behavior as _timestamp. I also know there are some other features that you've been brainstorming, such as automatically generating @timestamp if it's missing from the document at index time. I don't want to repeat the path of having a feature in ES, removing it because it can be done in a pipeline, and then re-adding it to ES as a built-in.

A question about doing it per-document, is the reason for this to detect and understand skew? Or is there a different use for this ingested timestamp?

ruflin commented 8 months ago

@dakrone For more use cases, have a look here: https://github.com/elastic/integrations/issues/4894#issuecomment-1762140488

As you point out, all these things can be done in an ingest pipeline. The problem we are facing is that we want this to work out of the box for all data coming in under the data stream naming scheme, and we also do not want users to be able to break it. It is becoming a pattern that we want different defaults for Observability data along with additional logic. Would it be possible to decouple these features into a plugin, so that we hook into a data stream extension point instead of having to modify data streams directly?

sophiec20 commented 1 month ago

I am a big fan of event.ingested.

Things that analyze recently ingested data (e.g. rules, ML analysis, transforms, entity identification) benefit from knowing event.ingested. If these analytical processes cannot identify recently ingested data, then for data that arrives out of time order they either have to:

a) ignore out-of-order data, or
b) query late (e.g. wait 5-10 minutes) and ignore anything older, or
c) re-process recent data and manage duplicates (heavy on resources), or
d) have Elasticsearch identify changes (including for CCS and at scale), or
e) fail to ingest out-of-order data (i.e. ignore it), or
f) something else (ideas welcome), which probably uses far more compute and memory than storing one extra field.
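For example, an analytical process could cheaply pick up newly ingested documents with a range query on event.ingested, independent of @timestamp order (illustrative request; the index name is hypothetical):

```json
GET logs-custom-default/_search
{
  "query": {
    "range": {
      "event.ingested": {
        "gte": "now-10m"
      }
    }
  }
}
```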

I think it would be beneficial if we could make this simpler to configure and set up.