Remove `event.ingested` processor from ingest pipelines

gsantoro commented 2 years ago

As part of a PR it has been brought to my attention that we shouldn't add event.ingested to ingest pipelines since that field is already added by a default fleet pipeline.

According to @andrewkroh here and here, we shouldn't add the field event.ingested in individual integrations since this fleet pipeline runs both in managed and standalone mode.

I would like to use this issue to agree on this topic and create some action points like:

add this suggestion to our internal docs so that we can refer to them in the future
create PR to remove event.ingested if necessary from ingest pipeline in integrations.

ruflin commented 2 years ago

I want to take this as a chance to take a step back. Why do we need event.ingested in the first place? My understanding is that it is useful for monitoring (ingestion delay) and for security purpose. On the second point, I assume @andrewkroh can add a few more bits on how it is used.

Both of the above are not package related but "centralised" concers. Also the addition of the data can only happen centrally. Based on this, I agree with not adding it to every package and have this as part of our docs.

As it is a centralised concern, who should be in control of it? Is it a global setting, is it per package or per data stream? One thing to consider is that timestamps in general have quite a bit of overhead as these cannot be compressed well (@jpountz might have here some more details).

Few questions that come to mind:

Should we allow users to turn on / off this feature?
Is the final pipeline the right place for this or are there better places like mappings, setting on a data stream?
Who is responsible for the mapping of event.ingested?
Does event.ingested have to be indexed or is stored enough?

@gsantoro If nobody objects, it seems we can move forward with your suggestions but I would like to keep this open to have a follow up discussions on the long term of event.ingested.

scunningham commented 2 years ago

I also am curious as to what we are using event.ingested for? It does seem useful for diagnostic reasons; it is interesting when there is a large lag between event creation and delivery. This may affect the efficacy of the SIEM as it is time window bound. May also demonstrate a clock skew issue.

However, does it necessarily have to be mapped? If it is not mapped, then the compression overhead concerns go away. We do still have pipeline overhead concerns.

ruflin commented 2 years ago

@gsantoro Is there any follow up issue or PR for your action points to track the progress?

@joshdover Pinging you on this one for awareness as event.ingested seems to be a more generic ingest challenge together with the final pipeline. @axw Is apm using event.ingested in any way?

gsantoro commented 2 years ago

Hello @ruflin , regarding the PR that created this discussion in the first place, it has now been merged as it was (without the event.ingested) as agreed.

I assume that if we agree to remove event.ingested from all the ingest pipeline, this would not happen in a single PR but would probably involve multiple teams according to the CODEOWNERS responsibilities.

andrewkroh commented 2 years ago

I also am curious as to what we are using event.ingested for?

These are the use cases that I am aware of.

It is used to prevent evasion of the detection rules ^1 that look at all data from the Agents for host masquerading. We use this as a trusted timestamp for a means of iterating over data in roughly the order in which it was ingested.
In general, the detection engine relies on event.ingested to deal with documents that are delayed ^3.
Diagnosing ingestion lag via analysis of differences between various timestamps.

andrewkroh commented 2 years ago

If it is not mapped, then the compression overhead concerns go away.

It is mapped as a date. The value does not contain subseconds to help improve compressibility. ^1

axw commented 2 years ago

@axw Is apm using event.ingested in any way?

No. I have used it in the past to calculate ingestion delay, but that was a one-off exercise, not something that is visualised in the product.

ruflin commented 1 year ago

The way I read the above comments, there are specific features which need event.ingested. But these are not features specific to integrations but in general to data ingestion. We can currently enable on data coming from Elastic Agent that uses integrations (my understanding is final pipeline is only applied for integrations) but if used, it would be ideal that it is applied on all data (at least all data coming in for the data stream naming scheme).

event.ingested is a feature that a user or a feature should be able to run on / off based on user needs. It can be turned on globally for evasion detection rules and more locally on some data streams for detecting ingest lag. This sounds a lot like an "ingest/fleet" feature @amitkanfer .

My current take is, we need to continue the discussion around this feature but with a more holistic view as part of ingest and is not tied to any of the integrations. Because of this, I'm closing this issue but I think we need a separate issue for further discussion.

@gsantoro @tommyers-elastic Can we ensure that the decision not adding event.ingested is documented? https://github.com/elastic/integrations/issues/4462#issuecomment-1298597103 @jsoriano Should we even have a check?

jsoriano commented 1 year ago

Should we even have a check?

A check to disallow adding event.ingested from integrations? This could be an option, yes.

ruflin commented 1 year ago

@jsoriano Yes. Or in a way that someone could still overwrite it but has to set a flag. But we can do that if it is actually needed.

elastic / integrations

Remove `event.ingested` processor from ingest pipelines #4462