elastic / ecs

Elastic Common Schema
https://www.elastic.co/what-is/ecs
Apache License 2.0

[meta] Add support for pipeline details in ECS #940

Open webmat opened 4 years ago

webmat commented 4 years ago

We'd like to define how to capture pipeline details in ECS.

Pipelines can come in many shapes

We'd like to define a way, or at least provide guidance, on how to capture information about various kinds of pipelines. The information folks usually want to track falls into several categories, usually per step of the pipeline.

Past discussions around this:

#8, #40, #76, #154, #315, #453, #700, #730, #1027, #1059

dainperkins commented 3 years ago

Adding agent.ip, and maybe even agent.hostname, seems like a good plan to allow full identification of e.g. the host Filebeat is running on in e.g. a syslog or API-pull scenario.

While I've used agent to describe Logstash (e.g. for netflow/syslog data), it seems like another field set to identify a "collector", and possibly a "queue", might make sense for end-to-end descriptions of all entities involved in data ingest.

Though that would not easily handle the dual-Logstash setup with an agent (assuming the agent -> Logstash hop is necessary), which would suggest an array of hosts/IPs?

ypid-geberit commented 3 years ago

I am proposing something like agent.config_version be added as well. Background: pipeline config should be tracked in git. When the parsing changes, a similar log event might then be parsed differently for no apparent reason from the end user's perspective (in case they look at event.original and which fields are populated). It could be useful, at least when developing/testing new log types/pipelines, to communicate the config/pipeline version to (test) users. As the value, I would suggest the output of git rev-parse --short HEAD as an example.
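A minimal sketch of this idea in Python, assuming the pipeline config lives in a git checkout; the helper names (current_config_version, tag_event) are hypothetical, not part of ECS or any Elastic tool:

```python
import subprocess

def current_config_version(repo_dir="."):
    # Hypothetical helper: shell out to git for the short commit hash
    # of the pipeline-config checkout, as suggested above.
    return subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout.strip()

def tag_event(event, config_version):
    # Record the hash under the proposed agent.config_version field.
    event.setdefault("agent", {})["config_version"] = config_version
    return event
```

An ingest step could call tag_event(event, current_config_version()) once per batch, letting readers correlate a change in parsed fields with a specific config commit.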

rsk0 commented 2 years ago

Articulated Arbitrarily-Structured Pipeline Details

Regarding timing, and considering just the example of arrival timestamps...

It's very important to be able to see latency throughout logging infrastructure. ECS currently offers only three timestamps for this purpose: event occurrence (@timestamp), record picked up by the pipeline (event.created, a confusing name), and record received into the data store (event.ingested). This small set of fields is too coarse and inflexible to handle sophisticated or large-scale operations well.
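To make the coarseness concrete: only two intervals are derivable from those three standard fields. A small sketch, assuming epoch-millisecond values and a flattened event dict (the function name is illustrative, not an ECS API):

```python
def coarse_latencies_ms(event):
    # The only two intervals ECS's three standard timestamps allow:
    #   pickup_ms:  occurrence -> picked up by the pipeline
    #   transit_ms: picked up  -> received into the data store
    # Values are assumed to be epoch milliseconds.
    return {
        "pickup_ms": event["event"]["created"] - event["@timestamp"],
        "transit_ms": event["event"]["ingested"] - event["event"]["created"],
    }
```

Everything that happens between event.created and event.ingested collapses into a single transit number, which is the motivation for the juncture fields proposed next.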

In order to support arbitrary pipeline arrangements, what if we had two parallel arrays, one for juncture names and one for juncture arrival times?

@timestamp = 1660329591000
event.created = 1660329592000
foo.pipeline.junctures = [ "fluent bit source", "kafka amsterdam", "message classification processor", "logstash us-east-1" ]
foo.pipeline.arrivals = [ 1660329593000, 1660329594000, 1660329595000, 1660329596000 ]
event.ingested = 1660329597000
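With the parallel arrays, per-hop latency falls out of simple differencing. A sketch reusing the example values above (the function name is hypothetical):

```python
def hop_latencies_ms(created, junctures, arrivals):
    # Time spent reaching each juncture, measured from the previous
    # arrival (or from event.created for the first hop).
    previous = created
    latencies = []
    for name, arrived in zip(junctures, arrivals):
        latencies.append((name, arrived - previous))
        previous = arrived
    return latencies
```

With the values above (event.created = 1660329592000 and arrivals one second apart), each hop reports 1000 ms, so a slow stage would stand out immediately.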

Alternatively, a list of objects.

@timestamp = 1660329591000
event.created = 1660329592000
foo.pipeline.times = [
  { "junction_name": "fluent bit source", "arrival_time": 1660329593000 },
  { "junction_name": "kafka amsterdam", "arrival_time": 1660329594000 },
  { "junction_name": "message classification processor", "arrival_time": 1660329595000 },
  { "junction_name": "logstash us-east-1", "arrival_time": 1660329596000 }
]
event.ingested = 1660329597000
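The two shapes carry the same information, so tooling could accept either. A small sketch converting the list-of-objects form into the parallel arrays, using the key names from the example above:

```python
def to_parallel_arrays(times):
    # Flatten the list-of-objects form into the two parallel arrays
    # used by the first variant.
    names = [t["junction_name"] for t in times]
    arrivals = [t["arrival_time"] for t in times]
    return names, arrivals
```

The list-of-objects form keeps each name paired with its time in one place, at the cost of nested documents; the parallel arrays are flatter but rely on the two arrays staying index-aligned.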

Either way, I don't think Kibana would be able to visualize this data. Still, the information would be there for operators to use via other methods.

ECS's Values-Agnostic Philosophy

I think ECS development so far has shied away from specifying values, the content of fields, and I understand this is important for fostering adoption by staying open to various sources/implementations. (Let me know if I'm reading things correctly.) However, if that's a firm philosophical stance for ECS development, I suspect the issue of articulated pipeline-details recording can't then be well handled via ECS per se.

I think there's an interoperability cost if ECS doesn't at least make recommendations about values. As a specific example, my company is having to devise its own set of severity-level values; we likely won't do a better job than numerous companies collaborating around ECS, and we certainly can't be as effective at encouraging broad adoption of the scheme, and thus interoperability, as Elastic/ECS would be.

Non-binding recommendations about values could continue ECS's non-specificity tack, avoiding blocking adoption, while at the same time helping foster interoperability via a kind of "proto-standard". I'm thinking something like RFC "MAY" / "OPTIONAL".

log.level: The textual severity level of the original event. It can be whatever the heck you please. As one possibility, it could be -- no pressure -- these values with these meanings (just some ideas, we don't know, whatever floats your boat, friend):

"trace": blah
"debug": blah
"information": blah
...

Maybe, if ECS stewardship wants to maintain a firm stance on values agnosticism, we could benefit from a consortium of ECS-using companies developing a values recommendation addendum to ECS.

rsk0 commented 1 year ago

Timing: plain timestamps at each step, processing duration per step

Just a note that #1059 (roughly "fix the event.ingested and event.created fields") rolls up into this issue -- to make sure those bugs get addressed when this one is.