Schema for metrics - Githubissues

axw commented 5 years ago

In https://github.com/elastic/beats/issues/11836 we identified the need to maintain some kind of central definition of metrics that are implemented by both Beats and Elastic APM. The APM agents now produce a subset of system.cpu.*, system.memory.*, and system.process.* metrics defined by Metricbeat. The APM agents are now starting to define language/runtime-specific metrics, and Metricbeat will also start producing at least some of these.

We would like to propose extending ECS to cover metrics.

Organisation

I propose we introduce new field sets for:

JVM (jvm.*)
Go (golang.*)
Node.js (nodejs.*)
.NET (dotnet.*)
Python (python.*)
Ruby (ruby.*)

These field sets may contain a mixture of metrics (e.g. heap usage, GC pause timings) and runtime details (e.g. JVM implementation name and version).

Existing system metrics should be renamed to fit into the existing ECS field sets, e.g.

system.cpu.* -> host.cpu.*
system.memory.* -> host.memory.*
system.process.* -> process.*

Naming conventions

Ideally, we would extend the conventions documentation to cover metric naming. Specifically, we should ensure the consistent use of units in metric names, to ensure metrics are discoverable and self-documenting, both of which are important in Metrics Explorer.

One challenge here is that some existing metrics have inconsistent naming. For example:

system.diskio.read.count, system.process.cpu.user.ticks (we should either pick .reads or .ticks.count)
system.process.cpu.total.value (.value isn't meaningful, and there's no unit)
system.memory.total vs. system.memory.used.bytes (both are bytes, but only one says so in the name)

We should review https://github.com/elastic/beats/blob/master/docs/devguide/event-conventions.asciidoc#standardised-names, revising it for inclusion in the ECS conventions.

Key points:

all metrics must specify a unit. This means finding an alternative to the following rule from the Beats conventions, since "value" is not a unit:

If a field name matches the namespace used for nested fields, add .value to the field name.
units are not necessary SI, they may be domain-specific. e.g. "objects", "mallocs" are fine

Alternatively to including units in names, we could wait for https://github.com/elastic/elasticsearch/issues/31244 or https://github.com/elastic/elasticsearch/issues/33267. However, there are existing fields in ECS that include units (network.bytes, http.request.bytes, ...), so it may be best to not wait, and do that a future revision across the board.

Open questions:

What do we do about existing non-compliant naming? A few options:
- Rename in Elastic Stack 8.0. This may mean breaking existing dashboards, or introducing field aliases. This option is implied in the proposed change to splitting the system metrics into host.* and process.* field sets above.
- Include existing metrics as-is, with a TODO to make their naming consistent in the future.
- Don't include these existing metrics in ECS.

Metrics implementation guide

The definition of metrics is sometimes subtle and platform-specific; without providing a guide to this, it is easy for inconsistencies to arise in implementations. We should provide a detailed guide to calculating these metrics either in ECS, or in a companion document linked from ECS.

watson commented 5 years ago

Great write-up @axw 😃

These field sets may contain a mixture of metrics (e.g. heap usage, GC pause timings) and runtime details (e.g. JVM implementation name and version).

Could you elaborate on why the metrics would contain "runtime details" like names and versions?

system.diskio.read.count, system.process.cpu.user.ticks (we should either pick .reads or .ticks.count

Do you mean .ticks or ticks.count? If it's the latter, then please elaborate on why .ticks isn't enough.

I also just want to mention that in the cases where the metric that an APM agent wants to collect, exists across all languages and is counted in the same manner, we should strongly consider a shared namespace for it instead of having the same metric repeated under each language namespace. E.g. in the Node.js agent want to count the number of user and system CPU ticks for the process. This is a prime candidate for standardization and would be system.process.cpu.{user|system}.ticks under the existing system metrics.

axw commented 5 years ago

These field sets may contain a mixture of metrics (e.g. heap usage, GC pause timings) and runtime details (e.g. JVM implementation name and version).

Could you elaborate on why the metrics would contain "runtime details" like names and versions?

I just meant that the field sets (e.g. host.*, jvm.*) would not be restricted to only metrics or only non-metrics. For example, https://www.elastic.co/guide/en/ecs/current/ecs-network.html has network.bytes and network.name.

Do you mean .ticks or ticks.count? If it's the latter, then please elaborate on why .ticks isn't enough.

No, I meant rather that we should pick one format and be consistent. We're currently using "read.count" for one metric, and "ticks" for another metric. I tend to think we should use ".ticks".

I also just want to mention that in the cases where the metric that an APM agent wants to collect, exists across all languages and is counted in the same manner, we should strongly consider a shared namespace for it instead of having the same metric repeated under each language namespace. E.g. in the Node.js agent want to count the number of user and system CPU ticks for the process. This is a prime candidate for standardization and would be system.process.cpu.{user|system}.ticks under the existing system metrics.

:+1: In this specific example, I'd say those metric fields belong in the process field set.

ruflin commented 5 years ago

Thanks for this great writeup @axw . I think it's a good time to challenge the metricbeat naming convention and improve them. I would really like to have https://github.com/elastic/elasticsearch/issues/31244 but agree we should not be blocked by it and is a reason some fields in ECS already today have .bytes.

For the migration: We have the fields alias feature so I would directly start with the names we want in the future and potentially alias old fields (if possible).

To move this effort forward we could start with a PR for one specific environment like nodejs or Golang (or both). Haven specific fields to discuss will indirectly drive also the discussion about building / updating the convention. @axw @watson Willing to open PR's against https://github.com/elastic/ecs/tree/ecs-metrics ?

axw commented 5 years ago

For Go it's a little tricky, since we've got "golang" metrics coming from Metricbeat already. There's a PR ongoing for the Node.js agent which we can extract the new fields from: https://github.com/elastic/apm-agent-nodejs/pull/1021/files. I'll take a look at doing that later to get things started, unless @watson or @Qard particularly want to :)

watson commented 5 years ago

No, I meant rather that we should pick one format and be consistent. We're currently using "read.count" for one metric, and "ticks" for another metric. I tend to think we should use ".ticks".

Got'ya, It's just that you wrote .ticks.count which confused me. But I guess that was a typo right?

There's a PR ongoing for the Node.js agent which we can extract the new fields from: https://github.com/elastic/apm-agent-nodejs/pull/1021/files. I'll take a look at doing that later to get things started, unless @watson or @Qard particularly want to :)

If you could take it, it would be great 👍

webmat commented 5 years ago

Thanks for this detailed writeup, @axw! I love the attention given to improving the consistency of the naming of the various metrics.

One thing I'd like to challenge is the creation of metrics spaces per language. In ECS, I would much prefer having one place to nest runtime metrics, that we fill the best we can, depending on the runtime/language. The goal here is not to mix/consume the metrics of different runtimes together (although that opens up the possibility). In most cases, you'd of course filter for JVM stuff for a JVM-centric dashboard, for example (e.g. runtime.name:jvm). But having all runtimes fill the same common schema would enable the creation of common content or pipelines.

This may sound like a 180 degree change from this comment I made in April. But back then, I didn't expect this to be proposed to be added to ECS right away.

Right now I'm seeing a proposal to add what's essentially an unbounded amount of field sets to ECS, for the metrics of all the runtimes. I'd much rather have the common part of these metricsets be added to ECS in one common place, like runtime, and each runtime that has more than these metrics to expose, we add as custom fields in solutions.

One thing we have steered away from in the Elastic Common Schema is including brand or technology names (e.g. so it's container.*, not docker.*). There's no commonality in brand names :-)

axw commented 5 years ago

I'd much rather have the common part of these metricsets be added to ECS in one common place, like runtime, and each runtime that has more than these metrics to expose, we add as custom fields in solutions.

If we do that, we're kind of back to square one on https://github.com/elastic/beats/issues/11836. What we're trying to do here is define a common dictionary of metrics, many of which will be runtime-specific. e.g. "number of goroutines". Even "event loop delay" - while it may be common thing to a subset of languages, it's definitely not common to all so probably doesn't belong directly on "runtime".

If it were just APM implementing these metrics I'd be inclined to agree, we could just document the runtime-specific metrics separately. In fact, that's what we started out doing. Metricbeat will start gathering some of these metrics too.

@webmat Is the issue you have with "[adding] what's essentially an unbounded amount of field sets to ECS, for the metrics of all the runtimes" about the effect it'll have on the readability of the docs? (If so is it a matter of reorganising things somehow, to be nested under a common "runtime" namespace?) Or something else?

ruflin commented 5 years ago

I think we should stick to the plan for now to have one prefix for each runtime and we will figure out the generalisation later on. I still think we should do the same for kubernetes etc.

But there is the problem that @webmat mentioned about having in the end too many fields in the ecs template. I think it's time to come up with a solution here, for example introduce different parts of ECS and potentially have 1 template for each. So someone can only use basic if he wants or also get metrics.

In any case, I think this discussion should not block any progress to be made on the metrics branch as it having the fields already there will make the discussion more concrete.

roncohen commented 5 years ago

to summarize: we'll stage a proposal in the ECS metrics branch while we sort out the metrics for each runtime. For now we're going with individual namespaces for each runtime because the metrics are going to be quite different between runtimes. When there's something concrete for each runtime we can look at deriving common things and we'll discuss inclusion into ECS proper.

webmat commented 5 years ago

We've started a publicly viewable spreadsheet here

ruflin commented 5 years ago

One additional interesting note here that came out of https://github.com/elastic/ecs/pull/477: Prometheus adds the units to its metric names: https://prometheus.io/docs/practices/naming/#metric-names Perhaps to prevent future conflicts with objects vs keyword we could also use an underline notation like _seconds for the fields instead of the dot notation. Mainly thinking out loud, I still prefer what we have come up so far.

axw commented 5 years ago

@ruflin pointed out that we have more JVM metrics recorded for Logstash via its Node Stats API: https://www.elastic.co/guide/en/logstash/current/node-stats-api.html#jvm-stats. All very similar to what the APM Java agent is reporting, or will report in the future.

ruflin commented 4 years ago

Elasticsearch has a new histogram field in the work. This could become relevant for some of the metrics: https://github.com/elastic/elasticsearch/pull/48580

webmat commented 4 years ago

@axw @roncohen @watson Hey folks, I'd love to see this move forward again. Would be happy to help you on this :-)

ruflin commented 2 years ago

There is now a proposed convention for system metrics from otel: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/system-metrics.md

ruflin commented 2 years ago

I think it is really time we move forward here and come to a conclusion. We now also use some system metrics in the upcoming host view: https://github.com/elastic/kibana/pull/138173

I know this issue was initially opened in the context of apm and system metrics but to reduce scope, I propose to scale this down to only minimal set of system metrics. What @neptunian uses in https://github.com/elastic/kibana/pull/138173 seems like a good initial set.

@lalit-satapathy This would also be helpful to standardise on some host metrics from otel.

kgeller commented 2 years ago

Hi @ruflin! I think that introducing a minimal set of system metrics into ECS would be nice.

While not exactly the same, there's currently a stage 1 RFC for adding cgroup metrics into ECS.

ruflin commented 1 year ago

I opened a proposal PR for host metrics here: https://github.com/elastic/ecs/pull/2129 It is still in draft mode but I keep adding content and will convert it to a full RFC soonish.

dijkstrajesse commented 1 year ago

@ruflin Any updates to this? I'm working on a couple of projects for infra (network/file/system) monitoring and it would be very helpful for customers to refer to ECS metric field best practices instead of looking up system metrics over here: https://docs.elastic.co/en/integrations/system.

ruflin commented 1 year ago

@ChrsMark Can you chime in here? I think there are some moving parts here now related the merger with semconv.

ChrsMark commented 1 year ago

Hey! There is a WG taking place these weeks in Otel SemConv: https://github.com/open-telemetry/opentelemetry-specification/issues/3556 As part of this, we keep track of the System metrics alignment with Otel SemConv. I guess that's the best place to keep track of any changes.

elastic / ecs

Schema for metrics #474