Open ishleenk17 opened 1 month ago
One possible option to investigate, which would not require maintaining state in the remapping code, is described in https://github.com/elastic/opentelemetry-lib/pull/2#discussion_r1604370567
I can confirm that with the APM-Server ingestion path the UIs (at least the network ones) are breaking without metricset.period set:
One possible option to investigate, which would not require maintaining state in the remapping code, is described in https://github.com/elastic/opentelemetry-lib/pull/2#discussion_r1604370567
@axw That sounds like a good idea, I will give this a shot.
@axw Looks like we can't use the StartTimestamp approach. As per the metrics data model:
Delta temporality sums would work, but unfortunately we don't have any metrics with delta temporality.
@axw maybe there is some hope and we can get some delta temporality. As per my investigation so far, metricset.period is required for the host.network.* metrics. The current mappings don't produce these metrics; I have created an issue for adding them: https://github.com/elastic/opentelemetry-lib/issues/14. As per the definition, these metrics are aggregates across ALL network interfaces since the last metric collection (for example, host.network.ingress.bytes is the number of bytes received), which is why they need the collection interval to be interpreted.
IIUC, this means that the current approach used for the system network metrics will not work here, since it consumes cumulative, monotonic metrics and produces counters. Instead, we would need to use a cumulative-to-delta processor in the otel-collector to convert these metrics to delta temporality and then produce the host.network.* metrics. Doing this will also give us the collection interval as Timestamp#Millis - StartTimestamp#Millis (as suggested by @axw earlier).
CC: @tommyers-elastic @ishleenk17 (let me know if this doesn't make sense)
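To illustrate why the StartTimestamp approach depends on temporality, here is a minimal sketch. `DataPoint` and `period_ms` are hypothetical names made up for this example and are not the pdata API or anything in opentelemetry-lib. It assumes delta points start where the previous point ended, while cumulative points keep a fixed start.

```python
from dataclasses import dataclass

NANOS_PER_MILLI = 1_000_000
SECOND = 1_000_000_000  # nanoseconds


@dataclass
class DataPoint:
    """Hypothetical stand-in for an OTel number data point."""
    start_time_unix_nano: int
    time_unix_nano: int
    value: float


def period_ms(point: DataPoint) -> int:
    """Timestamp - StartTimestamp, in milliseconds."""
    return (point.time_unix_nano - point.start_time_unix_nano) // NANOS_PER_MILLI


# Delta temporality: each point starts where the previous one ended,
# so the difference equals the collection interval (10s in this example).
delta_points = [
    DataPoint(0 * SECOND, 10 * SECOND, 500),
    DataPoint(10 * SECOND, 20 * SECOND, 700),
]

# Cumulative temporality: the start stays fixed, so the difference keeps
# growing and cannot be used as the collection interval.
cumulative_points = [
    DataPoint(0, 10 * SECOND, 500),
    DataPoint(0, 20 * SECOND, 1200),
]

delta_periods = [period_ms(p) for p in delta_points]
cumulative_periods = [period_ms(p) for p in cumulative_points]
print(delta_periods)       # [10000, 10000]
print(cumulative_periods)  # [10000, 20000]
```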
It looks like, on Elastic's end, the system.network.* metrics are expected to be cumulative sums with monotonically increasing values, whereas the host.network.* metrics are expected to be delta temporality sums. Unfortunately, both are derived from the same OTel metrics from the network scraper, which are cumulative sums with monotonically increasing values. This poses an interesting problem, since we need both temporalities at once. I can think of a few ways we can do this:
1. [...] system.network.* metrics.
2. [...] host.network.*. The processor would aggregate all the network interfaces for the metrics produced by the network scraper into a different metric and then convert it to delta temporality, which we could process.
I think 2 is a better option, and we can package the required OTel config in a single configuration file.
I don't love having to require any aggregation in the collector. Either it's on by default and may be a footgun for centralised collectors, or it requires opt-in and makes it harder to onboard.
If those are the only metrics where metricset.period is used, could we make a very targeted change to the UI to handle both OTel & Metricbeat network metrics? Then we can (a) forget about setting metricset.period, and (b) not have to do cumulative-to-delta conversion for this case.
If there are more cases that we also need to solve, then we would need a more general fix.
I don't love having to require any aggregation in the collector. Either it's on by default and may be a footgun for centralised collectors, or it requires opt-in and makes it harder to onboard.
Hmm, didn't get this point. IIUC, hostmetricsreceiver won't be required for centralised collector but for on-edge collectors maintained by customers to monitor their hosts/nodes. We already have a bunch of non-optional metrics that we require to be turned on, so we would need to publish a recommended minimal config for hostmetrics to work anyway, and we can add the aggregation processor there as well.
If those are the only metrics where metricset.period is used, could we make a very targeted change to the UI to handle both OTel & Metricbeat network metrics?
This would probably be the best approach, given that we already publish the system.network.* metrics: the UI could just compute a rate over them and sum the rates for all available network interfaces on the host. However, since system.network.* is a cumulative metric, I am not sure whether there are other edge cases here for the UI.
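A rough sketch of the "rate per interface, then sum" idea, written as plain Python rather than as anything the UI actually runs. The function names and sample data are hypothetical. Counter resets are one edge case a cumulative metric brings, and the sketch assumes a drop in value means the counter restarted from zero.

```python
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, cumulative byte count)


def counter_rate(samples: List[Sample]) -> float:
    """Average per-second rate, treating a drop in value as a counter reset."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # Assumption: after a reset the counter restarts from zero,
        # so the new value is counted in full.
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def host_rate(per_interface: Dict[str, List[Sample]]) -> float:
    """Sum the per-interface rates to get a host-level rate."""
    return sum(counter_rate(s) for s in per_interface.values())


samples = {
    "eth0": [(0, 1000), (10, 2000), (20, 3000)],
    "eth1": [(0, 500), (10, 100), (20, 300)],  # counter reset before t=10
}
total = host_rate(samples)
print(total)  # 115.0 bytes/s: 100.0 from eth0 plus 15.0 from eth1
```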
Hmm, didn't get this point. IIUC, hostmetricsreceiver won't be required for centralised collector but for on-edge collectors maintained by customers to monitor their hosts/nodes.
I was thinking of the case where the receiver is at the edge, and this processor is centralised. Like in apm-server, say. So I guess that's really just an argument for not doing option (1).
Created a PoC to get the approach with a metrics transform processor to work: https://github.com/elastic/opentelemetry-lib/pull/16 (haven't validated the correctness of the final metrics yet).
The PoC uses the following processors to produce host.network.* metrics:
processors:
  metricstransform:
    transforms:
      - include: "^system\\.network\\.(io|packets)$$"
        match_type: regexp
        action: insert
        new_name: host.network.$${1}
        operations:
          - action: aggregate_labels
            label_set: [direction]
            aggregation_type: sum
  cumulativetodelta:
    include:
      metrics:
        - "^host\\.network\\.(io|packets)$$"
      match_type: regexp
This also allows us to set the metricset.period based on StartTimestamp and Timestamp.
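For readers less familiar with these two processors, the following is a simplified simulation of what the pipeline is intended to do: sum across interfaces while keeping only the direction label, then turn cumulative values into deltas. It is a sketch, not the collector's implementation. The `Point` type, the function names and the sample values are all hypothetical, and dropping the first point (which has no baseline) is a choice made for this example.

```python
from typing import Dict, List, NamedTuple, Tuple


class Point(NamedTuple):
    """Hypothetical simplified data point for system.network.io."""
    device: str
    direction: str
    start: int  # StartTimestamp, seconds
    time: int   # Timestamp, seconds
    value: int  # bytes


def aggregate_by_direction(points: List[Point]) -> List[Point]:
    """Sum values across devices, keeping only the direction label."""
    totals: Dict[Tuple[str, int], Point] = {}
    for p in points:
        key = (p.direction, p.time)
        if key in totals:
            totals[key] = totals[key]._replace(value=totals[key].value + p.value)
        else:
            totals[key] = p._replace(device="")
    return list(totals.values())


def cumulative_to_delta(points: List[Point]) -> List[Point]:
    """Emit the change since the previous point for each direction."""
    last: Dict[str, Point] = {}
    deltas: List[Point] = []
    for p in sorted(points, key=lambda q: q.time):
        prev = last.get(p.direction)
        last[p.direction] = p
        if prev is not None:  # the first point only primes the state
            deltas.append(p._replace(start=prev.time, value=p.value - prev.value))
    return deltas


scraped = [
    Point("eth0", "receive", 0, 10, 1000),
    Point("eth1", "receive", 0, 10, 200),
    Point("eth0", "receive", 0, 20, 1600),
    Point("eth1", "receive", 0, 20, 300),
]
deltas = cumulative_to_delta(aggregate_by_direction(scraped))
period_ms = (deltas[0].time - deltas[0].start) * 1000
print(deltas[0].value, period_ms)  # 700 10000
```

In this sketch, the period comes out as the gap between two consecutive scrapes, which is the value metricset.period is meant to carry.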
@lahsivjar nice. Perhaps we could:
[...] host.network.(io|packets) metrics that have cumulative temporality with a link to those docs, and skip remapping for them
The metricset.period was removed from the code as part of this PR. The open questions are where it should be set (in the library code or in the APM/processor code) and what the appropriate way to calculate it would be.