elastic / opentelemetry-lib

Apache License 2.0
0 stars 6 forks source link

Set the metricset.period #9

Open ishleenk17 opened 1 month ago

ishleenk17 commented 1 month ago

The metricset.period was removed from the code as part of this PR. The discussion is where should this be present (in the library code or in APM/Processsor code)? Also what would be the apt way to calculate this.

axw commented 1 month ago

One possible option to investigate, which would not require maintaining state in the remapping code is described in https://github.com/elastic/opentelemetry-lib/pull/2#discussion_r1604370567

lahsivjar commented 1 month ago

I can confirm that with the APM-Server ingestion path the UIs (at least the network ones) are breaking without metricset.period set:

Screenshot 2024-05-17 at 3 00 32 PM

One possible option to investigate, which would not require maintaining state in the remapping code is described in https://github.com/elastic/opentelemetry-lib/pull/2#discussion_r1604370567

@axw That sounds like a good idea, I will give this a shot.

lahsivjar commented 1 month ago

@axw Looks like we can't use the StartTimestamp approach. As per the metrics datamodel:

  1. Gauges should have start timestamp set to the first possible moment a measurement could be recorded.
  2. Cumulative temporality sums will have the start timestamp set to when the reset happens.
  3. Delta temporality sums will work but unfortunately, we don't have any delta temporalities.
lahsivjar commented 1 month ago

Delta temporality sums will work but unfortunately, we don't have any delta temporalities.

@axw maybe there is some hope and we can get some delta temporality. As per my investigation so far, the metricset.period is required for host.network.* metrics. In the current mappings, we don't produce this metric. I have created an issue for adding these: https://github.com/elastic/opentelemetry-lib/issues/14. As per the definition these metrics are aggregates of bytes received across ALL network interfaces since the last metric collection due to which they require collection interval for interpretation, example: host.network.ingress.bytes.

IIUC, this means that the current approach used in system network metrics will not work as they consume cumulative + monotonic metric and produce counters. Instead, we would need to use a cumulative to delta processor in the otel-collector to convert these metrics into a delta temporality and then produce the host.network.* metrics. Doing this will also give us the collection interval by doing Timestamp#Millis - StartTimestamp#Millis (as suggested by @axw earlier).

CC: @tommyers-elastic @ishleenk17 (let me know if this doesn't makes sense)

lahsivjar commented 1 month ago

It looks like, on Elastic's end, the system.network.* metrics are expected to be a cumulative sum with monotonically increasing values, however, host.network.* metrics are expected to be delta cumulative sums. Both of these metrics are unfortunately derived from the same OTel metrics from the network scraper which is a cumulative sum with monotonically increasing values. This poses an interesting problem since we need both temporality together. I can think of a few ways we can do this:

  1. Do the cumulative to delta conversion in this library after processing it for the system.network.* metrics.
  2. Add a processor in the collector to produce a different metric for the host.network.*. The processor would aggregate all the network interfaces for the metrics produced by the network scraper in a different metric and then convert it to delta temporality which we could process.

I think 2 is a better option and we can package the required OTel config in a single configuration file.

axw commented 1 month ago

I don't love having to require any aggregation in the collector. Either it's on by default and may be a footgun for centralised collectors, or it requires opt-in and makes it harder to onboard.

If those are the only metrics where metricset.period is used, could we make a very targeted change to the UI to handle both OTel & Metricbeat network metrics? Then we can (a) forget about setting metricset.period, and (b) not have to do cumulative-to-delta conversion for this case.

If there are more cases that we also need to solve, then we would need a more general fix.

lahsivjar commented 1 month ago

I don't love having to require any aggregation in the collector. Either it's on by default and may be a footgun for centralised collectors, or it requires opt-in and makes it harder to onboard.

Hmm, didn't get this point. IIUC, hostmetricsreceiver won't be required for centralised collector but for on-edge collectors maintained by customers to monitor their hosts/nodes. We already have a bunch of non-optional metric that we require to be turned on so we would need to publish a recommended minimal config for hostmetrics to work anyway and we can add the aggregation processor their as well.

If those are the only metrics where metricset.period is used, could we make a very targeted change to the UI to handle both OTel & Metricbeat network metrics?

This would probably be the best approach given that we already have system.network.* metrics published and the UI could just do a rate over them and sum the rates for all available network interfaces for the host. However, since the system.network.* is a cumulative metric I am not sure if there are some other edge cases here for the UI.

axw commented 1 month ago

Hmm, didn't get this point. IIUC, hostmetricsreceiver won't be required for centralised collector but for on-edge collectors maintained by customers to monitor their hosts/nodes.

I was thinking of the case where the receiver is at the edge, and this processor is centralised. Like in apm-server, say. So I guess that's really just an argument for not doing option (1).

lahsivjar commented 1 month ago

Created a PoC to get the approach with a metrics transform processor to work: https://github.com/elastic/opentelemetry-lib/pull/16 (haven't validated the correctness of the final metrics yet).

The PoC uses the following processors to produce host.network.* metrics:

processors:
  metricstransform:
    transforms:
      - include: "^system\\.network\\.(io|packets)$$"
        match_type: regexp
        action: insert
        new_name: host.network.$${1}
        operations:
          - action: aggregate_labels
            label_set: [direction]
            aggregation_type: sum
  cumulativetodelta:
    include:
      metrics:
        - "^host\\.network\\.(io|packets)$$"
      match_type: regexp

This also allows us to set the metricset.period based on StartTimestamp and Timestamp.

Full working OTel configuration ```yaml receivers: hostmetrics: collection_interval: 10s scrapers: load: cpu: metrics: system.cpu.utilization: enabled: true system.cpu.logical.count: enabled: true memory: metrics: system.memory.utilization: enabled: true network: process: metrics: process.threads: enabled: true process.open_file_descriptors: enabled: true process.memory.utilization: enabled: true process.disk.operations: enabled: true processes: exporters: otlphttp: endpoint: http://192.168.0.44:8200 tls: insecure: true debug: verbosity: detailed processors: resourcedetection/system: detectors: ["system"] system: hostname_sources: ["os"] metricstransform: transforms: - include: "^system\\.network\\.(io|packets)$$" match_type: regexp action: insert new_name: host.network.$${1} operations: - action: aggregate_labels label_set: [direction] aggregation_type: sum cumulativetodelta: include: metrics: - "^host\\.network\\.(io|packets)$$" match_type: regexp service: pipelines: metrics: receivers: [hostmetrics] processors: [metricstransform, cumulativetodelta, resourcedetection/system] exporters: [otlphttp, debug] ```
axw commented 1 month ago

@lahsivjar nice. Perhaps we could: