fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

"Untyped" metric type is causing invalid temporality errors #7703

Closed Joufu closed 11 months ago

Joufu commented 1 year ago

Bug Report

Describe the bug

The OpenTelemetry output does not convert metrics of type "untyped" to a type that OpenTelemetry supports. We use Fluent Bit to scrape node_exporter and export the metrics to an OpenTelemetry Collector stack. After we recently modified some node_exporter collectors, this error started appearing in the OTel Collector logs:

    Exporting failed. The error is not retryable. Dropping data.    {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite/minithanos", "error": "Permanent error: invalid temporality and type combination for metric \"node_vmstat_kswapd_high_wmark_hit_quickly\"; invalid temporality and type combination for metric \"node_vmstat_kswapd_inodesteal\"; invalid temporality and type combination for metric \"node_vmstat_kswapd_low_wmark_hit_quickly\"; invalid temporality and type combination for metric \"node_vmstat_nr_dirty\"; invalid temporality and type combination for metric \"node_vmstat_nr_dirty_background_threshold\"; invalid temporality and type combination for metric \"node_vmstat_nr_dirty_threshold\"; invalid temporality and type combination for metric \"node_vmstat_nr_writeback\"; invalid temporality and type combination for metric \"node_vmstat_nr_writeback_temp\"; invalid temporality and type combination for metric \"node_vmstat_oom_kill\"; invalid temporality and type combination for metric \"node_vmstat_pgfault\"; invalid temporality and type combination for metric \"node_vmstat_pgmajfault\"; invalid temporality and type combination for metric \"node_vmstat_pgpgin\"; invalid temporality and type combination for metric \"node_vmstat_pgpgout\"; invalid temporality and type combination for metric \"node_vmstat_pgscan_direct\"; invalid temporality and type combination for metric \"node_vmstat_pgscan_direct_throttle\"; invalid temporality and type combination for metric \"node_vmstat_pgscan_kswapd\"; invalid temporality and type combination for metric \"node_vmstat_pgsteal_direct\"; invalid temporality and type combination for metric \"node_vmstat_pgsteal_kswapd\"; invalid temporality and type combination for metric \"node_vmstat_pswpin\"; invalid temporality and type combination for metric \"node_vmstat_pswpout\"; Permanent error: Permanent error: remote write returned HTTP status 409 Conflict; err = %!w(<nil>): store locally for endpoint : 
    add 215 series: label set contains a label with empty name or value\n", "errorCauses": [{"error": "Permanent error: invalid temporality and type combination for metric \"node_vmstat_kswapd_high_wmark_hit_quickly\"; invalid temporality and type combination for metric \"node_vmstat_kswapd_inodesteal\"; invalid temporality and type combination for metric \"node_vmstat_kswapd_low_wmark_hit_quickly\"; invalid temporality and type combination for metric \"node_vmstat_nr_dirty\"; invalid temporality and type combination for metric \"node_vmstat_nr_dirty_background_threshold\"; invalid temporality and type combination for metric \"node_vmstat_nr_dirty_threshold\"; invalid temporality and type combination for metric \"node_vmstat_nr_writeback\"; invalid temporality and type combination for metric \"node_vmstat_nr_writeback_temp\"; invalid temporality and type combination for metric \"node_vmstat_oom_kill\"; invalid temporality and type combination for metric \"node_vmstat_pgfault\"; invalid temporality and type combination for metric \"node_vmstat_pgmajfault\"; invalid temporality and type combination for metric \"node_vmstat_pgpgin\"; invalid temporality and type combination for metric \"node_vmstat_pgpgout\"; invalid temporality and type combination for metric \"node_vmstat_pgscan_direct\"; invalid temporality and type combination for metric \"node_vmstat_pgscan_direct_throttle\"; invalid temporality and type combination for metric \"node_vmstat_pgscan_kswapd\"; invalid temporality and type combination for metric \"node_vmstat_pgsteal_direct\"; invalid temporality and type combination for metric \"node_vmstat_pgsteal_kswapd\"; invalid temporality and type combination for metric \"node_vmstat_pswpin\"; invalid temporality and type combination for metric \"node_vmstat_pswpout\""}, {"error": "Permanent error: Permanent error: remote write returned HTTP status 409 Conflict; err = %!w(<nil>): store locally for endpoint : add 215 series: label set contains a label with empty 
    name or value\n"}], "dropped_items": 999}
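For triage, it can help to pull the rejected metric names out of a log line like the one above. The following is a minimal sketch, not part of Fluent Bit or the OTel Collector; the function name `rejected_metrics` and the regular expression are assumptions based only on the message format shown in this log.

```python
import re

# Matches the message fragment seen in the collector log. In the JSON-encoded
# log line the metric name is wrapped in escaped quotes (\"name\"), so the
# backslash before each quote is treated as optional.
_PATTERN = re.compile(
    r'invalid temporality and type combination for metric '
    r'\\?"([A-Za-z_:][A-Za-z0-9_:]*)\\?"'
)


def rejected_metrics(log_text: str) -> list[str]:
    """Return the unique metric names named in the error, in first-seen order."""
    seen: dict[str, None] = {}
    for name in _PATTERN.findall(log_text):
        seen.setdefault(name, None)  # dict keeps insertion order, drops repeats
    return list(seen)
```

Because the collector repeats the same message under `errorCauses`, each name appears twice in the raw line; the sketch deduplicates them.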

All of the metrics named in the error log have type "untyped". It seems that Fluent Bit sends "untyped" metrics without converting them to a supported OTel type, whereas the OpenTelemetry Collector converts untyped to gauge. When we use the OTel Collector instead of Fluent Bit, we do not hit this problem.
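The conversion we expect can be sketched as a simple lookup. This is an illustration only, not Fluent Bit's or the Collector's actual code: the untyped-to-gauge rule is the behavior described in this report, while the other entries, the `OtlpKind` enum, the `otlp_kind_for` helper, and the gauge fallback for unrecognized types are assumptions made for the example.

```python
from enum import Enum


class OtlpKind(Enum):
    """Hypothetical stand-in for the OTLP metric data kinds."""
    GAUGE = "gauge"
    SUM = "sum"
    HISTOGRAM = "histogram"
    SUMMARY = "summary"


# Prometheus exposition type -> OTLP kind. Only the "untyped" row comes from
# the report; the remaining rows are the conventional mapping, assumed here.
_PROM_TO_OTLP = {
    "counter": OtlpKind.SUM,
    "gauge": OtlpKind.GAUGE,
    "histogram": OtlpKind.HISTOGRAM,
    "summary": OtlpKind.SUMMARY,
    "untyped": OtlpKind.GAUGE,
}


def otlp_kind_for(prom_type: str) -> OtlpKind:
    """Map a Prometheus type string to an OTLP kind, defaulting to gauge."""
    # Assumption: anything unrecognized is treated like "untyped".
    return _PROM_TO_OTLP.get(prom_type.strip().lower(), OtlpKind.GAUGE)
```

With a rule like this in the output path, a metric such as `node_vmstat_pgfault` would be exported as a gauge rather than with no supported type.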

Our infra looks like this:

node_exporter <--[Input]Prometheus Scrape Metrics --> [Output]OpenTelemetry --> OTel Collectors --> Thanos

To Reproduce

  1. Run node_exporter (with the config example below)
  2. Run Fluent Bit (with the config example below)
  3. Run OTel Collector Contrib (with the example below)
  4. Export metrics to any TSDB

Expected behavior

Metrics of type "untyped" should be converted to a type that OpenTelemetry supports.

Your Environment

    [INPUT]
        name            prometheus_scrape
        host            localhost
        port            19100
        tag             node_exporter
        metrics_path    /metrics
        scrape_interval 15s

    [OUTPUT]
        Name        opentelemetry
        Match       exporter
        Host        os-metrics-nop
        Port        443
        Metrics_uri /v1/metrics
        Tls         On
        Tls.verify  Off
        add_label   platform linux
        add_label   environment test
        add_label   resource

OpenTelemetry Collectors:
extensions:
  health_check: {}

receivers:
  otlp:
    protocols:
      grpc:
      http:

  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector-metrics
          scrape_interval: 60s
          static_configs:
            - targets: ["localhost:8888"]

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25

  batch:
    send_batch_size: 1000
    send_batch_max_size: 1500
    timeout: 200ms

exporters:
  prometheusremotewrite/minithanos:
    endpoint: "http://<redacted>:9093/api/v1/receive"
    target_info:
      enabled: false
    resource_to_telemetry_conversion:
      enabled: false

  prometheus/2:
    endpoint: "localhost:9200"
    send_timestamps: true
    enable_open_metrics: false
    resource_to_telemetry_conversion:
      enabled: false

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite/minithanos, prometheus/2]

  extensions: [health_check]

* Environment name and version (e.g. Kubernetes? What version?): node_exporter and Fluent Bit running on a RHEL8 VM, OTel in OpenShift 4.11
* Operating System and version: RHEL8, OpenShift 4.11
* Filters and plugins: Prometheus scrape input, OpenTelemetry output

**Additional context**
As mentioned above, we tried replacing Fluent Bit with the OTel Collector and did not see the issue.
Examples of node_exporter metrics we use that have type "untyped":

    # TYPE node_netstat_Icmp_InErrors untyped
    # TYPE node_netstat_TcpExt_DelayedACKs untyped
    # TYPE node_netstat_TcpExt_ListenDrops untyped
    # TYPE node_netstat_TcpExt_ListenOverflows untyped
    # TYPE node_netstat_TcpExt_SyncookiesFailed untyped
    # TYPE node_netstat_TcpExt_SyncookiesRecv untyped
    # TYPE node_netstat_TcpExt_SyncookiesSent untyped
    # TYPE node_netstat_TcpExt_TCPSynRetrans untyped
    # TYPE node_netstat_TcpExt_TCPTimeouts untyped
    # TYPE node_netstat_Tcp_ActiveOpens untyped
    # TYPE node_netstat_Tcp_CurrEstab untyped
    # TYPE node_netstat_Tcp_InErrs untyped
    # TYPE node_netstat_Tcp_InSegs untyped
    # TYPE node_netstat_Tcp_OutRsts untyped
    # TYPE node_netstat_Tcp_OutSegs untyped
    # TYPE node_netstat_Tcp_PassiveOpens untyped
    # TYPE node_netstat_Tcp_RetransSegs untyped
    # TYPE node_netstat_UdpLite_InErrors untyped
    # TYPE node_netstat_Udp_InErrors untyped
    # TYPE node_vmstat_kswapd_high_wmark_hit_quickly untyped
    # TYPE node_vmstat_kswapd_inodesteal untyped
    # TYPE node_vmstat_kswapd_low_wmark_hit_quickly untyped
    # TYPE node_vmstat_nr_dirty untyped
    # TYPE node_vmstat_nr_dirty_background_threshold untyped
    # TYPE node_vmstat_nr_dirty_threshold untyped
    # TYPE node_vmstat_nr_writeback untyped
    # TYPE node_vmstat_nr_writeback_temp untyped
    # TYPE node_vmstat_oom_kill untyped
    # TYPE node_vmstat_pgfault untyped
    # TYPE node_vmstat_pgmajfault untyped
    # TYPE node_vmstat_pgpgin untyped
    # TYPE node_vmstat_pgpgout untyped
    # TYPE node_vmstat_pgscan_direct untyped
    # TYPE node_vmstat_pgscan_direct_throttle untyped
    # TYPE node_vmstat_pgscan_kswapd untyped
    # TYPE node_vmstat_pgsteal_direct untyped
    # TYPE node_vmstat_pgsteal_kswapd untyped
    # TYPE node_vmstat_pswpin untyped
    # TYPE node_vmstat_pswpout untyped
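A list like the one above can be produced from a scrape of the exporter. The sketch below assumes the standard Prometheus text exposition format, in which type declarations look like `# TYPE <name> <type>`; the helper names `untyped_metrics` and `fetch_exposition` are hypothetical, and the URL in the usage note is taken from the `[INPUT]` settings in this report.

```python
from urllib.request import urlopen


def untyped_metrics(exposition: str) -> list[str]:
    """Return names of metrics declared as 'untyped' in exposition text."""
    names = []
    for line in exposition.splitlines():
        parts = line.split()
        # A type declaration has exactly four tokens: "#", "TYPE", name, type.
        if (
            len(parts) == 4
            and parts[0] == "#"
            and parts[1] == "TYPE"
            and parts[3] == "untyped"
        ):
            names.append(parts[2])
    return names


def fetch_exposition(url: str, timeout: float = 5.0) -> str:
    """Fetch the raw /metrics text from an exporter (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `untyped_metrics(fetch_exposition("http://localhost:19100/metrics"))` would list the affected metrics on a host set up like ours.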

github-actions[bot] commented 11 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] commented 11 months ago

This issue was closed because it has been stalled for 5 days with no activity.