elastic / integrations

Elastic Integrations
https://www.elastic.co/integrations

[Istio][Istiod Metrics]: Metrics are incorrectly dropped because of TSDS Dimension issue #11513

Open BenB196 opened 2 weeks ago

BenB196 commented 2 weeks ago

Integration Name

Istio [istio]

Dataset Name

istio.istiod_metrics

Integration Version

0.6.0

Agent Version

8.14.3

Agent Output Type

logstash

Elasticsearch Version

8.15.3

OS Version and Architecture

Container

Software/API Version

Istio 1.23.1

Error Message

No response

Event Original

No response

What did you do?

I was recently looking into an issue and noticed that Logstash was reporting a high number of document conflicts for the Istio integration.

What did you see?

The Istio metrics ingest pipeline overrides the istio.istiod.labels.job value, which causes a high number of document conflicts.

Here are 2 events that were rejected as "duplicates". In reality, no events exist in Elasticsearch that would have matched them if the job label had not been overwritten from its original value.

[2024-10-24T20:55:10,747][WARN ][logstash.outputs.elasticsearch][elastic-agent][elastic_agent_elasticsearch_output] Failed action {:status=>409, :action=>["create", {:_id=>nil, :_index=>"metrics-istio.istiod_metrics-private.default.production", :routing=>nil}, {"metricset"=>{"name"=>"collector", "period"=>10000}, "elastic_agent"=>{"version"=>"8.14.3", "id"=>"eb215fef-a3be-4c7e-99dd-4ef14fa09422", "snapshot"=>false}, "prometheus"=>{"pilot_k8s_reg_events"=>{"rate"=>0, "counter"=>122}, "labels"=>{"type"=>"EndpointSlice", "event"=>"add", "instance"=>"istiod.istio-system:15014", "job"=>"prometheus"}}, "@timestamp"=>2024-10-24T20:55:09.762Z, "service"=>{"type"=>"prometheus", "address"=>"http://istiod.istio-system:15014/metrics"}, "agent"=>{"type"=>"metricbeat", "version"=>"8.14.3", "id"=>"eb215fef-a3be-4c7e-99dd-4ef14fa09422", "name"=>"monitoring-cwzmb", "ephemeral_id"=>"d0144499-90a2-4b62-a372-5d764ac3b1bf"}, "ecs"=>{"version"=>"8.0.0"}, "tags"=>["beats_input_raw_event"], "@version"=>"1", "event"=>{"module"=>"prometheus", "dataset"=>"istio.istiod_metrics", "duration"=>7346460}, "data_stream"=>{"type"=>"metrics", "dataset"=>"istio.istiod_metrics", "namespace"=>"private.default.production"}, "host"=>{"hostname"=>"monitoring-cwzmb", "os"=>{"type"=>"linux", "version"=>"20.04.6 LTS (Focal Fossa)", "name"=>"Ubuntu", "platform"=>"ubuntu", "family"=>"debian", "kernel"=>"5.4.0-137-generic", "codename"=>"focal"}, "architecture"=>"x86_64", "containerized"=>true, "id"=>"96912ebd3bd4409194c45e17fda36045", "name"=>"monitoring-cwzmb"}}], :response=>{"create"=>{"status"=>409, "error"=>{"type"=>"version_conflict_engine_exception", "reason"=>"[khWTx421ZP8kG17gAAABksBP0sI][LEIgNgvoLGR0pnOGLdGZAmS6WlVcdve3os_9icbU8DnL5pm1z9wY3r_si41D@2024-10-24T20:55:09.762Z]: version conflict, document already exists (current version [1])", "index_uuid"=>"_UXCMxmYQSmzU4EePR_Gkw", "shard"=>"0", "index"=>".ds-metrics-istio.istiod_metrics-private.default.production-2024.10.24-000107"}}}}
[2024-10-24T20:55:10,747][WARN ][logstash.outputs.elasticsearch][elastic-agent][elastic_agent_elasticsearch_output] Failed action {:status=>409, :action=>["create", {:_id=>nil, :_index=>"metrics-istio.istiod_metrics-private.default.production", :routing=>nil}, {"metricset"=>{"name"=>"collector", "period"=>10000}, "elastic_agent"=>{"version"=>"8.14.3", "id"=>"eb215fef-a3be-4c7e-99dd-4ef14fa09422", "snapshot"=>false}, "prometheus"=>{"pilot_k8s_reg_events"=>{"rate"=>0, "counter"=>36}, "labels"=>{"type"=>"Services", "event"=>"update", "instance"=>"istiod.istio-system:15014", "job"=>"prometheus"}}, "@timestamp"=>2024-10-24T20:55:09.762Z, "service"=>{"type"=>"prometheus", "address"=>"http://istiod.istio-system:15014/metrics"}, "agent"=>{"type"=>"metricbeat", "version"=>"8.14.3", "id"=>"eb215fef-a3be-4c7e-99dd-4ef14fa09422", "name"=>"monitoring-cwzmb", "ephemeral_id"=>"d0144499-90a2-4b62-a372-5d764ac3b1bf"}, "ecs"=>{"version"=>"8.0.0"}, "tags"=>["beats_input_raw_event"], "@version"=>"1", "event"=>{"module"=>"prometheus", "dataset"=>"istio.istiod_metrics", "duration"=>7341131}, "data_stream"=>{"type"=>"metrics", "dataset"=>"istio.istiod_metrics", "namespace"=>"private.default.production"}, "host"=>{"hostname"=>"monitoring-cwzmb", "os"=>{"type"=>"linux", "version"=>"20.04.6 LTS (Focal Fossa)", "name"=>"Ubuntu", "kernel"=>"5.4.0-137-generic", "codename"=>"focal", "platform"=>"ubuntu", "family"=>"debian"}, "architecture"=>"x86_64", "containerized"=>true, "id"=>"96912ebd3bd4409194c45e17fda36045", "name"=>"monitoring-cwzmb"}}], :response=>{"create"=>{"status"=>409, "error"=>{"type"=>"version_conflict_engine_exception", "reason"=>"[0bDHwckAnL0f-jhCAAABksBP0sI][LEIgNgvoLGR0pnOGLdGZAmS6WlVcdve3okkBF1VJ6DEQ7exqjjeLtCZ7P4jb@2024-10-24T20:55:09.762Z]: version conflict, document already exists (current version [1])", "index_uuid"=>"_UXCMxmYQSmzU4EePR_Gkw", "shard"=>"0", "index"=>".ds-metrics-istio.istiod_metrics-private.default.production-2024.10.24-000107"}}}}

What did you expect to see?

I expect to see these documents properly ingested.

Anything else?

The issue appears to be that the Istio labels are used to generate a fingerprint:

https://github.com/elastic/integrations/blob/42826c851cd38df1cb229b31f184f68ee89f7a80/packages/istio/data_stream/istiod_metrics/elasticsearch/ingest_pipeline/default.yml#L31-L34

Which is then used as a TSDS dimension:

https://github.com/elastic/integrations/blob/42826c851cd38df1cb229b31f184f68ee89f7a80/packages/istio/data_stream/istiod_metrics/fields/fields.yml#L4-L7

The problem is that one of the key "dimension" labels, the job label, is always overwritten before the fingerprint is generated:

https://github.com/elastic/integrations/blob/42826c851cd38df1cb229b31f184f68ee89f7a80/packages/istio/data_stream/istiod_metrics/elasticsearch/ingest_pipeline/default.yml#L23-L26

It's not clear why this value is overwritten in the first place, but with the change to TSDS and dimensions, it now seems to cause a high number of Istiod Metrics to be dropped.
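
To illustrate the suspected mechanism, here is a minimal Python sketch, not the integration's actual pipeline. It assumes the fingerprint is a hash over the full label set; the `fingerprint` and `overwrite_job` helpers, the `job-a`/`job-b` values, and the `fixed-value` replacement are all hypothetical stand-ins. It shows that two label sets that differ only in `job` produce distinct IDs until `job` is overwritten, after which they collapse into one.

```python
import base64
import hashlib
import json


def fingerprint(labels: dict) -> str:
    """Hash a label set into a stable ID.

    Illustrative only: this is NOT the exact algorithm used by the
    Elasticsearch fingerprint ingest processor.
    """
    canonical = json.dumps(labels, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha1(canonical.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


def overwrite_job(labels: dict, value: str = "fixed-value") -> dict:
    """Mimic a pipeline step that replaces the job label (hypothetical value)."""
    return {**labels, "job": value}


# Two hypothetical label sets that differ only in the job label.
a = {"instance": "istiod.istio-system:15014", "job": "job-a", "version": "1.23.1"}
b = {"instance": "istiod.istio-system:15014", "job": "job-b", "version": "1.23.1"}

# With the original job values, the fingerprints differ.
distinct_before = fingerprint(a) != fingerprint(b)

# Once job is overwritten, both label sets hash to the same fingerprint.
same_after = fingerprint(overwrite_job(a)) == fingerprint(overwrite_job(b))

print(distinct_before, same_after)  # True True
```

If the fingerprint is the only label-derived dimension, any two series that collapse like this and share a timestamp would be treated as the same document.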

BenB196 commented 2 weeks ago

Looking at the history of:

https://github.com/elastic/integrations/blob/42826c851cd38df1cb229b31f184f68ee89f7a80/packages/istio/data_stream/istiod_metrics/elasticsearch/ingest_pipeline/default.yml#L23-L26

This was added as part of the original PR #4253.

It's not clear why this was added. I suspect that it could be removed, which would resolve this issue.

BenB196 commented 1 week ago

Looking at this a bit more closely, I'm not actually sure if this is a "bug" or intended.

Using a more specific example:

"_source": {
  "@timestamp": "2024-10-26T12:50:14.571Z",
  "@version": "1",
  "agent": {
    "ephemeral_id": "012b62b9-8748-4257-b148-4e82191cfdd8",
    "id": "d3c8a4ad-d4c1-41a6-bc4d-32942f79f522",
    "name": "monitoring-fq875",
    "type": "metricbeat",
    "version": "8.14.3"
  },
  "data_stream": {
    "dataset": "istio.istiod_metrics",
    "namespace": "private.default.production",
    "type": "metrics"
  },
  "ecs": {
    "version": "8.6.0"
  },
  "elastic_agent": {
    "id": "d3c8a4ad-d4c1-41a6-bc4d-32942f79f522",
    "snapshot": false,
    "version": "8.14.3"
  },
  "event": {
    "agent_id_status": "auth_metadata_missing",
    "dataset": "istio.istiod_metrics",
    "duration": 7565596,
    "ingested": "2024-10-26T12:50:25Z",
    "kind": "metric",
    "module": "istio"
  },
  "host": {
    "architecture": "x86_64",
    "containerized": true,
    "hostname": "monitoring-fq875",
    "id": "047f4adf0d834eaa883d97a880781760",
    "name": "monitoring-fq875",
    "os": {
      "codename": "focal",
      "family": "debian",
      "kernel": "5.4.0-137-generic",
      "name": "Ubuntu",
      "platform": "ubuntu",
      "type": "linux",
      "version": "20.04.6 LTS (Focal Fossa)"
    }
  },
  "istio": {
    "istiod": {
      "labels": {
        "instance": "istiod.istio-system:15014",
        "job": "prometheus",
        "version": "1.23.1"
      },
      "labels_id": "rhdvqrHt7hTr7GH5lFq2mD31JGA=",
      "metrics": {
        "pilot_xds": {
          "value": 5
        }
      }
    }
  },
  "metricset": {
    "period": 10000
  },
  "tags": "beats_input_raw_event"
}
[2024-10-26T12:50:25,557][WARN ][logstash.outputs.elasticsearch][elastic-agent][elastic_agent_elasticsearch_output] Failed action {:status=>409, :action=>["create", {:_id=>nil, :_index=>"metrics-istio.istiod_metrics-private.default.production", :routing=>nil}, {"prometheus"=>{"labels"=>{"version"=>"1.23.1", "instance"=>"istiod.istio-system:15014", "job"=>"prometheus"}, "pilot_xds"=>{"value"=>5}}, "event"=>{"module"=>"prometheus", "dataset"=>"istio.istiod_metrics", "duration"=>8044930}, "tags"=>["beats_input_raw_event"], "@timestamp"=>2024-10-26T12:50:14.571Z, "ecs"=>{"version"=>"8.0.0"}, "@version"=>"1", "agent"=>{"ephemeral_id"=>"012b62b9-8748-4257-b148-4e82191cfdd8", "version"=>"8.14.3", "id"=>"d3c8a4ad-d4c1-41a6-bc4d-32942f79f522", "name"=>"monitoring-fq875", "type"=>"metricbeat"}, "metricset"=>{"name"=>"collector", "period"=>10000}, "data_stream"=>{"namespace"=>"private.default.production", "dataset"=>"istio.istiod_metrics", "type"=>"metrics"}, "service"=>{"type"=>"prometheus", "address"=>"http://istiod.istio-system:15014/metrics"}, "host"=>{"hostname"=>"monitoring-fq875", "containerized"=>true, "architecture"=>"x86_64", "id"=>"047f4adf0d834eaa883d97a880781760", "name"=>"monitoring-fq875", "os"=>{"version"=>"20.04.6 LTS (Focal Fossa)", "name"=>"Ubuntu", "codename"=>"focal", "type"=>"linux", "platform"=>"ubuntu", "family"=>"debian", "kernel"=>"5.4.0-137-generic"}}, "elastic_agent"=>{"version"=>"8.14.3", "id"=>"d3c8a4ad-d4c1-41a6-bc4d-32942f79f522", "snapshot"=>false}}], :response=>{"create"=>{"status"=>409, "error"=>{"type"=>"version_conflict_engine_exception", "reason"=>"[yFVUdPrnE3bEh5JiAAABksjglas][LEIgNgvoLGR0pnOGLdGZAmTvTk69cY8zMZwD0_9-b9Zq6XJvE2Y_RiZOv_8u@2024-10-26T12:50:14.571Z]: version conflict, document already exists (current version [1])", "index_uuid"=>"_UXCMxmYQSmzU4EePR_Gkw", "shard"=>"0", "index"=>".ds-metrics-istio.istiod_metrics-private.default.production-2024.10.24-000107"}}}}

These 2 events are almost identical; the only difference is the event.duration value:

"duration": 7565596, -> "duration"=>8044930

It'd seem really weird to add duration as a TSDS dimension, but that seems to be the only difference between these 2 events. I'm not sure if this should really be a "bug" that gets fixed, or left as is.
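
For context on why a differing duration would not help: as I understand the TSDS documentation, the document `_id` is derived from the dimension fields and `@timestamp` only, so non-dimension fields such as `event.duration` play no part in it. The sketch below models that behaviour in Python under stated assumptions. The `tsds_doc_id` and `index_create` helpers are hypothetical, the hash is not Elasticsearch's real one, and only the `labels_id` dimension is modelled (any other dimensions in the data stream are omitted).

```python
import hashlib


def tsds_doc_id(dimensions: dict, timestamp: str) -> str:
    """Derive an ID from dimensions and timestamp only (illustrative hash)."""
    parts = [f"{key}={dimensions[key]}" for key in sorted(dimensions)]
    parts.append(timestamp)
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:20]


def index_create(index: dict, doc_id: str, doc: dict) -> int:
    """Mimic a 'create' action: 201 on success, 409 if the ID already exists."""
    if doc_id in index:
        return 409
    index[doc_id] = doc
    return 201


# Values taken from the example above; only event.duration differs.
dimensions = {"istio.istiod.labels_id": "rhdvqrHt7hTr7GH5lFq2mD31JGA="}
timestamp = "2024-10-26T12:50:14.571Z"
events = [
    {"event.duration": 7565596, "pilot_xds": 5},
    {"event.duration": 8044930, "pilot_xds": 5},
]

index: dict = {}
statuses = [
    index_create(index, tsds_doc_id(dimensions, timestamp), event)
    for event in events
]
print(statuses)  # [201, 409]
```

Under this model the second event is rejected with a 409 no matter what its duration is, which matches the version_conflict_engine_exception in the log.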