elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash
Other
104 stars 3.51k forks source link

Document already exist (http 409) cases document improvements. #16117

Closed mashhurs closed 3 months ago

mashhurs commented 7 months ago

Tell us about the issue

Description:
There are various situation where ES may reject the event with document already exist. Purpose of this issue to collect such cases and add a short documentation (under the whichever suitable place, in troubleshooting or support doc or es-output) as we are getting same question over and over.

Example log when Logstash receives a rejected event:

    [2024-04-24T14:13:25,988][WARN ][logstash.outputs.elasticsearch][Elastic-Agent-to-Logstash].[6ef18a6008d3cea8f01e0cd409c22213845cff0829c5f76b02d18a30a22c024d] Failed action {:status=>409, :action=>["create", {:_id=>nil, :_index=>"metrics-windows.service-default", :routing=>nil}, {"host"=>{"mac"=>["...", "..."], "name"=>"redacted-name", "ip"=>["1", "2", "3"], "architecture"=>"x86_64", "id"=>"aae15", "os"=>{"name"=>"Windows Server 2016 Datacenter", "platform"=>"windows", "type"=>"windows", "kernel"=>"10.0.14393.6897 (rs1_release.240404-1613)", "family"=>"windows", "build"=>"14393.6897", "version"=>"10.0"}, "hostname"=>"redacted-host"}, "service"=>{"type"=>"windows"}, "elastic_agent"=>{"version"=>"8.11.3", "id"=>"48423d1c-5a87-46ef-b6a8-baf90b515e63", "snapshot"=>false}, "metricset"=>{"name"=>"service", "period"=>60000}, "event"=>{"duration"=>211345500, "module"=>"windows", "dataset"=>"windows.service"}, "cloud"=>{"service"=>{"name"=>"redacted-name"}, "instance"=>{"id"=>"redacted-id", "name"=>"redacted-name"}, "provider"=>"openstack", "machine"=>{"type"=>"t"}, "availability_zone"=>"zone"}, "@timestamp"=>2024-04-24T14:12:24.991Z, ...., "type"=>"metricbeat", "id"=>"48423d1c-5a87-46ef-b6a8-baf90b515e63", "ephemeral_id"=>"fdd26638-3865-4e63-961a-69d8f419538a", "version"=>"8.11.3"}, "@version"=>"1"}], :response=>{"create"=>{"status"=>409, "error"=>{"type"=>"version_conflict_engine_exception", "reason"=>"[abcd][{agent.id=my-agent-id, cloud.availability_zone=zone, cloud.instance.id=i-id, windows.service.pid=996, windows.service.state=Running}@2024-04-24T14:12:24.991Z]: version conflict, document already exists (current version [1])", "index_uuid"=>"wBkYL16ZSo2fJ7tAC3zbbw", "shard"=>"0", "index"=>".ds-metrics-windows.service-default-2024.04.22-000065"}}}}

URL:

Example: https://www.elastic.co/guide/en/logstash/current/introduction.html

Anything else?

robbavey commented 7 months ago

There is another potential cause for these 409 conflicts.

When integrations write to a TSDS enabled index, the document id is defined as "a hash of the document’s dimensions and @timestamp".

The document's dimensions are defined in the integration, and when events are sent at a frequency > 1 per millisecond, and the dimensions are insufficient to disambiguate those events, a version conflict will arise. This has already been seen in the integrations for the elastic agent and mysql, and I suspect there are more that can cause the issue

mashhurs commented 5 months ago

Had created a doc for fingerpint case but not sure if we have a recommended resolution for TSDS case: https://github.com/mashhurs/logstash/blob/docs-409-issue/docs/static/troubleshoot/ts-elasticsearch.asciidoc

mashhurs commented 3 months ago

Mostly, we get document level http-409 with following cases:

1. Customers are manually setting document _id (not specific to agent -> LS)

They need to ensure _id uniqueness. If you see document_id setting (maybe with a dynamic sprintf style like "%{[@metadata][id]}") in es-output plugin of pipeline configs, most likely this is a cause. This can be double confirmed if you see exact doc ID, {:_id=>supposed-to-be-unique-id, ....

2. [elastic-agent -> LS] Logstash is experiencing backpressure

When {ls} faces backpressure, it cannot acknowledge the events back to elastic-agent and as a result agent timeouts, resends the events. However, previously sent events are (fully or partially) indexed into ES. From the agent side, extending timeout in elastic-agent configuration may improve the situation. However, it is highly recommended to

3. [elastic-agent -> LS] Time series data stream (TSDS) based integration

_id for TSDS documents is a hash value, calculated by integrations based on https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html#time-series-dimension[document's dimentions] and @timestamp. Depending on dimension granularity and the time frequency ES receives documents, _id might not be unique. Based on TSDS guide, it can be identified if index is time_series or not by taking a look at its index template (index.mode: time_series). In the index template, we can also see which fields are dimensions ("time_series_dimension": true). You may need to analyze figuring out dimension + @timestamp uniqueness.

With this symptom, I haven't seen the _id is explicitly set, it is mostly nil (see the error example below). Reproduced fresh example of error:

[2024-08-26T15:11:43,587][WARN ][logstash.outputs.elasticsearch][main][5da2de1a6015f56bca61a3878c1e397ea47ce4da8031f78e07087f62e900c6c6] Failed action {:status=>409, :action=>["create", {:_id=>nil, :_index=>"metrics-system.network-default", :routing=>nil}, {"@timestamp"=>2024-08-26T21:48:48.268Z, "agent"=>{"id"=>"95e1ef4e-6e7a-4cfd-88e8-98df506ad305", "type"=>"metricbeat", "ephemeral_id"=>"9f7fc9b8-a93c-4252-8312-41668e9bb310", "name"=>"mashhurs-host, "version"=>"8.14.1"}, "system"=>{"network"=>{"out"=>{"packets"=>1, "dropped"=>0, "bytes"=>100, "errors"=>0}, "name"=>"utun0", "in"=>{"packets"=>0, "dropped"=>0, "bytes"=>0, "errors"=>0}}}, "service"=>{"type"=>"system"}, "metricset"=>{"name"=>"network", "period"=>1000}, "ecs"=>{"version"=>"8.0.0"}, "data_stream"=>{"namespace"=>"default", "type"=>"metrics", "dataset"=>"system.network"}, "tags"=>["beats_input_raw_event"], "event"=>{"duration"=>30177083, "dataset"=>"system.network", "module"=>"system"}, "@version"=>"1", "elastic_agent"=>{"id"=>"95e1ef4e-6e7a-4cfd-88e8-....", "snapshot"=>false, "version"=>"8.14.1"}}], :response=>{"create"=>{"status"=>409, "error"=>{"type"=>"version_conflict_engine_exception", "reason"=>"[X2S_eqC1kCpFpb1lAAABkZCpuww][LI_yqguJnd4yG2gWc-sq5af-yVYaXIwj-eRIWXnMt35ym4EgBCfmRK6iivMF@2024-08-26T21:48:48.268Z]: version conflict, document already exists (current version [1])", "index_uuid"=>"YJE1kdgbTvuNeS23sbDVQg", "shard"=>"0", "index"=>".ds-metrics-system.network-default-2024.08.26-000003"}}}}

4. Can be ignored es-output has a silence_errors_in_log option specifically useful for this situations. Since the doc is already indexed, it is up to your preference to ignore it. Example config:

    output {
      elasticsearch {
        silence_errors_in_log => ["version_conflict_engine_exception"]
      }
    }