elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Do not consider 409 Conflict responses from _bulk an error when ingesting to data streams #36547

Closed cmacknz closed 11 months ago

cmacknz commented 1 year ago

When a data stream is in TSDB mode, Elasticsearch enforces that the combination of the timestamp and the chosen dimensions must be unique (see https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html#time-series-dimension).

This means that in a system like Beats, which guarantees at-least-once delivery (not exactly-once), a retried event may actually have been ingested on the first attempt while the response was lost (because of a network interruption, for example).

When this happens the Beat retries pointlessly; in the case of Metricbeat (the Beat most likely to be used with TSDB) it retries up to three times by default. The retries are pointless because the document was already ingested successfully, which is exactly what the 409 response is telling us. After 3 retries Metricbeat eventually moves on to the next event; if the retry configuration is set higher than the default, the pipeline stalls trying to ingest an event that was already ingested.

This would be straightforward to implement if Beats had any knowledge that it was ingesting to a TSDB data stream, but the TSDB configuration applies purely on the Elasticsearch side and to Beats it is a normal data stream. We may need to consider other cases where a 409 Conflict is a legitimate error.

The most likely other case is where an _id has been explicitly set on the data, in which case the problem is almost exactly the same: the event already exists and ES is preventing us from duplicating it with this error.

An example error in the TSDB case is:

{
  "error": {
    "root_cause": [
      {
        "type": "version_conflict_engine_exception",
        "reason": "[B-Eo9MnlSVjnS16pAAABinZYMbA][{agent.id=8148e2e9-d14c-45d8-9ef9-881106eaa405, elastic_agent.process=filebeat}@2023-09-08T19:50:06.000Z]: version conflict, document already exists (current version [1])",
        "index_uuid": "46nFtTFqR06advjePbGvQw",
        "shard": "0",
        "index": ".ds-metrics-elastic_agent.elastic_agent-default-2023.09.08-000001"
      }
    ],
    "type": "version_conflict_engine_exception",
    "reason": "[B-Eo9MnlSVjnS16pAAABinZYMbA][{agent.id=8148e2e9-d14c-45d8-9ef9-881106eaa405, elastic_agent.process=filebeat}@2023-09-08T19:50:06.000Z]: version conflict, document already exists (current version [1])",
    "index_uuid": "46nFtTFqR06advjePbGvQw",
    "shard": "0",
    "index": ".ds-metrics-elastic_agent.elastic_agent-default-2023.09.08-000001"
  },
  "status": 409
}
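For reference, the duplicate-delivery scenario can be reproduced directly against the _bulk API. Below is a minimal sketch using the go-elasticsearch client; the data stream name and field values mirror the error above, and the cluster connection details are assumed to come from the default client. This is an illustration, not Beats code.

package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/elastic/go-elasticsearch/v8"
)

func main() {
	es, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatal(err)
	}

	// One document with a fixed timestamp and fixed dimension values;
	// "create" is the op type used when writing to data streams.
	body := []byte(`{"create":{}}
{"@timestamp":"2023-09-08T19:50:06.000Z","agent":{"id":"8148e2e9-d14c-45d8-9ef9-881106eaa405"},"elastic_agent":{"process":"filebeat"}}
`)

	// Send the same document twice to simulate a retry after a lost
	// response: the first item returns 201, the second returns 409.
	for i := 0; i < 2; i++ {
		res, err := es.Bulk(bytes.NewReader(body),
			es.Bulk.WithIndex("metrics-elastic_agent.elastic_agent-default"))
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(res.String())
		res.Body.Close()
	}
}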
lalit-satapathy commented 1 year ago

This would be straightforward to implement if Beats had any knowledge that it was ingesting to a TSDB data stream, but the TSDB configuration applies purely on the Elasticsearch side and to Beats it is a normal data stream. We may need to consider other cases where a 409 Conflict is a legitimate error.

Summarizing the retry case for Metricbeat from the details above:

For a retry, will the agent be able to handle the 409 error differently when the document was already ingested into the TSDB index? Presumably we still need this error for regular usage (the non-retry case), when a document is ingested twice with the same timestamp into a TSDB index.

lucabelluccini commented 1 year ago

++ I think we would need some kind of specific error message in the error returned by ES to tell us the rejection is caused by TSDS.

felixbarny commented 1 year ago

In which cases may a retry on a 409 Conflict status be successful? Can we treat 409 responses as a permanent error and always exempt them from retries?

lucabelluccini commented 1 year ago

A retry on a 409 Conflict can succeed if there's a rollover triggered by Elasticsearch (ILM) between the first attempt and the retries on a "normal" data stream, and if the client uses an explicit _id or uses TSDB with a routing_path.

Afaik, 409s are typically a signal of duplicates for integrations (for Elastic Agent) or modules (for Beats) where the _id is set by "us". BTW, setting op_type to create would prevent overwriting anyway.

Note that if we're going towards the elastic shipper, where bulks from different components will possibly be bundled together, we might end up with a mix of TSDB and non-TSDB data streams.

A legit use case:

felixbarny commented 1 year ago

A retry on a 409 Conflict can succeed if there's a rollover triggered by Elasticsearch (ILM) between the first attempt and the retries on a "normal" data stream, and if the client uses an explicit _id or uses TSDB with a routing_path.

That's technically correct. But I doubt that we would want to retry an event in the hope there has been a rollover in the meantime? I don't see why we would ever want to retry on a 409 but maybe I'm missing something.

Generally, I'm skeptical that there's merit in retrying any event that gets a 4xx response except for 429 Too Many Requests.

A legit use case:

  • [...]
  • One of the 2 metrics will be rejected when it shouldn't be.

This sounds like dimensions have been misconfigured. The routing_path doesn't impact conflicts btw, it's about time_series_dimensions.

lucabelluccini commented 1 year ago

The routing_path doesn't impact conflicts btw, it's about time_series_dimensions.

TIL; the error responses were sending back the triplet of those values, so I thought they were the reason for the rejection.

Agree on the rest. Temporary errors are 5xx (e.g. unavailability) or 429 (circuit breakers, queues...).

cmacknz commented 1 year ago

That's technically correct. But I doubt that we would want to retry an event in the hope there has been a rollover in the meantime? I don't see why we would ever want to retry on a 409 but maybe I'm missing something.

Agreed. It is much simpler if we can always treat 409s as a non-retryable error because the document has already been ingested.
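As an illustration of that policy (my sketch, not current Beats source), classifying per-item _bulk statuses could be as simple as:

package esout // illustrative package name

import "net/http"

// isRetryableStatus sketches the proposed policy for per-item _bulk
// responses: a 409 means the document (or an equivalent one) already
// exists, so retrying can never succeed and the item should be dropped.
func isRetryableStatus(status int) bool {
	switch {
	case status == http.StatusTooManyRequests:
		return true // 429: back off and retry
	case status >= 500:
		return true // transient server-side failure
	default:
		return false // any other 4xx, including 409 Conflict, is permanent
	}
}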

cmacknz commented 1 year ago

https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#_retry_policy

Logstash logs 409 errors and drops them, and has done so for a long time; it should be fine to take the same approach in Beats.

409 errors (conflict) are logged as a warning and dropped.

Note that 409 exceptions are no longer retried. Please set a higher retry_on_conflict value if you experience 409 exceptions. It is more performant for Elasticsearch to retry these exceptions than this plugin.

eedugon commented 1 year ago

I'd like to share that not retrying these errors is probably the best approach here, but we shouldn't ignore them when it comes to logging: if a 409 response is related to a real bug, we would lose visibility.

As an example, in this issue we have recognized that the TSDB configuration in certain integrations is wrong and is causing metrics data loss due to incorrect duplicate detection: https://github.com/elastic/integrations/issues/7977

In conclusion:

cmacknz commented 1 year ago

Agreed, we need to log these by default since it is the best (only?) way to detect improperly configured TSDB dimensions.

I would follow the Logstash strategy of logging them by default but with an option to turn the logging off. Since agent logs are frequently shipped to Fleet, this will be useful in extreme cases where the volume of 409 errors is causing problems for the Fleet cluster.

belimawr commented 12 months ago

Folks, I'm not managing to reproduce this behaviour. Is there a simple example of running any Beat that consistently leads to 409s being retried?

I looked at the Beats code and 409s do not seem to be retried: https://github.com/elastic/beats/blob/6be0d18448bb84130cf1976b1521f45717c6b2fb/libbeat/outputs/elasticsearch/client.go#L414-L419

bulkCollectPublishFails is called by publishEvents, which is called by Publish, and Publish only retries if publishEvents returns some events: https://github.com/elastic/beats/blob/6be0d18448bb84130cf1976b1521f45717c6b2fb/libbeat/outputs/elasticsearch/client.go#L187-L215

Which it does not, at least in my tests.
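Paraphrased, the per-item handling around the linked lines behaves roughly like this (a sketch of the observed behavior with illustrative names, not the literal source):

package esout // illustrative package name

import "net/http"

// collectRetryable paraphrases the per-item handling: only 429 and 5xx
// items are collected for retry; every other non-2xx status, including
// 409 Conflict, is counted as dropped. All names here are illustrative.
func collectRetryable(events []any, itemStatuses []int) (failed []any, dropped int) {
	for i, status := range itemStatuses {
		if status < 300 {
			continue // item was ingested successfully
		}
		if status == http.StatusTooManyRequests || status >= 500 {
			failed = append(failed, events[i]) // returned to the pipeline for retry
			continue
		}
		dropped++ // other 4xx, including 409 Conflict, is dropped, not retried
	}
	return failed, dropped
}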

I tried to reproduce it by creating a TSDB data stream:

PUT _component_template/my-409-test
{
  "template": {
    "mappings": {
      "properties": {
        "message": {
          "type": "keyword"
        },
        "foo": {
          "type": "keyword",
          "time_series_dimension": true
        },
        "@timestamp": {
          "type": "date",
          "format": "strict_date_optional_time"
        }
      }
    }
  },
  "_meta": {
    "description": "Mappings for testing data"
  }
}

PUT _index_template/my-409-test-index-template
{
  "index_patterns": ["filebeat-*"],
  "data_stream": { },
  "template": {
    "settings": {
      "index.mode": "time_series",
      "index.routing_path": [ "foo"]
    }
  },
  "composed_of": [ "my-409-test"],
  "priority": 500,
  "_meta": {
    "description": "Template for test data"
  }
}

And then running a Go test using the Libbeat ES client:

func TestNew409(t *testing.T) {
    logp.DevelopmentSetup(logp.WithSelectors("*"))
    client, err := NewClient(
        ClientSettings{
            ConnectionSettings: eslegclient.ConnectionSettings{
                URL:      "https://localhost:9200",
                Username: "elastic",
                Password: "changeme",
                Transport: httpcommon.HTTPTransportSettings{
                    TLS: &tlscommon.Config{
                        VerificationMode: tlscommon.VerifyNone},
                },
            },
            Observer:           outputs.NewNilObserver(),
            NonIndexableAction: "drop",
            Index:              outil.MakeSelector(outil.ConstSelectorExpr("filebeat-test", outil.SelectorKeepCase)),
        },
        nil,
    )
    if err != nil {
        t.Fatalf("could not create ES client: %s", err)
    }

    now := time.Now()
    events := []publisher.Event{
        {
            Content: beat.Event{
                Timestamp: now,
                Fields: mapstr.M{
                    "message": "foo",
                    "foo":     "foo1",
                },
            },
        },
        {
            Content: beat.Event{
                Timestamp: now,
                Fields: mapstr.M{
                    "message": "bar",
                    "foo":     "foo2",
                },
            },
        },
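        // Same timestamp and same "foo" dimension value as the first
        // event, so ingesting it triggers a 409 on the TSDB data stream.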
        {
            Content: beat.Event{
                Timestamp: now,
                Fields: mapstr.M{
                    "message": "foo 2",
                    "foo":     "foo1",
                },
            },
        },
    }

    failed, err := client.publishEvents(context.TODO(), events)
    if err != nil {
        t.Errorf("failed to publish events: %s", err)
    }

    for _, evt := range failed {
        fmt.Printf("Failed Event fields: %#v\n", evt.Content.Fields)
    }
}

Which works fine:

=== RUN   TestNew409
{"log.level":"info","@timestamp":"2023-10-16T15:55:39.410+0200","log.logger":"esclientleg","log.origin":{"file.name":"eslegclient/connection.go","file.line":122},"message":"elasticsearch url: https://localhost:9200","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-10-16T15:55:39.410+0200","log.logger":"tls","log.origin":{"file.name":"tlscommon/tls_config.go","file.line":107},"message":"SSL/TLS verifications disabled.","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-10-16T15:55:39.410+0200","log.logger":"esclientleg","log.origin":{"file.name":"eslegclient/connection.go","file.line":284},"message":"ES Ping(url=https://localhost:9200)","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-10-16T15:55:39.410+0200","log.logger":"tls","log.origin":{"file.name":"tlscommon/tls_config.go","file.line":107},"message":"SSL/TLS verifications disabled.","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-10-16T15:55:39.413+0200","log.logger":"esclientleg","log.origin":{"file.name":"transport/logging.go","file.line":42},"message":"Completed dialing successfully","network":"tcp","address":"localhost:9200","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-10-16T15:55:39.414+0200","log.logger":"esclientleg","log.origin":{"file.name":"eslegclient/connection.go","file.line":303},"message":"Ping status code: 200","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-16T15:55:39.414+0200","log.logger":"esclientleg","log.origin":{"file.name":"eslegclient/connection.go","file.line":304},"message":"Attempting to connect to Elasticsearch version 8.8.1 (default)","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-10-16T15:55:39.432+0200","log.logger":"elasticsearch","log.origin":{"file.name":"elasticsearch/client.go","file.line":265},"message":"PublishEvents: 3 events have been published to elasticsearch in 22.458311ms.","ecs.version":"1.6.0"}
status, msg, err: 201  <nil>
status, msg, err: 201  <nil>
status, msg, err: 409 {"type":"version_conflict_engine_exception","reason":"[n3nc2IA-xOEbwlQvAAABizjFWRI][{foo=foo1}@2023-10-16T13:55:39.410Z]: version conflict, document already exists (current version [1])","index_uuid":"qGUPCejxRDawcbcnd0zhkA","shard":"0","index":".ds-filebeat-test-2023.10.16-000001"} <nil>
    client_test.go:991: Failed events: []
--- PASS: TestNew409 (0.02s)
PASS
ok      github.com/elastic/beats/v7/libbeat/outputs/elasticsearch       0.288s

And as expected only two events are ingested:

POST /filebeat-test/_search
{
  "query": {
    "match_all": {}
  }
}

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": ".ds-filebeat-test-2023.10.16-000001",
        "_id": "n3nc2IA-xOEbwlQvAAABizjFWRI",
        "_score": 1,
        "_source": {
          "@timestamp": "2023-10-16T13:55:39.410Z",
          "foo": "foo1",
          "message": "foo"
        }
      },
      {
        "_index": ".ds-filebeat-test-2023.10.16-000001",
        "_id": "btHh9gpvV0Gxjn5uAAABizjFWRI",
        "_score": 1,
        "_source": {
          "@timestamp": "2023-10-16T13:55:39.410Z",
          "foo": "foo2",
          "message": "bar"
        }
      }
    ]
  }
}
cmacknz commented 12 months ago

Great if the ES output already does this (I swear I checked this, but perhaps I only searched for http.StatusConflict). In the issue that uncovered this we were using the Logstash output. Not sure if it is the same there (or if we even get a 409 back from LS in this case).

The real root cause here may be https://github.com/elastic/integrations/issues/7977 which was discovered after this issue was filed. Anyone enabling agent metrics collection with a Logstash output will see these errors today because our TSDB dimensions in the agent package aren't correct.

If there are no changes required in the code we should at least update the ES output documentation to note that 409 errors are not retried.

belimawr commented 12 months ago

Tomorrow I'll try the LS output and see if there is anything needed in Beats for that case. I'll also update the ES documentation.

belimawr commented 12 months ago

I was looking at the docs and we do not have any entry about how the different responses from ES are handled, nor do they mention which ES APIs we use.

I'm thinking about the best way to add this. Should we add a whole section about the APIs used (for sending data, as far as I know, it's just the bulk API) and how the different codes are handled?

belimawr commented 12 months ago

I managed to reproduce with LS output for both standalone and Fleet managed.

LS just logs at warn level:

[2023-10-17T12:53:26,563][WARN ][logstash.outputs.elasticsearch][main][b07c36d904c438de30eb6b997e483472ec6510ed01e4cf959630c1ff7d6e94b3] Failed action {:status=>409, :action=>["create", {:_id=>nil, :_index=>"metrics-elastic_agent.filebeat-default", :routing=>nil}, {"beat"=>{"id"=>"6a9c260d-adb1-4fb2-9947-557f4c363d68", "type"=>"filebeat", "stats"=>{"runtime"=>{"goroutines"=>57}, "beat"=>{"type"=>"filebeat", "host"=>"millennium-falcon", "version"=>"8.10.3", "name"=>"millennium-falcon", "uuid"=>"6a9c260d-adb1-4fb2-9947-557f4c363d68"}, "cgroup"=>{"cpu"=>{"stats"=>{"periods"=>0, "throttled"=>{"periods"=>0, "ns"=>0}}, "id"=>"emacs.service"}, "memory"=>{"mem"=>{"usage"=>{"bytes"=>9438642176.0}}, "id"=>"emacs.service"}}, "libbeat"=>{"config"=>{"stops"=>0, "running"=>2, "reloads"=>0, "starts"=>2}, "output"=>{"events"=>{"failed"=>0, "total"=>0, "toomany"=>0, "batches"=>0, "dropped"=>0, "acked"=>0, "duplicates"=>0, "active"=>0}, "type"=>"logstash", "read"=>{"errors"=>0, "bytes"=>0}, "write"=>{"errors"=>0, "bytes"=>0}}, "pipeline"=>{"clients"=>2, "events"=>{"failed"=>0, "filtered"=>0, "total"=>0, "dropped"=>0, "published"=>0, "retry"=>0, "active"=>0}, "queue"=>{"acked"=>0}}}, "memstats"=>{"gc_next"=>37850624, "memory"=>{"alloc"=>36210904, "total"=>108616976}, "rss"=>130453504}, "handles"=>{"open"=>16, "limit"=>{"hard"=>524288, "soft"=>524288}}, "system"=>{"cpu"=>{"cores"=>16}, "load"=>{"norm"=>{"15"=>0.0456, "5"=>0.0456, "1"=>0.06}, "1"=>0.96, "15"=>0.73, "5"=>0.73}}, "cpu"=>{"system"=>{"ticks"=>140, "time"=>{"ms"=>140}}, "total"=>{"ticks"=>310, "value"=>310, "time"=>{"ms"=>310}}, "user"=>{"ticks"=>170, "time"=>{"ms"=>170}}}, "info"=>{"ephemeral_id"=>"5ba76f83-c083-41d6-a9dd-7e4b60444221", "version"=>"8.10.3", "name"=>"filebeat", "uptime"=>{"ms"=>231393}}, "uptime"=>{"ms"=>231393}}}, "data_stream"=>{"namespace"=>"default", "type"=>"metrics", "dataset"=>"elastic_agent.filebeat"}, "event"=>{"duration"=>3351225, "dataset"=>"elastic_agent.filebeat", "module"=>"beat"}, "@timestamp"=>2023-10-17T10:53:25.277Z, "@version"=>"1", "ecs"=>{"version"=>"8.0.0"}, "elastic_agent"=>{"id"=>"83e2fcdf-5a09-4759-b4a6-a1eed5f16233", "snapshot"=>false, "version"=>"8.10.3", "process"=>"filebeat"}, "service"=>{"type"=>"beat", "address"=>"http://unix/stats", "name"=>"beat"}, "agent"=>{"type"=>"metricbeat", "id"=>"83e2fcdf-5a09-4759-b4a6-a1eed5f16233", "version"=>"8.10.3", "name"=>"millennium-falcon", "ephemeral_id"=>"befbd997-2651-40b7-a147-db1662289153"}, "host"=>{"hostname"=>"millennium-falcon", "architecture"=>"x86_64", "os"=>{"version"=>"", "name"=>"Arch Linux", "family"=>"arch", "kernel"=>"6.5.7-arch1-1", "build"=>"rolling", "type"=>"linux", "platform"=>"arch"}, "id"=>"851f339d77174301b29e417ecb2ec6a8", "name"=>"some-hostname", "mac"=>["01-02-03-04-05-06"], "ip"=>["192.168.0.1",, "172.17.0.1"], "containerized"=>false}, "tags"=>["beats_input_raw_event"], "metricset"=>{"period"=>10000, "name"=>"stats"}}], :response=>{"create"=>{"_index"=>".ds-metrics-elastic_agent.filebeat-default-2023.10.17-000001", "_id"=>"JOg3kN-CYNHveGcCAAABiz1E3Z0", "status"=>409, "error"=>{"type"=>"version_conflict_engine_exception", "reason"=>"[JOg3kN-CYNHveGcCAAABiz1E3Z0][{agent.id=83e2fcdf-5a09-4759-b4a6-a1eed5f16233}@2023-10-17T10:53:25.277Z]: version conflict, document already exists (current version [1])", "index_uuid"=>"pjNMFNEZSHOA84dSgiTfIw", "shard"=>"0", "index"=>".ds-metrics-elastic_agent.filebeat-default-2023.10.17-000001"}}}}

[2023-10-17T12:53:36,564][WARN ][logstash.outputs.elasticsearch][main][b07c36d904c438de30eb6b997e483472ec6510ed01e4cf959630c1ff7d6e94b3] Failed action {:status=>409, :action=>["create", {:_id=>nil, :_index=>"metrics-elastic_agent.filebeat-default", :routing=>nil}, {"beat"=>{"id"=>"6a9c260d-adb1-4fb2-9947-557f4c363d68", "type"=>"filebeat", "stats"=>{"beat"=>{"type"=>"filebeat", "host"=>"some-hostname", "version"=>"8.10.3", "name"=>"some-hostname", "uuid"=>"6a9c260d-adb1-4fb2-9947-557f4c363d68"}, "runtime"=>{"goroutines"=>57}, "cgroup"=>{"cpu"=>{"stats"=>{"periods"=>0, "throttled"=>{"periods"=>0, "ns"=>0}}, "id"=>"emacs.service"}, "memory"=>{"mem"=>{"usage"=>{"bytes"=>9438232576.0}}, "id"=>"emacs.service"}}, "libbeat"=>{"config"=>{"running"=>2, "stops"=>0, "reloads"=>0, "starts"=>2}, "output"=>{"events"=>{"failed"=>0, "total"=>0, "toomany"=>0, "batches"=>0, "dropped"=>0, "acked"=>0, "duplicates"=>0, "active"=>0}, "type"=>"logstash", "read"=>{"errors"=>0, "bytes"=>0}, "write"=>{"errors"=>0, "bytes"=>0}}, "pipeline"=>{"clients"=>2, "events"=>{"failed"=>0, "filtered"=>0, "total"=>0, "dropped"=>0, "published"=>0, "retry"=>0, "active"=>0}, "queue"=>{"acked"=>0}}}, "memstats"=>{"gc_next"=>37546096, "memory"=>{"alloc"=>20494848, "total"=>110690024}, "rss"=>130453504}, "handles"=>{"limit"=>{"hard"=>524288, "soft"=>524288}, "open"=>17}, "system"=>{"cpu"=>{"cores"=>16}, "load"=>{"norm"=>{"1"=>0.0606, "15"=>0.0456, "5"=>0.0463}, "1"=>0.97, "15"=>0.73, "5"=>0.74}}, "cpu"=>{"system"=>{"ticks"=>140, "time"=>{"ms"=>140}}, "total"=>{"ticks"=>310, "value"=>310, "time"=>{"ms"=>310}}, "user"=>{"ticks"=>170, "time"=>{"ms"=>170}}}, "info"=>{"ephemeral_id"=>"5ba76f83-c083-41d6-a9dd-7e4b60444221", "version"=>"8.10.3", "name"=>"filebeat", "uptime"=>{"ms"=>241392}}, "uptime"=>{"ms"=>241392}}}, "data_stream"=>{"namespace"=>"default", "type"=>"metrics", "dataset"=>"elastic_agent.filebeat"}, "event"=>{"duration"=>2743292, "dataset"=>"elastic_agent.filebeat", "module"=>"beat"}, "@timestamp"=>2023-10-17T10:53:35.277Z, "@version"=>"1", "ecs"=>{"version"=>"8.0.0"}, "elastic_agent"=>{"id"=>"83e2fcdf-5a09-4759-b4a6-a1eed5f16233", "snapshot"=>false, "version"=>"8.10.3", "process"=>"filebeat"}, "agent"=>{"type"=>"metricbeat", "ephemeral_id"=>"befbd997-2651-40b7-a147-db1662289153", "version"=>"8.10.3", "id"=>"83e2fcdf-5a09-4759-b4a6-a1eed5f16233", "name"=>"some-hostname"}, "service"=>{"type"=>"beat", "address"=>"http://unix/stats", "name"=>"beat"}, "host"=>{"hostname"=>"some-hostname", "architecture"=>"x86_64", "os"=>{"version"=>"", "name"=>"Arch Linux", "family"=>"arch", "kernel"=>"6.5.7-arch1-1", "build"=>"rolling", "type"=>"linux", "platform"=>"arch"}, "id"=>"851f339d77174301b29e417ecb2ec6a8", "name"=>"millennium-falcon", "mac"=>["01-02-03-04-05-06"], "ip"=>["192.168.0.1",, "172.17.0.1"], "containerized"=>false}, "tags"=>["beats_input_raw_event"], "metricset"=>{"period"=>10000, "name"=>"stats"}}], :response=>{"create"=>{"_index"=>".ds-metrics-elastic_agent.filebeat-default-2023.10.17-000001", "_id"=>"JOg3kN-CYNHveGcCAAABiz1FBK0", "status"=>409, "error"=>{"type"=>"version_conflict_engine_exception", "reason"=>"[JOg3kN-CYNHveGcCAAABiz1FBK0][{agent.id=83e2fcdf-5a09-4759-b4a6-a1eed5f16233}@2023-10-17T10:53:35.277Z]: version conflict, document already exists (current version [1])", "index_uuid"=>"pjNMFNEZSHOA84dSgiTfIw", "shard"=>"0", "index"=>".ds-metrics-elastic_agent.filebeat-default-2023.10.17-000001"}}}}

I'll look into fixing the mappings.

cmacknz commented 12 months ago

I'm thinking about the best way to add this. Should we add a whole section about the APIs used (for sending data, as far as I know, it's just the bulk API) and how the different codes are handled?

This sounds reasonable. Adding a quick summary to the ES output documentation for the Beats, confirming we use the _bulk API and describing any special handling for specific errors, makes sense. https://www.elastic.co/guide/en/beats/filebeat/current/elasticsearch-output.html for example.

consulthys commented 5 months ago

@belimawr There is a very simple way to reproduce this behavior as demonstrated here (see case 2). When using Metricbeat to feed Stack Monitoring, the elasticsearch module of Metricbeat ships elasticsearch.shard documents with concrete IDs built from the current cluster state (i.e., state_uuid) and some other constant data. Since the cluster state doesn't change at the same pace as Metricbeat collection rounds (10s by default), those version conflicts happen all the time.

Those version conflicts are actually a side effect of switching to data streams in 8.0.0 (i.e. put-if-absent semantics with a concrete ID) and weren't apparent earlier. Since each elasticsearch.shard document describes a shard placement in the cluster, the logic makes sense: there's no point re-indexing a document whose content hasn't changed since the last collection round.

However, we could/should go one step further and detect whether the cluster state has changed between two collection rounds (i.e. simply compare the old and new state_uuid). If there's no change, there's no point even rebuilding those documents and sending them again, since we know they'll bounce anyway. Doing so wastes network bandwidth and CPU/RAM resources on the ES side. For big clusters with many thousands of shards, that can make a big difference.
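A sketch of that optimization (shardCollector, fetchState, and buildShardEvents are hypothetical names, not the module's real identifiers):

package elasticsearchmodule // illustrative package name

// shardCollector remembers the last cluster state UUID between rounds.
type shardCollector struct {
	lastStateUUID string
}

type clusterState struct {
	UUID string // the state_uuid from the cluster state API
	// ...routing table and whatever else the documents are built from
}

func (c *shardCollector) collect(
	fetchState func() (clusterState, error),
	buildShardEvents func(clusterState) []map[string]any,
) ([]map[string]any, error) {
	state, err := fetchState()
	if err != nil {
		return nil, err
	}
	// Unchanged state_uuid: every elasticsearch.shard document would get
	// the same _id as the previous round and bounce with a 409, so skip
	// building and sending them entirely.
	if state.UUID == c.lastStateUUID {
		return nil, nil
	}
	c.lastStateUUID = state.UUID
	return buildShardEvents(state), nil
}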

I was about to create a new issue for reporting this until I stumbled upon this issue. Since this case is closed, let me know if I should proceed with another issue or if this is already being taken care of in another issue I haven't yet stumbled upon.

Thank you very much.

cmacknz commented 5 months ago

Please go ahead and create a new issue, it's a related problem but the fix is in a different place and a separate issue will make that clear.