elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Elasticsearch output does not recover after connection failure #40705

Open · belimawr opened 1 week ago

belimawr commented 1 week ago

For confirmed bugs, please report:

Steps to reproduce

  1. Deploy a Filebeat sending data to a remote Elasticsearch using a domain name in the configuration
  2. Confirm Filebeat is running and sending data
  3. Disable the network
  4. Wait for a DNS lookup error and the publisher errors:
    {
      "log.level": "warn",
      "@timestamp": "2024-09-06T08:30:33.240-0400",
      "log.logger": "transport",
      "log.origin": {
        "function": "github.com/elastic/elastic-agent-libs/transport.TestNetDialer.func1",
        "file.name": "transport/tcp.go",
        "file.line": 53
      },
      "message": "DNS lookup failure \"remote-es.elastic.cloud\": lookup remote-es.elastic.cloud: Temporary failure in name resolution",
      "service.name": "filebeat",
      "ecs.version": "1.6.0"
    }
    {
      "log.level": "error",
      "@timestamp": "2024-09-06T15:50:15.140-0400",
      "log.logger": "publisher_pipeline_output",
      "log.origin": {
        "function": "github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).run",
        "file.name": "pipeline/client_worker.go",
        "file.line": 148
      },
      "message": "Failed to connect to backoff(elasticsearch(https://remote-es.elastic.cloud:443)): Get \"https://remote-es.elastic.cloud:443\": context canceled",
      "service.name": "filebeat",
      "ecs.version": "1.6.0"
    }
    {
      "log.level": "info",
      "@timestamp": "2024-09-06T15:50:15.140-0400",
      "log.logger": "publisher_pipeline_output",
      "log.origin": {
        "function": "github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).run",
        "file.name": "pipeline/client_worker.go",
        "file.line": 139
      },
      "message": "Attempting to reconnect to backoff(elasticsearch(https://remote-es.elastic.cloud:443)) with 475 reconnect attempt(s)",
      "service.name": "filebeat",
      "ecs.version": "1.6.0"
    }
  5. Enable the network
  6. Ensure the machine can reach the internet and the remote Elasticsearch
  7. Filebeat will keep logging the same publisher errors and will not resume sending data

The configuration I used:

filebeat.inputs:
  - type: filestream
    id: my-filestream-id
    paths:
      - /tmp/some-logs.txt

output.elasticsearch:
  hosts: ["https://remote-es.elastic.cloud:443"]
  username: elastic
  password: some-very-secret-password

elasticmachine commented 1 week ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

cmacknz commented 1 week ago

Get \"https://remote-es.elastic.cloud:443\": context canceled

This isn't a DNS error, so to me it looks like we did recover from the DNS failure but something else is wrong, perhaps a context that expired when it shouldn't have somewhere in one of the initial connection callbacks. The Get \"https://remote-es.elastic.cloud:443\" is likely to get the version and deployment type.

https://github.com/elastic/beats/blob/8a56ef81034dc0f917e132fd3cd0189891797348/libbeat/cmd/instance/beat.go#L1099

https://github.com/elastic/beats/blob/8a56ef81034dc0f917e132fd3cd0189891797348/libbeat/esleg/eslegclient/connection.go#L400-L408

https://github.com/elastic/beats/blob/8a56ef81034dc0f917e132fd3cd0189891797348/libbeat/esleg/eslegclient/connection.go#L326-L332

Perhaps we are closing the connection after the DNS failure instead of reusing it.
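
For reference, this exact error shape can be reproduced with nothing but net/http and an already-cancelled context. A standalone sketch (not Beats code; the host is only the one from the logs above):

package main

import (
    "context"
    "fmt"
    "net/http"
)

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    cancel() // simulate a context that was cancelled earlier, e.g. by a Close()

    // The request never reaches the network: Do fails as soon as it sees the
    // cancelled context.
    req, _ := http.NewRequestWithContext(ctx, http.MethodGet, "https://remote-es.elastic.cloud:443", nil)
    _, err := http.DefaultClient.Do(req)
    fmt.Println(err) // Get "https://remote-es.elastic.cloud:443": context canceled
}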

belimawr commented 1 week ago

That makes sense. I didn't investigate the issue; I only managed to reproduce and report it, and the DNS error seemed to be the trigger for not being able to connect to ES again.

cmacknz commented 1 week ago

Potentially related: https://github.com/elastic/beats/pull/40572

cmacknz commented 1 week ago

^Looking at the changes there I see:

// Close closes a connection.
func (conn *Connection) Close() error {
    conn.HTTP.CloseIdleConnections()
+   conn.cancelReqs()
    return nil
}

Creating a new context only happens when NewConnection is called, and there must be code paths that do not do that properly after Close is called.
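
A simplified sketch of that lifecycle (names are illustrative and do not match the real eslegclient code exactly): the context is created once in NewConnection, Close cancels it, and nothing recreates it, so every later request is dead on arrival.

package sketch

import (
    "context"
    "net/http"
)

// Connection is a simplified stand-in for the real eslegclient connection.
type Connection struct {
    http       *http.Client
    reqsCtx    context.Context    // created exactly once, in NewConnection
    cancelReqs context.CancelFunc // cancels reqsCtx
}

func NewConnection() *Connection {
    ctx, cancel := context.WithCancel(context.Background())
    return &Connection{http: http.DefaultClient, reqsCtx: ctx, cancelReqs: cancel}
}

func (c *Connection) Close() error {
    c.http.CloseIdleConnections()
    c.cancelReqs() // the call added by #40572: permanently cancels reqsCtx
    return nil
}

func (c *Connection) execRequest(url string) error {
    // Every request is bound to reqsCtx. Once Close has run, reqsCtx stays
    // cancelled forever, so this always fails with "context canceled".
    req, err := http.NewRequestWithContext(c.reqsCtx, http.MethodGet, url, nil)
    if err != nil {
        return err
    }
    resp, err := c.http.Do(req)
    if err != nil {
        return err
    }
    return resp.Body.Close()
}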

This change is only in 8.15.1 from what I can tell.

cmacknz commented 1 week ago

Linking the 8.15.0 backport: https://github.com/elastic/beats/commit/b19844ffdae6861feaf2ee02ce11936d80b243cb

git tag --contains b19844ffdae6861feaf2ee02ce11936d80b243cb
v8.15.1

cmacknz commented 1 week ago

The first thing we need to do is write a test that reproduces this problem.

cmacknz commented 6 days ago

I think we should revert https://github.com/elastic/beats/pull/40572 once we double-check that removing it fixes the problem, then work on re-adding it with a fix and a test for this.

This issue is more severe than the original problem that PR was trying to fix.

cmacknz commented 6 days ago

FYI @marc-gr

belimawr commented 6 days ago

I've been testing and you're correct, Craig, the conn.cancelReqs() is the issue. The Close method from the connection https://github.com/elastic/beats/blob/cb577317da1587934d728bdcd7658176a430dee2/libbeat/esleg/eslegclient/connection.go#L327-L331

did not use to cancel any in-flight requests, nor did it render the connection unusable.

https://github.com/elastic/beats/pull/40572 cannot be automatically reverted :/, so I'll create a PR removing the culprit line, which effectively restores the Elasticsearch client's old behaviour of not cancelling in-flight requests. Everything else added by https://github.com/elastic/beats/pull/40572 should still work fine.

belimawr commented 6 days ago

The "revert-ish" PR: https://github.com/elastic/beats/pull/40769

belimawr commented 6 days ago

Investigating more, I found the root cause of the issue. On error our backoffClient will call Close() on the client: https://github.com/elastic/beats/blob/cb577317da1587934d728bdcd7658176a430dee2/libbeat/outputs/backoff.go#L60-L67

The Client then calls Close on the connection: https://github.com/elastic/beats/blob/cb577317da1587934d728bdcd7658176a430dee2/libbeat/outputs/elasticsearch/client.go#L539-L541

The connection Close method is https://github.com/elastic/beats/blob/cb577317da1587934d728bdcd7658176a430dee2/libbeat/esleg/eslegclient/connection.go#L327-L331

When https://github.com/elastic/beats/pull/40572 was merged, the call to conn.cancelReqs() was introduced, which cancels the context created by NewConnection https://github.com/elastic/beats/blob/cb577317da1587934d728bdcd7658176a430dee2/libbeat/esleg/eslegclient/connection.go#L187-L196

that is used in every request (L404) https://github.com/elastic/beats/blob/cb577317da1587934d728bdcd7658176a430dee2/libbeat/esleg/eslegclient/connection.go#L400-L413

and never recreated, which renders the whole Connection unusable; that was not the old behaviour.
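
Put together, the chain looks roughly like this (a condensed sketch of the behaviour described above, not the real backoff client): the wrapper closes the client on the first error, every retry afterwards reuses the cancelled context, and the output never recovers even once the network is back.

package main

import (
    "context"
    "errors"
    "fmt"
)

// client stands in for the Elasticsearch output client and its shared request context.
type client struct {
    reqsCtx context.Context
    cancel  context.CancelFunc
}

func newClient() *client {
    ctx, cancel := context.WithCancel(context.Background())
    return &client{reqsCtx: ctx, cancel: cancel}
}

// Close mimics what the connection's Close does since #40572: it cancels the
// shared context, and nothing ever recreates it.
func (c *client) Close() error {
    c.cancel()
    return nil
}

// Publish mimics a request bound to reqsCtx.
func (c *client) Publish(networkUp bool) error {
    if err := c.reqsCtx.Err(); err != nil {
        return err // "context canceled" forever after the first Close
    }
    if !networkUp {
        return errors.New("lookup remote-es.elastic.cloud: temporary failure in name resolution")
    }
    return nil
}

func main() {
    c := newClient()

    // First attempt: the network is down, so the backoff wrapper closes the client.
    if err := c.Publish(false); err != nil {
        fmt.Println("attempt 1:", err)
        c.Close()
    }

    // The network comes back, but the client is still unusable.
    fmt.Println("attempt 2:", c.Publish(true)) // attempt 2: context canceled
}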

marc-gr commented 6 days ago

@belimawr thanks for taking a look, would it be enough to recreate the context on close to make the client reusable? I think just removing the call to cancelReqs might make the stop racy, since IIRC this was the main reason publishers were not closed before on stop.
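
For context, the behaviour cancelReqs was meant to provide, roughly sketched (illustrative only, not the Beats code): without cancelling the shared context on stop, a publisher blocked inside an in-flight request keeps the shutdown waiting; cancelling the context releases it immediately.

package main

import (
    "context"
    "fmt"
    "net/http"
    "net/http/httptest"
    "sync"
    "time"
)

func main() {
    // A test server that never answers, standing in for a hung Elasticsearch.
    hung := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        <-r.Context().Done()
    }))
    defer hung.Close()

    ctx, cancelReqs := context.WithCancel(context.Background())

    var wg sync.WaitGroup
    wg.Add(1)
    go func() {
        defer wg.Done()
        req, _ := http.NewRequestWithContext(ctx, http.MethodGet, hung.URL, nil)
        _, err := http.DefaultClient.Do(req)
        fmt.Println("in-flight request finished:", err)
    }()

    // "Stop": without cancelReqs() the Wait below would block until the server
    // answered; cancelling the shared context releases the publisher promptly.
    time.Sleep(100 * time.Millisecond)
    cancelReqs()
    wg.Wait()
}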

belimawr commented 6 days ago

would it be enough to recreate the context on close to make the client reusable?

Mostly. The same instance of the connection is also used in a callback, so I made sure they both hold a pointer reference, so that when the context is recreated both can use the new one.

I think just removing the call to cancelReqs might make the stop racy, since IIRC this was the main reason publishers were not closed before on stop.

The PR removing the call to cancelReqs was just a quick patch to keep main releasable; I've just created a new PR with the proper fix.
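
For illustration, one possible shape of such a fix (a sketch only, not the actual PR): the context is recreated whenever the connection is (re)established, and both the publisher and the callback reach it through the shared *Connection pointer, so they always see the current context.

package sketch

import (
    "context"
    "sync"
)

// Connection is a simplified stand-in; the real change lives in eslegclient.
type Connection struct {
    mu         sync.Mutex
    reqsCtx    context.Context
    cancelReqs context.CancelFunc
}

// Connect (re)creates the request context, making the connection usable again
// after a previous Close.
func (c *Connection) Connect() error {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.reqsCtx, c.cancelReqs = context.WithCancel(context.Background())
    return nil
}

// Close still cancels in-flight requests, but no longer poisons the connection
// for good: the next Connect installs a fresh context.
func (c *Connection) Close() error {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.cancelReqs != nil {
        c.cancelReqs()
    }
    return nil
}

// requestContext is used by both the publisher path and the version-check
// callback; because they share the *Connection pointer, both always get the
// latest context.
func (c *Connection) requestContext() context.Context {
    c.mu.Lock()
    defer c.mu.Unlock()
    return c.reqsCtx
}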