Open belimawr opened 1 week ago
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Get \"https://remote-es.elastic.cloud:443\": context canceled
This isn't a DNS error, so to me it looks like we did recover from the DNS failure but something else is wrong, perhaps a context that expired when it shouldn't have somewhere in one of the initial connection callbacks. The Get \"https://remote-es.elastic.cloud:443\" is likely the request that fetches the version and deployment type.
Perhaps we are closing the connection after the DNS failure instead of reusing it.
That makes sense. I didn't investigate the issue, I only managed to reproduce and report it and the DNS error seemed to be the trigger for not being able to connect to ES again.
Potentially related: https://github.com/elastic/beats/pull/40572
^Looking at the changes there I see:
  // Close closes a connection.
  func (conn *Connection) Close() error {
      conn.HTTP.CloseIdleConnections()
+     conn.cancelReqs()
      return nil
  }
A new context is only created when NewConnection is called, so there must be code paths that do not create a new connection after Close is called.
This change is only in 8.15.1 from what I can tell.
Linking the 8.15.0 backport: https://github.com/elastic/beats/commit/b19844ffdae6861feaf2ee02ce11936d80b243cb
git tag --contains b19844ffdae6861feaf2ee02ce11936d80b243cb
v8.15.1
The first thing we need to do is write a test that reproduces this problem.
I think we should revert https://github.com/elastic/beats/pull/40572 once we double check that removing it fixes the problem, then work on re-adding it with a fix + test for this.
This issue is more severe than the original problem that PR was trying to fix.
FYI @marc-gr
I've been testing and you're correct, Craig: conn.cancelReqs() is the issue. The Close method of the connection
https://github.com/elastic/beats/blob/cb577317da1587934d728bdcd7658176a430dee2/libbeat/esleg/eslegclient/connection.go#L327-L331
did not previously cancel in-flight requests, nor did it render the connection unusable.
https://github.com/elastic/beats/pull/40572 cannot be automatically reverted :/, I'll create a PR removing the culprit line, which effectively restores the old behaviour from the Elasticsearch client of not cancelling in-flight requests. Everything else added by https://github.com/elastic/beats/pull/40572 should still work fine.
The "revert-ish" PR: https://github.com/elastic/beats/pull/40769
Investigating more, I found the root cause of the issue. On error our backoffClient will call Close() on the client:
https://github.com/elastic/beats/blob/cb577317da1587934d728bdcd7658176a430dee2/libbeat/outputs/backoff.go#L60-L67
The Client then calls Close on the connection:
https://github.com/elastic/beats/blob/cb577317da1587934d728bdcd7658176a430dee2/libbeat/outputs/elasticsearch/client.go#L539-L541
The connection Close method is https://github.com/elastic/beats/blob/cb577317da1587934d728bdcd7658176a430dee2/libbeat/esleg/eslegclient/connection.go#L327-L331
When https://github.com/elastic/beats/pull/40572 was merged, the call to conn.cancelReqs() was introduced. It cancels the context created by NewConnection
https://github.com/elastic/beats/blob/cb577317da1587934d728bdcd7658176a430dee2/libbeat/esleg/eslegclient/connection.go#L187-L196
which is used in every request (L404) https://github.com/elastic/beats/blob/cb577317da1587934d728bdcd7658176a430dee2/libbeat/esleg/eslegclient/connection.go#L400-L413
and is never recreated. This renders the whole Connection unusable, which was not the old behaviour.
@belimawr thanks for taking a look. Would it be enough to recreate the context on close to make the client reusable? I think just removing the call to cancelReqs might make the stop racy, since IIRC this was the main reason publishers were not closed before on stop.
> would be enough to recreate the context on close to make the client reusable?
Mostly. The same connection instance is also used in a callback, so I also made sure both hold a pointer to it; that way, when the context is recreated, both use the new one.
> I think just removing the call to cancelReqs might make the stop racey since IIRC this was the main reason publishers were not closed before on stop.
The PR removing the call to cancelReqs was just a quick patch to keep main releasable. I've just created a new PR with the proper fix.
For confirmed bugs, please report:
main, v8.15.1
Steps to reproduce
The configuration I used: