Open strawgate opened 1 month ago
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
This appears to be called during the keep alive session maintenance every time data is read from the buffer? The last data in the buffer is the 200 OK from the server so we are receiving the full response.
This error does not happen only in this situation; it can also happen with a legitimately interrupted connection on either side. I don't think it's a good idea to fully conceal this error. It would make it difficult to find some networking issues (a proxy, for example) on the customer's side.
So, we cannot simply change https://github.com/elastic/elastic-agent-libs/blob/01275338dc278335b4a8c0f23055014b4c0702dc/transport/logging.go#L48-L54 to ignore the error. If we want to address this issue properly, we need to find a way to handle this error in those "keep alive session maintenance" requests instead. The error is simply returned from the Read function, so it can be handled a level higher.
I think we should find the places where those "keep alive requests" happen and send them without the logger connection wrapper at all. We should not care about their success generally.
For example, we could add a new function Connection() net.Conn to the logging wrapper that returns the underlying connection, which you could then use without logging. That would need some ugly type casting though, so I'm not sure.
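For illustration, a minimal sketch of that accessor, assuming loggingConn embeds net.Conn and holds a logger as the Read snippet later in this thread suggests (the unwrap helper is hypothetical and only shows the type assertion a caller would need):

// Connection exposes the wrapped net.Conn so callers that do not want
// read/write errors debug-logged can bypass the wrapper.
func (l *loggingConn) Connection() net.Conn {
	return l.Conn
}

// unwrap shows the "ugly type casting": net/http only sees a plain net.Conn,
// so reaching the accessor requires a type assertion back to *loggingConn.
func unwrap(c net.Conn) net.Conn {
	if lc, ok := c.(*loggingConn); ok {
		return lc.Connection()
	}
	return c
}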
I think we should find the places where those "keep alive requests" happen and send them without the logger connection wrapper at all. We should not care about their success generally.
The keepalives are most probably on the connections to the Elasticsearch _bulk endpoint in the Beats. That is pretty much the only network connection they regularly make.
The idle connection timeout is managed within the net/http implementation: https://cs.opensource.google/go/go/+/master:src/net/http/transport.go;drc=77e42fdeaf98d241dc09f4eb92e09225aea8f4c3;l=1090
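For reference, the teardown is driven by the transport's IdleConnTimeout: once it elapses, net/http closes the idle keep-alive connection from our side, which is why the next Read returns "use of closed network connection" rather than an io.EOF from the server. A minimal sketch (the timeout value is illustrative, not the Beats default):

func newClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			// After this much idle time net/http closes the kept-alive
			// connection itself; the next Read on the wrapped conn then
			// fails with "use of closed network connection", not io.EOF.
			IdleConnTimeout: 90 * time.Second,
		},
	}
}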
This appears to be called during the keep alive session maintenance every time data is read from the buffer? The last data in the buffer is the 200 OK from the server so we are receiving the full response.
@strawgate sounds like you spent some time debugging it, could you give us some pointers? Where did you see this HTTP request and how did you read the buffer?
If you set a breakpoint at https://github.com/elastic/elastic-agent-libs/blob/4babafd5ed1e5079acf74212ed3da01740b22de7/transport/logging.go#L50, wait for it to hit, and do one step out, you'll land in net/http/transport.go at Read, and you can read the buffer from the vscode debugger under p:
Example:
"HTTP/1.1 200 OK\r\nContent-Encoding: gzip\r\nContent-Type: application/json\r\nX-Cloud-Request-Id: hx4DEVDVTJettHh_3xDChw\r\nX-Elastic-Product: Elasticsearch\r\nX-Found-Handling-Cluster: 856e0d3694624bf18dffce04dd4505f0\r\nX-Found-Handling-Instance: instance-0000000002\r\nDate: Thu, 24 Oct 2024 14:24:19 GMT\r\nContent-Length: 124\r\n\r\n\x1f\x8b\b\x00\x00\x00\x00\x00\x00\x00\xec\x971\n\x80@\f\x04\xff\x92\xfa\n\xb5\xbc\xaf\x88\xc5!\x11\x04EHbu\xe4\xef\xde'\xec\xa6L҅dw\xa7\x8b\x9a=\xe6R\x8fv\xb9\x169C\xefQ\xad]v\xd3\x16*\xb5\x8bG\x8bw4\x97i\xce,L\xd8\x01w\xc0/\xa0\ah\"\xbe\x807\x92\x0f\xc8H\xe4D\xb22\xbc\x00\x19AF\x90\x11d\xf4\v\x19m\xf9\x01\x00\x00\xff\xff\x03\x00PY\xab8\xac\x11\x00\x00\xcd1f-\x0e\f\x7f\x1e\xd8\xd1\x04\xc5\xfa\xbaTNXoN6\xac\xc1\t\xffvˋ\xb9\xae\xf0\xd7\xc7\x04\xaa\xfe\x12r\xeb(\xf3\x1e\xd8\xf7\xf4\xc7\xe7\x04\xfadt\xd2*\xa1Fǎϩd\xc15\x92k\x94J\xd4?\x8d\x13...+3584 more"
When I first looked at this I wondered if we could catch this by not logging when we get an error but also a non-zero n indicating we actually read data. Yes, the connection closed, but we also read something from it, so the fact that it closed isn't interesting yet.
func (l *loggingConn) Read(b []byte) (int, error) {
	n, err := l.Conn.Read(b)
	if err != nil && !errors.Is(err, io.EOF) {
		l.logger.Debugf("Error reading from connection: %v", err)
	}
	return n, err
}
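Spelled out, that idea would look roughly like this (a sketch only, not a tested fix; see the follow-up later in the thread on why it is not sufficient):

// Sketch: only debug-log the read error when the same call returned no data,
// on the theory that a close right after successfully reading the response is
// expected during idle-connection teardown.
func (l *loggingConn) Read(b []byte) (int, error) {
	n, err := l.Conn.Read(b)
	if err != nil && n == 0 && !errors.Is(err, io.EOF) {
		l.logger.Debugf("Error reading from connection: %v", err)
	}
	return n, err
}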
When an idle connection is torn down by the beat, Error reading from connection: read tcp y.y.y.y:54668->x.x.x.x:443: use of closed network connection is logged by github.com/elastic/elastic-agent-libs/transport.(*loggingConn).Read
Hello @strawgate, can you list the steps to reproduce this error? This is required to test the fix, thanks.
I believe all you need to do is start a relatively new metricbeat (I was using main) with default settings (enable the system module) and debug logging enabled (logging.level: debug), and this will get logged continuously.
When I first looked at this I wondered if we could catch this by not logging when we get an error but also a non-zero n indicating we actually read data. Yes, the connection closed, but we also read something from it, so the fact that it closed isn't interesting yet.
I tried this locally, but it still throws an error. The buffer may or may not be empty when the keep-alive session attempts to read; it is not guaranteed.
I think we should find the places where those "keep alive requests" happen and send them without the logger connection wrapper at all. We should not care about their success generally.
The logger wrapper is attached to connections made to ES: https://github.com/khushijain21/beats/blob/main/libbeat/esleg/eslegclient/connection.go#L164. I think we may risk losing "all" connection details if we remove it.
Any suggestion is appreciated.
Just to summarize my thoughts on this issue:
When an idle connection is torn down by the beat,
Error reading from connection: read tcp y.y.y.y:54668->x.x.x.x:443: use of closed network connection
is logged by github.com/elastic/elastic-agent-libs/transport.(*loggingConn).Read
Here: https://github.com/elastic/elastic-agent-libs/blob/01275338dc278335b4a8c0f23055014b4c0702dc/transport/logging.go#L48-L54
This appears to be called during the keep alive session maintenance every time data is read from the buffer? The last data in the buffer is the 200 OK from the server so we are receiving the full response.
When the connection is torn down, this receives an errNetClosing instead of an EOF (likely because we are the ones closing the connection and not the server). This code path only checks for EOF, and thus we get the
use of closed network connection
error debug-logged during the read. This error message "pollutes" our debug logs and leads customers to believe there is a network issue.
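For what it's worth, the EOF vs. errNetClosing distinction can be expressed directly: since Go 1.16 the "use of closed network connection" error wraps net.ErrClosed, so the wrapper could check for it with errors.Is. This is a sketch of that distinction only, not an agreed-upon fix, given the earlier concern about hiding legitimately interesting connection errors:

// Sketch: treat a read on a connection we closed ourselves (net.ErrClosed,
// i.e. "use of closed network connection") like io.EOF for logging purposes,
// while still debug-logging genuinely unexpected read errors.
func (l *loggingConn) Read(b []byte) (int, error) {
	n, err := l.Conn.Read(b)
	if err != nil && !errors.Is(err, io.EOF) && !errors.Is(err, net.ErrClosed) {
		l.logger.Debugf("Error reading from connection: %v", err)
	}
	return n, err
}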