elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Unnecessary debug/error logged during idle connection teardown #40824

Open strawgate opened 1 month ago

strawgate commented 1 month ago

When an idle connection is torn down by the beat, `Error reading from connection: read tcp y.y.y.y:54668->x.x.x.x:443: use of closed network connection` is logged by `github.com/elastic/elastic-agent-libs/transport.(*loggingConn).Read`.

Here: https://github.com/elastic/elastic-agent-libs/blob/01275338dc278335b4a8c0f23055014b4c0702dc/transport/logging.go#L48-L54

This appears to be called during the keep alive session maintenance every time data is read from the buffer? The last data in the buffer is the 200 OK from the server so we are receiving the full response.

When the connection is torn down, this receives an `errNetClosing` instead of an `EOF` (likely because we are the ones closing the connection, not the server). This code path only checks for `EOF`, so the `use of closed network connection` error gets debug-logged during the read.
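
For reference, a minimal sketch of this distinction in plain Go (`net.ErrClosed`, available since Go 1.16, is the sentinel behind the `use of closed network connection` message; `classifyReadErr` is a hypothetical helper, not Beats code):

```go
package main

import (
    "errors"
    "io"
    "log"
    "net"
)

// classifyReadErr shows the three cases a connection read can end in.
func classifyReadErr(err error) string {
    switch {
    case err == nil:
        return "ok"
    case errors.Is(err, io.EOF):
        return "peer closed the connection cleanly"
    case errors.Is(err, net.ErrClosed):
        return "we closed the connection ourselves"
    default:
        return "genuine read error worth logging"
    }
}

func main() {
    log.Println(classifyReadErr(net.ErrClosed)) // "we closed the connection ourselves"
}
```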


This error message "pollutes" our debug logs and leads customers to believe there is a network issue.

elasticmachine commented 1 month ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

rdner commented 2 weeks ago

> This appears to be called during the keep alive session maintenance every time data is read from the buffer? The last data in the buffer is the 200 OK from the server so we are receiving the full response.

This error does not happen only in this situation; it can happen with a legitimately interrupted connection on either side. I don't think it's a good idea to fully conceal this error: it would make it difficult to find some networking issues (a proxy?) on the customer's side.

So, we cannot simply change https://github.com/elastic/elastic-agent-libs/blob/01275338dc278335b4a8c0f23055014b4c0702dc/transport/logging.go#L48-L54 to ignore the error. If we want to address this issue properly, we need to find a way to handle this error in those "keep alive session maintenance" requests instead. The error is simply returned from the `Read` function, so it can be handled a level higher.

rdner commented 2 weeks ago

I think we should find the places where those "keep alive requests" happen and send them without the logger connection wrapper at all. We should not care about their success generally.

For example, we could add a new function `Connection() net.Conn` to the logging wrapper that returns the underlying connection, which could then be used without logging. That would need some ugly type casting though, so I'm not sure.
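
A minimal sketch of that idea, assuming the wrapper embeds `net.Conn` the way `loggingConn` in `transport/logging.go` appears to (the `Connection` method and the `rawConn` helper are hypothetical):

```go
package transport

import "net"

// Simplified stand-in for the wrapper in elastic-agent-libs/transport/logging.go
// (the real type also carries a logger).
type loggingConn struct {
    net.Conn
}

// Connection exposes the underlying net.Conn so callers that do not care
// about errors (e.g. keep-alive traffic) can bypass the logging wrapper.
func (l *loggingConn) Connection() net.Conn {
    return l.Conn
}

// rawConn shows the caller-side type assertion rdner calls "ugly".
func rawConn(c net.Conn) net.Conn {
    if lc, ok := c.(*loggingConn); ok {
        return lc.Connection()
    }
    return c
}
```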

cmacknz commented 2 weeks ago

> I think we should find the places where those "keep alive requests" happen and send them without the logger connection wrapper at all. We should not care about their success generally.

The keepalives are most probably on the connections to the Elasticsearch _bulk endpoint in the Beats. That is pretty much the only network connection they regularly make.

The idle connection timeout is managed within the net/http implementation: https://cs.opensource.google/go/go/+/master:src/net/http/transport.go;drc=77e42fdeaf98d241dc09f4eb92e09225aea8f4c3;l=1090
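
For context, a minimal sketch of where that timeout is configured on the standard-library side (the value shown is `net/http`'s `DefaultTransport` setting, not necessarily what Beats uses):

```go
package main

import (
    "net/http"
    "time"
)

func main() {
    // When IdleConnTimeout fires, net/http closes the pooled connection
    // itself; the background read blocked on that connection then fails
    // with net.ErrClosed, which is what loggingConn.Read ends up logging.
    transport := &http.Transport{
        IdleConnTimeout: 90 * time.Second, // the DefaultTransport value
    }
    client := &http.Client{Transport: transport}
    _ = client
}
```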

rdner commented 2 weeks ago

> This appears to be called during the keep alive session maintenance every time data is read from the buffer? The last data in the buffer is the 200 OK from the server so we are receiving the full response.

@strawgate sounds like you spent some time debugging it, could you give us some pointers? Where did you see this HTTP request and how did you read the buffer?

strawgate commented 2 weeks ago

If you set a breakpoint at https://github.com/elastic/elastic-agent-libs/blob/4babafd5ed1e5079acf74212ed3da01740b22de7/transport/logging.go#L50, wait for it to hit, and step out once, you'll land in `net/http/transport.go` at `Read`, where you can read the buffer from the VS Code debugger under `p`.

Example: "HTTP/1.1 200 OK\r\nContent-Encoding: gzip\r\nContent-Type: application/json\r\nX-Cloud-Request-Id: hx4DEVDVTJettHh_3xDChw\r\nX-Elastic-Product: Elasticsearch\r\nX-Found-Handling-Cluster: 856e0d3694624bf18dffce04dd4505f0\r\nX-Found-Handling-Instance: instance-0000000002\r\nDate: Thu, 24 Oct 2024 14:24:19 GMT\r\nContent-Length: 124\r\n\r\n\x1f\x8b\b\x00\x00\x00\x00\x00\x00\x00\xec\x971\n\x80@\f\x04\xff\x92\xfa\n\xb5\xbc\xaf\x88\xc5!\x11\x04EHbu\xe4\xef\xde'\xec\xa6L҅dw\xa7\x8b\x9a=\xe6R\x8fv\xb9\x169C\xefQ\xad]v\xd3\x16*\xb5\x8bG\x8bw4\x97i\xce,L\xd8\x01w\xc0/\xa0\ah\"\xbe\x807\x92\x0f\xc8H\xe4D\xb22\xbc\x00\x19AF\x90\x11d\xf4\v\x19m\xf9\x01\x00\x00\xff\xff\x03\x00PY\xab8\xac\x11\x00\x00\xcd1f-\x0e\f\x7f\x1e\xd8\xd1\x04\xc5\xfa\xbaTNXoN6\xac\xc1\t\xffvˋ\xb9\xae\xf0\xd7\xc7\x04\xaa\xfe\x12r\xeb(\xf3\x1e\xd8\xf7\xf4\xc7\xe7\x04\xfadt\xd2*\xa1Fǎϩd\xc15\x92k\x94J\xd4?\x8d\x13...+3584 more"

cmacknz commented 1 week ago

When I first looked at this I wondered if we could catch this by not logging when we get an error but also a non-zero `n` indicating we actually read data. Yes, the connection closed, but we also read something from it, so the fact that it closed isn't interesting yet.

https://github.com/elastic/elastic-agent-libs/blob/4babafd5ed1e5079acf74212ed3da01740b22de7/transport/logging.go#L50-L52

```go
func (l *loggingConn) Read(b []byte) (int, error) {
    n, err := l.Conn.Read(b)
    if err != nil && !errors.Is(err, io.EOF) {
        l.logger.Debugf("Error reading from connection: %v", err)
    }
    return n, err
}
```
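
A sketch of that suggestion (hypothetical, and as noted below it did not hold up in testing): only log when the failed read also returned no data.

```go
func (l *loggingConn) Read(b []byte) (int, error) {
    n, err := l.Conn.Read(b)
    // Skip the debug log when data arrived alongside the error: if the
    // response was still read, the connection closing is not interesting.
    if err != nil && n == 0 && !errors.Is(err, io.EOF) {
        l.logger.Debugf("Error reading from connection: %v", err)
    }
    return n, err
}
```
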
khushijain21 commented 3 days ago

> When an idle connection is torn down by the beat, `Error reading from connection: read tcp y.y.y.y:54668->x.x.x.x:443: use of closed network connection` is logged by `github.com/elastic/elastic-agent-libs/transport.(*loggingConn).Read`.

Hello @strawgate, can you list the steps to reproduce this error? This is needed to test the fix, thanks.

strawgate commented 2 days ago

I believe all you need to do is start a relatively new Metricbeat (I was using main) with default settings (enable the system module) and debug logging enabled (`logging.level: debug`), and this will get logged continuously.

khushijain21 commented 2 days ago

> When I first looked at this I wondered if we could catch this by not logging when we get an error but also a non-zero `n` indicating we actually read data. Yes, the connection closed, but we also read something from it, so the fact that it closed isn't interesting yet.

I tried this locally, but it still throws the error. The buffer may or may not be empty when the keep-alive session attempts to read; it is not guaranteed.

> I think we should find the places where those "keep alive requests" happen and send them without the logger connection wrapper at all. We should not care about their success generally.

The logger wrapper is attached to connections made to ES: https://github.com/khushijain21/beats/blob/main/libbeat/esleg/eslegclient/connection.go#L164. I think we risk losing "all" connection details if we remove it.
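
Roughly what that attachment looks like, as a simplified sketch: the `NetDialer` and `LoggingDialer` helpers are assumed from `elastic-agent-libs/transport` (the latter is defined alongside `loggingConn` in `logging.go`), and the real call site in `eslegclient/connection.go` may differ:

```go
package example

import (
    "time"

    "github.com/elastic/elastic-agent-libs/logp"
    "github.com/elastic/elastic-agent-libs/transport"
)

// newDialer sketches how the ES client ends up with logging connections:
// every connection dialed through the returned Dialer is wrapped, so
// dropping LoggingDialer here would silence all connection-error logging.
func newDialer(timeout time.Duration, logger *logp.Logger) transport.Dialer {
    dialer := transport.NetDialer(timeout)
    return transport.LoggingDialer(dialer, logger)
}
```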

Any suggestion is appreciated.

rdner commented 2 minutes ago

Just to summarize my thoughts on this issue:

  1. We should definitely keep the log wrapper; we should not consider removing it.
  2. The log message must stay because it's generic: if the connection is closed because of a server crash, we still need to log it.
  3. The problem here is not the error message but that the connection now has an idle timeout that closes it, and something then tries to use the closed connection. See where it was introduced for more details; this is what we need to find a fix for.