Open DaveCTurner opened 3 years ago
Pinging @elastic/es-distributed (Team:Distributed)
I think we hit exactly this problem in https://gradle-enterprise.elastic.co/s/3lq4o3iswxumo/console-log?task=:server:test which seems like an odd coincidence since this isn't a new bug and yet I've not seen it result in test failures ever before.
Note that #86315 means we no longer need to protect against connection attempts while closing. Should we just remove this (buggy) protection mechanism instead of trying to fix it?
To answer the question raised in this comment I started to look at whether we could block on the
closeLock
and I think the answer is no but also I now have more questions about concurrency during closing. Can we definitely not add toacceptedChannels
after the server channel is closed? Also this comment:https://github.com/elastic/elasticsearch/blob/e6fd459a6e275efa663af0b711c87da3b18d8007/server/src/main/java/org/elasticsearch/transport/TcpTransport.java#L282
Does it really do that? It did in the past when we blocked the thread while the connection completes, but today I think it's possible for us to open a connection and then leak it by shutting the event loop down. It'll be ok for connections that are happening via
ClusterConnectionManager#connectToNode
since we closed all the connection managers down first, but we bypass that mechanism for discovery and sniffing.Not sure there's really a bug here, I haven't tried to reproduce it or anything. If there is a bug then it's benign in production anyway since we're shutting down so everything will be cleaned up soon, but it's still untidy and might cause test issues.