Closed bwhaley closed 7 years ago
Note that PR #899 may fix this as well, though it seems inefficient to close & reopen a socket for every report.
It may well be that closed socket detection is happening wrong, but it sounds like the real problem is Java's weird DNS caching behavior... see http://docs.oracle.com/javase/7/docs/technotes/guides/net/properties.html for relevant system properties, or http://www.rgagnon.com/javadetails/java-0445.html or http://stackoverflow.com/questions/1256556/any-way-to-make-java-honor-the-dns-caching-timeout-ttl for explanation. (from a quick googling)
I'm no stranger to Java DNS caching issues; that could very well be at fault here. Are you suggesting I should set networkaddress.cache.ttl to match the ELB's TTL (which is 60s)?
Will try to pass -Dsun.net.inetaddr.ttl=60 in the app's startup options.
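For reference, the two TTL settings discussed above are set differently: networkaddress.cache.ttl is a Java security property that must be set in code (or in java.security) before the first hostname lookup, while sun.net.inetaddr.ttl is an implementation-specific JVM flag read at startup. A minimal sketch of the in-code route:

```java
import java.security.Security;

public class DnsTtlConfig {
    public static void main(String[] args) {
        // Must run before the first hostname lookup; entries already cached
        // under the old policy are not re-evaluated.
        Security.setProperty("networkaddress.cache.ttl", "60");

        // Equivalent JVM startup flag (Sun/Oracle-specific):
        //   -Dsun.net.inetaddr.ttl=60
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```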
Yeah, I'd try setting the TTL. But that doesn't mean you're wrong about socket detection.
Can confirm the socket is not closing, so there's no new connection and hence no DNS lookup, though that setting is possibly also needed. Here's how I tested:
collectd & java clients <=> ELB <=> 2xCarbon-relay
When an ELB's connections to its listeners are reset, it also closes connections to its clients. So I can induce a CLOSE_WAIT by restarting the carbon-relay processes.
First, collectd and java are both in CLOSE_WAIT following the carbon-relay restart:
$ sudo netstat -pan | grep 2003 | awk '{print $6,$7}'
CLOSE_WAIT 8077/collectd
CLOSE_WAIT 1227/java
After some time, collectd kills the CLOSE_WAIT, does a DNS lookup (witnessed via sudo tcpdump -vvv -s 0 -l -n port 53), and has a new connection. Java remains in CLOSE_WAIT and has a bad time.
$ sudo netstat -pan | grep 2003 | awk '{print $6,$7}'
CLOSE_WAIT 1227/java
ESTABLISHED 8077/collectd
Fun fact: I debugged the same problem in collectd/collectd#1104.
Unfortunately, the graphite protocol is one-way -- writes only, with nothing for the client to read, as far as I know.
But shouldn't the write() calls at https://github.com/dropwizard/metrics/blob/068821a9798ddf81c4fd457b8fe1df09a0a99e22/metrics-graphite/src/main/java/io/dropwizard/metrics/graphite/Graphite.java#L135 be throwing IOExceptions when the socket is in CLOSE_WAIT?
The socket's output stream is wrapped with a BufferedWriter, so that exception isn't thrown until the call to flush().
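That buffering behavior can be demonstrated without a real socket. The DeadStream class below is a hypothetical stand-in for a socket whose peer has gone away (every physical write fails); a small write lands in the BufferedWriter's buffer without touching the stream, and the IOException only surfaces at flush():

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;

public class DeferredFlushDemo {
    // Stand-in for a socket stream whose peer has closed the connection.
    static class DeadStream extends OutputStream {
        @Override public void write(int b) throws IOException {
            throw new IOException("connection reset");
        }
    }

    public static boolean exceptionDeferredUntilFlush() {
        BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new DeadStream()), 8192);
        try {
            out.write("some.metric 1 1234567890\n"); // fits in the buffer: no I/O yet
        } catch (IOException e) {
            return false; // would mean write() itself failed
        }
        try {
            out.flush(); // the buffer is pushed to the dead stream here
            return false;
        } catch (IOException e) {
            return true; // the failure only surfaces now
        }
    }

    public static void main(String[] args) {
        System.out.println(exceptionDeferredUntilFlush()); // prints true
    }
}
```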
Actually, looking at that stack trace and the first comment, it looks like the problem is that it is catching and logging that exception but not triggering a reconnect when that occurs. Is that correct? Potentially catching a SocketException and closing the socket in send() would be an OK workaround.
Yes, correct, the reconnect is not being triggered.
+1 to closing the socket in send()
Still, java.net.Socket's isConnected() only confirms that the socket has been connected at some point. It is not a connection test, which is what report() is looking for before it starts writing (here). Seems like Graphite's isConnected() should determine what the state of the connection is, not only the socket.
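The point about isConnected() can be reproduced on loopback: after the peer sends its FIN (the CLOSE_WAIT situation from the netstat output above), both isConnected() and !isClosed() still report the socket as healthy. A minimal sketch:

```java
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class IsConnectedDemo {
    // Returns true if isConnected()/isClosed() still look healthy after
    // the peer has closed the connection, i.e. in CLOSE_WAIT.
    public static boolean staleSocketLooksConnected() throws Exception {
        try (ServerSocket server =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
            server.accept().close();  // peer sends FIN; client enters CLOSE_WAIT
            Thread.sleep(100);        // let the FIN arrive
            boolean looksConnected = client.isConnected() && !client.isClosed();
            client.close();
            return looksConnected;    // true: neither check noticed the close
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(staleSocketLooksConnected());
    }
}
```

isConnected() is defined to report whether connect() ever succeeded, and it never reverts to false, so it cannot serve as a liveness check.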
You shouldn't need to close the socket in send(). Even if Graphite's isConnected() is unreliable and it causes a failure in either Graphite's flush() or send(), that will result in Graphite's close() being called (assuming an IOException is thrown). You lose the current tick, but the next time the scheduler comes around the Graphite socket will be null and it will try to reconnect.
So the current behavior is already to reconnect when a SocketException is caught, although not immediately.
If something other than an IOException or RuntimeException is thrown, the scheduler will stop without logging that it did. On the current HEAD of master the catch in GraphiteReporter's report() was changed to Throwable, forcing a Graphite close() no matter what.
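The recovery flow described above can be sketched as follows. GraphiteLike is a hypothetical stand-in for the real Graphite client, reduced to the three calls that matter here; the shape of the catch block mirrors the reconnect-by-discarding-the-socket behavior, not the actual reporter code:

```java
import java.io.IOException;

public class ReporterSketch {
    // Hypothetical stand-in for the Graphite client: connect() is lazy,
    // close() discards the socket so the next tick reconnects.
    public interface GraphiteLike {
        void connect() throws IOException;
        void send(String name, String value, long ts) throws IOException;
        void close() throws IOException;
    }

    public static void report(GraphiteLike graphite) {
        try {
            graphite.connect();
            graphite.send("jvm.heap.used", "42", System.currentTimeMillis() / 1000);
        } catch (Throwable t) {
            // Catching Throwable (as on current master) guarantees close()
            // runs and the scheduler thread survives; the socket is dropped
            // and rebuilt on the next scheduled tick.
            try {
                graphite.close();
            } catch (IOException ignored) {
            }
        }
    }
}
```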
This one cropped up again recently when the ELB IP addresses rotated. I observed the metrics reporter holding on to the CLOSE_WAIT socket indefinitely. So though Miguel's analysis seems correct and the socket should be null (e.g. via Graphite's close()), it is in fact still in CLOSE_WAIT and the reporter continues to send data over the dead socket.
$ sudo netstat -anp | grep 2003
tcp 0 0 www:45896 yyy:2003 ESTABLISHED 4566/collectd
tcp 1 0 xxx:45759 zzz:2003 CLOSE_WAIT 4411/java
$ sudo tcpdump src port 45759 and dst port 2003
[Lots of traffic from java over the CLOSE_WAIT socket]
Was there any solution to this? We are running into the same issue and decided to go with a static IP on the relay as opposed to going through the ILB.
I guess #899 should fix that. Now the client opens a new socket every time it sends a new portion of metrics and closes the socket after it's done. The fix was merged into the 3.1 branch.
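That connect-write-close pattern can be sketched as below (a simplified illustration, not the actual #899 code; sendBatch and its signature are hypothetical). Because a fresh socket is opened per batch, the hostname is re-resolved on every connect, so an ELB IP rotation is picked up within the JVM's DNS cache TTL rather than never:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.net.InetAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class OneShotSender {
    // Open a fresh socket per batch of plaintext-protocol lines.
    // try-with-resources flushes the writer and sends FIN on exit,
    // so stale CLOSE_WAIT sockets cannot accumulate between reports.
    public static void sendBatch(String host, int port, Iterable<String> lines)
            throws IOException {
        try (Socket socket = new Socket(InetAddress.getByName(host), port);
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                     socket.getOutputStream(), StandardCharsets.UTF_8))) {
            for (String line : lines) {
                out.write(line);
                out.write('\n');
            }
        }
    }
}
```

The trade-off noted earlier in the thread still applies: a TCP handshake (and possibly a DNS lookup) per report interval instead of one long-lived connection.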
I have an environment with two Graphite carbon-relays behind an ELB. The metrics reporter connects to the ELB which proxies to the relays. This mostly works except that the IP address of the ELB changes periodically. When this happens, the ELB sends a FIN and the connection should be closed. At that point the client needs to do another DNS lookup to find the new IP address. This works fine with other clients (for example, collectd) but not with v3.1.2 of the metrics reporter. Connections remain in CLOSE_WAIT.
To wit:
This is several hours after the IP address has changed. Note that collectd has picked up the new ELB IP address but the java process hasn't. The related stack trace:
I think the problem lies in isConnected() in Graphite.java. socket.isConnected() and socket.isClosed() are not sufficient for checking whether the remote end has closed the connection. Instead, it needs a read(), which will return -1 if the connection is no longer open.
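The read()-based check suggested above could look like the sketch below (isPeerOpen is a hypothetical helper, not part of the library). It relies on the graphite plaintext protocol being write-only: since the peer never sends data, a read() can only return -1 (FIN received) or time out (connection still open), so a short SO_TIMEOUT keeps a healthy, silent connection from blocking the reporter:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class LivenessCheck {
    // Liveness probe for a write-only protocol. Assumes the peer never
    // sends application data on this connection.
    public static boolean isPeerOpen(Socket socket) throws IOException {
        int originalTimeout = socket.getSoTimeout();
        socket.setSoTimeout(50); // don't block a healthy, quiet connection
        try {
            InputStream in = socket.getInputStream();
            return in.read() != -1;   // -1 means the remote end closed
        } catch (SocketTimeoutException e) {
            return true;              // nothing to read: still connected
        } finally {
            socket.setSoTimeout(originalTimeout);
        }
    }
}
```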