dropwizard / metrics

:chart_with_upwards_trend: Capturing JVM- and application-level metrics. So you know what's going on.
https://metrics.dropwizard.io
Apache License 2.0

Graphite reporter does not detect connections in CLOSE_WAIT state #917

Closed: bwhaley closed this issue 7 years ago

bwhaley commented 8 years ago

I have an environment with two Graphite carbon-relays behind an ELB. The metrics reporter connects to the ELB which proxies to the relays. This mostly works except that the IP address of the ELB changes periodically. When this happens, the ELB sends a FIN and the connection should be closed. At that point the client needs to do another DNS lookup to find the new IP address. This works fine with other clients (for example, collectd) but not with v3.1.2 of the metrics reporter. Connections remain in CLOSE_WAIT.

To wit:

$ sudo netstat -pan | grep 2003 | awk '{print $6,$7}'
CLOSE_WAIT  14884/java
ESTABLISHED 8077/collectd

This is several hours after the IP address has changed. Note that collectd has picked up the new ELB IP address but the java process hasn't. The related stack trace:

[WARN ] 2016-02-11 20:08:02.982 [metrics-graphite-reporter-1-thread-1] GraphiteReporter - Unable to report to Graphite
java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method) ~[?:1.8.0_51]
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) ~[?:1.8.0_51]
        at java.net.SocketOutputStream.write(SocketOutputStream.java:153) ~[?:1.8.0_51]
        at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) ~[?:1.8.0_51]
        at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291) ~[?:1.8.0_51]
        at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295) ~[?:1.8.0_51]
        at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141) ~[?:1.8.0_51]
        at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229) ~[?:1.8.0_51]
        at java.io.BufferedWriter.flush(BufferedWriter.java:254) ~[?:1.8.0_51]
        at com.codahale.metrics.graphite.Graphite.flush(Graphite.java:151) ~[metrics-graphite-3.1.2.jar!/:3.1.2]
        at com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:190) [metrics-graphite-3.1.2.jar!/:3.1.2]
        at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162) [metrics-core-3.1.2.jar!/:3.1.2]
        at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:117) [metrics-core-3.1.2.jar!/:3.1.2]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_51]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_51]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_51]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_51]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_51]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_51]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_51]

I think the problem lies in isConnected() in Graphite.java.

    public boolean isConnected() {
        return socket != null && socket.isConnected() && !socket.isClosed();
    }

socket.isConnected() and socket.isClosed() are not sufficient for checking whether the remote end has closed the connection. Detecting that requires a read(), which returns -1 once the remote end has closed its side.
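
A minimal sketch of the read-based check suggested here, assuming the remote end never sends data (true for the write-only Graphite plaintext protocol); remoteEndOpen is a hypothetical helper, not part of Graphite.java:

    import java.io.IOException;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    final class SocketProbe {
        // Probe the socket with a short read. read() returning -1 means the peer
        // has sent FIN and closed its end (our side sits in CLOSE_WAIT); a
        // SocketTimeoutException means nothing was readable, which we treat as
        // "still connected". Only safe for protocols where the server never
        // sends data, such as the Graphite plaintext protocol.
        static boolean remoteEndOpen(Socket socket) {
            if (socket == null || socket.isClosed() || !socket.isConnected()) {
                return false;
            }
            try {
                int originalTimeout = socket.getSoTimeout();
                socket.setSoTimeout(50); // brief probe so the reporter thread doesn't stall
                try {
                    return socket.getInputStream().read() != -1;
                } catch (SocketTimeoutException stillOpen) {
                    return true;
                } finally {
                    socket.setSoTimeout(originalTimeout);
                }
            } catch (IOException e) {
                return false;
            }
        }
    }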

bwhaley commented 8 years ago

Note that PR #899 may fix this as well, though it seems inefficient to close & reopen a socket for every report.

marshallpierce commented 8 years ago

It may well be that closed socket detection is happening wrong, but it sounds like the real problem is Java's weird DNS caching behavior... see http://docs.oracle.com/javase/7/docs/technotes/guides/net/properties.html for relevant system properties, or http://www.rgagnon.com/javadetails/java-0445.html or http://stackoverflow.com/questions/1256556/any-way-to-make-java-honor-the-dns-caching-timeout-ttl for explanation. (from a quick googling)

bwhaley commented 8 years ago

I'm no stranger to Java DNS caching issues, that could very well be at fault here. Are you suggesting I should set networkaddress.cache.ttl to match the ELB's TTL (which is 60s)?

bwhaley commented 8 years ago

Will try to pass -Dsun.net.inetaddr.ttl=60 to the app's startup options.
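
For reference, sun.net.inetaddr.ttl is the implementation-specific system-property spelling; the standard knob is the networkaddress.cache.ttl security property, which can also be set programmatically, as in this sketch (values in seconds):

    import java.security.Security;

    public class DnsCacheTtl {
        public static void main(String[] args) {
            // Equivalent in effect to -Dsun.net.inetaddr.ttl=60: cache successful
            // DNS lookups for at most 60 seconds, so a rotated ELB IP is picked up
            // on the next resolution. Must run before the JVM's first name lookup.
            Security.setProperty("networkaddress.cache.ttl", "60");
        }
    }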

marshallpierce commented 8 years ago

Yeah, I'd try setting the TTL. But that doesn't mean you're wrong about socket detection.

bwhaley commented 8 years ago

Can confirm the socket is not closing, so no new connection and hence no DNS lookup, though that setting is possibly also needed. Here's how I tested -

collectd & java clients <=> ELB <=> 2xCarbon-relay

When an ELB's connections to its listeners are reset, it also closes connections to its clients. So I can induce a CLOSE_WAIT by restarting the carbon-relay processes.

First, collectd and java both in CLOSE_WAIT following the carbon-relay restart:

$ sudo netstat -pan | grep 2003  | awk '{print $6,$7}'
CLOSE_WAIT 8077/collectd
CLOSE_WAIT 1227/java

After some time, collectd kills the CLOSE_WAIT, does a DNS lookup (witnessed via sudo tcpdump -vvv -s 0 -l -n port 53) and has a new connection. Java remains in CLOSE_WAIT and has a bad time.

$ sudo netstat -pan | grep 2003  | awk '{print $6,$7}'
CLOSE_WAIT 1227/java
ESTABLISHED 8077/collectd

Fun fact, I debugged the same problem in collectd/collectd#1104.

jmason commented 8 years ago

Unfortunately, the graphite protocol is one-way -- writes only, with nothing for the client to read, as far as I know.

But shouldn't the write() calls at https://github.com/dropwizard/metrics/blob/068821a9798ddf81c4fd457b8fe1df09a0a99e22/metrics-graphite/src/main/java/io/dropwizard/metrics/graphite/Graphite.java#L135 be throwing IOExceptions when the socket is in CLOSE_WAIT?

ryantenney commented 8 years ago

The socket's output stream is wrapped with a BufferedWriter, so that exception isn't thrown until the call to flush().
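
That matches the layering in the stack trace above; roughly, the writer is stacked like this (a sketch based on the trace, not the exact library code; the hostname is a placeholder):

    import java.io.BufferedWriter;
    import java.io.OutputStreamWriter;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class BufferedSocketWrite {
        public static void main(String[] args) throws Exception {
            Socket socket = new Socket();
            socket.connect(new InetSocketAddress("graphite.example.com", 2003));
            BufferedWriter writer = new BufferedWriter(
                    new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8));

            // write() only fills the in-memory buffer, so a socket stuck in
            // CLOSE_WAIT doesn't complain here. The "Broken pipe" SocketException
            // from the stack trace above only surfaces when flush() actually
            // pushes the buffered bytes onto the dead connection.
            writer.write("some.metric.name 42 1455220082\n");
            writer.flush();
        }
    }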

jmason commented 8 years ago

Actually, looking at that stack trace and the first comment, it looks like the problem is that the reporter is catching and logging that exception but not triggering a reconnect when it occurs. Is that correct? Potentially catching a SocketException and closing the socket in send() would be an OK workaround.

timothyshelton commented 8 years ago

Yes, correct, the reconnect is not being triggered.

bwhaley commented 8 years ago

+1 to closing the socket in send()

Still, java.net.Socket isConnected() only confirms that the socket has been connected at some point. It is not a connection test, which is what report() is looking for before it starts writing (here). It seems like Graphite isConnected() should reflect the state of the connection, not just the socket.

miguelnmiranda commented 8 years ago

You shouldn't need to close the socket on send(). Even if Graphite isConnected() is unreliable and causes a failure in either Graphite flush() or Graphite send(), that failure results in Graphite close() being called (assuming an IOException is thrown). You lose the current tick, but the next time the scheduler comes around the Graphite socket will be null and it will try to reconnect. So the current behavior is already to reconnect when a SocketException is caught, just not immediately.

If something other than an IOException or RuntimeException is thrown, the scheduler stops without logging that it did. On the current HEAD of master, the catch in GraphiteReporter report() was changed to Throwable, forcing a Graphite close() no matter what.
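
Paraphrasing that lazy-reconnect cycle (a sketch of the control flow being described, not the actual GraphiteReporter source):

    import java.io.IOException;

    import com.codahale.metrics.graphite.Graphite;

    class LazyReconnectSketch {
        // One scheduled tick: reconnect if needed, report, and on any IOException
        // tear the socket down so the *next* tick starts with a fresh connect(),
        // giving the JVM a chance to re-resolve the hostname. The failing tick's
        // metrics are lost.
        void reportOnce(Graphite graphite) {
            try {
                if (!graphite.isConnected()) {
                    graphite.connect();
                }
                // ... graphite.send(name, value, timestamp) for each metric ...
                graphite.flush();
            } catch (IOException e) {
                try {
                    graphite.close();
                } catch (IOException ignored) {
                    // nothing more we can do this tick
                }
            }
        }
    }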

bwhaley commented 8 years ago

This one cropped up again recently when the ELB IP addresses rotated. I observed the metrics reporter holding on to the CLOSE_WAIT socket indefinitely. So although Miguel's analysis seems correct and the socket should be null (i.e. via Graphite close()), it is in fact still in CLOSE_WAIT and the reporter continues to send data over the dead socket.

$ sudo netstat -anp | grep 2003
tcp        0      0 www:45896         yyy:2003           ESTABLISHED 4566/collectd
tcp        1      0 xxx:45759           zzz:2003           CLOSE_WAIT  4411/java
$ sudo tcpdump src port 45759 and dst port 2003
[Lots of traffic from java over the CLOSE_WAIT socket]

yogeshnath commented 8 years ago

Was there any solution to this? We are running into the same issue and decided to go with a static IP on the relay as opposed to going through the ILB.

arteam commented 7 years ago

I guess #899 should fix that. Now the client opens a new socket every time it sends a batch of metrics and closes the socket after it's done. The fix was merged into the 3.1 branch.
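
Roughly, that per-report pattern looks like the following (a sketch of the behavior being described, not the merged code from #899):

    import java.io.IOException;

    import com.codahale.metrics.graphite.Graphite;

    class ConnectPerReportSketch {
        // Each reporting interval uses a brand-new socket: connect() gives the JVM
        // a chance to re-resolve the hostname (subject to the DNS cache TTL
        // discussed earlier), and close() in finally ensures no long-lived socket
        // is left behind to sit in CLOSE_WAIT.
        void reportOnce(Graphite graphite) throws IOException {
            try {
                graphite.connect();
                // ... graphite.send(name, value, timestamp) for each metric ...
                graphite.flush();
            } finally {
                graphite.close();
            }
        }
    }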