Ecwid / consul-api

Java client for Consul HTTP API
Apache License 2.0
417 stars 175 forks source link

Consul service not reachable. com.ecwid.consul.transport.TransportException: java.net.SocketException: Connection reset #59

Closed jhovell closed 8 years ago

jhovell commented 8 years ago

I am trying to diagnose a problem that I don't know if it is a bug in Consul API (1.1.7) or the way I am running Consul (v 0.6). I get the following stack repeatedly in my logs. I'm using the Consul API in conjunction with a Netflix Archaius wrapper but looking at the code I'm not sure it is a bug there. Seems to be an issue with the way Consul API is trying to create a catalog client.

2016-05-04T19:04:29,435 ERROR [net.researchgate.archaius.ConsulConfigurationSource] [pollingConfigurationSource] Consul service not reachable. i-699d84ae: com.ecwid.consul.transport.TransportException: java.net.SocketException: Connection reset
    at com.ecwid.consul.transport.AbstractHttpTransport.executeRequest(AbstractHttpTransport.java:80) [consul-api-1.1.7.jar:]
    at com.ecwid.consul.transport.AbstractHttpTransport.makeGetRequest(AbstractHttpTransport.java:39) [consul-api-1.1.7.jar:]
    at com.ecwid.consul.v1.ConsulRawClient.makeGetRequest(ConsulRawClient.java:81) [consul-api-1.1.7.jar:]
    at com.ecwid.consul.v1.catalog.CatalogConsulClient.getCatalogService(CatalogConsulClient.java:129) [consul-api-1.1.7.jar:]
    at com.ecwid.consul.v1.catalog.CatalogConsulClient.getCatalogService(CatalogConsulClient.java:122) [consul-api-1.1.7.jar:]
    at com.ecwid.consul.v1.ConsulClient.getCatalogService(ConsulClient.java:284) [consul-api-1.1.7.jar:]
    at net.researchgate.archaius.ConsulConfigurationSource.setClient(ConsulConfigurationSource.java:82) [archaius-consul-0.1.2.jar:]
    at net.researchgate.archaius.ConsulConfigurationSource.poll(ConsulConfigurationSource.java:66) [archaius-consul-0.1.2.jar:]
    at com.netflix.config.AbstractPollingScheduler$1.run(AbstractPollingScheduler.java:163) [archaius-core-0.7.3.jar:0.7.3]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_72-internal]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_72-internal]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_72-internal]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [rt.jar:1.8.0_72-internal]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_72-internal]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_72-internal]
    at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_72-internal]
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:209) [rt.jar:1.8.0_72-internal]
    at java.net.SocketInputStream.read(SocketInputStream.java:141) [rt.jar:1.8.0_72-internal]
    at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160) [httpcore-4.3.3.jar:4.3.3]
    at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84) [httpcore-4.3.3.jar:4.3.3]
    at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273) [httpcore-4.3.3.jar:4.3.3]
    at org.apache.http.impl.conn.LoggingSessionInputBuffer.readLine(LoggingSessionInputBuffer.java:116) [httpclient-4.3.6.jar:4.3.6]
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140) [httpclient-4.3.6.jar:4.3.6]
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57) [httpclient-4.3.6.jar:4.3.6]
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260) [httpcore-4.3.3.jar:4.3.3]
    at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283) [httpcore-4.3.3.jar:4.3.3]
    at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251) [httpclient-4.3.6.jar:4.3.6]
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197) [httpclient-4.3.6.jar:4.3.6]
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271) [httpcore-4.3.3.jar:4.3.3]
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123) [httpcore-4.3.3.jar:4.3.3]
    at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685) [httpclient-4.3.6.jar:4.3.6]
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487) [httpclient-4.3.6.jar:4.3.6]
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) [httpclient-4.3.6.jar:4.3.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72) [httpclient-4.3.6.jar:4.3.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:214) [httpclient-4.3.6.jar:4.3.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:160) [httpclient-4.3.6.jar:4.3.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:136) [httpclient-4.3.6.jar:4.3.6]
    at com.ecwid.consul.transport.AbstractHttpTransport.executeRequest(AbstractHttpTransport.java:64) [consul-api-1.1.7.jar:]

Followed by 2 Apache http error messages:

Retrying request to {}->http://172.31.2.106:8300
I/O exception (java.net.SocketException) caught when processing request to {}->http://172.31.2.106:8300: Connection reset

A lot of Googling shows similar errors with connection reset on port 8300 saying that the client should be trying to connect to 8301 instead. Is this some problem in the way I have Consul configured or is this a bug in the Consul client?

Thanks!

spencergibb commented 8 years ago

Why are you trying to connect to 8300? The HTTP api is on 8500 by default if I remember correctly.

jhovell commented 8 years ago

@spencergibb I am not specifying 8300 anywhere in my code. I do notice that the service "consul" resolves to :8300 which is something that the consul server nodes broadcast. I am guessing that is how 8300 is being resolved in Consul API. Is this incorrect or can I change this in my consul configuration?

Here is an example server node

screen shot 2016-05-04 at 12 36 24 pm

The agent nodes register consul-agent-8500 on tcp/8500.

Also I am using Registrator so it is possible that is somehow responsible for these service names that are confusing Consul API.

I'm by no means a Consul expert so a bit lost as to where to start tracking the problem...

vgv commented 8 years ago

Yes, 8500 is default port for HTTP

@jhovell Try to run curl http://172.31.2.106:8500/v1/catalog/nodes

jhovell commented 8 years ago

@vgv

Hide real JSON (vgv)
vgv commented 8 years ago

Thank you.

Your consul agent uses default HTTP port 8500 and you have to use it too. Please, show your code where you construct ConsulClient

jhovell commented 8 years ago

@vgv I'm using the following constructor with this library:

https://github.com/researchgate/archaius-consul/blob/master/src/main/java/net/researchgate/archaius/ConsulConfigurationSource.java#L36

It is initialized with the values

new ConsulConfigurationSource("172.31.3.121:8500,172.31.2.111:8500,172.31.3.216:8500","/my/root/path");
jhovell commented 8 years ago

Also in case it is helpful, this is a problem I only see occasionally. Most Consul clusters I have booted work fine, but occasionally it gets into this state and my only workaround is to destroy the cluster and start again.

Here is what my Consul Agent nodes look like: screen shot 2016-05-04 at 12 46 48 pm

vgv commented 8 years ago

Hmm, looks a bit unusual for me. Why do you start a few consul agents on this one (172.31.3.121) server?

jhovell commented 8 years ago

@vgv yeah I think this is a function of Registrator which - as best I can tell - will register every port found open on the docker container as a "service" and it post-pends port numbers to make the names unique. There is only 1 dockerized Consul agent running on 172.31.3.121, but it has all of those ports open.

I am basing my implementation on this blog post in case it is helpful for context (which includes using Registrator):

https://aws.amazon.com/blogs/compute/service-discovery-via-consul-with-amazon-ecs/

jhovell commented 8 years ago

@vgv I think I see the problem. I don't know if it is an issue with archaius-consul or your library or my setup. It looks like archaius-consul tries to look up the service "consul".

https://github.com/researchgate/archaius-consul/blob/master/src/main/java/net/researchgate/archaius/ConsulConfigurationSource.java#L21

Looking at Consul docs, this looks like this is correct behavior to resolve to port 8300

https://www.consul.io/docs/agent/dns.html (toward the middle-bottom dig @127.0.0.1 -p 8600 consul.service.consul SRV )

So maybe this is a problem with the archaius-consul library that it is supposed to look up not "consul" but "consul-agent-8500"

1) why does this only happen sometimes? - my guess is that the "if" line 53 is sometimes not triggered 2) what is wrong with my consul setup such that "consul" isn't the 8500 agent nodes, but is instead the 8300 server nodes.?

I will keep investigating. Not sure if this is a problem with my Registrator setup or something else.

For reference my dig (I redirect port 8600 to 53 on my consul container):

[ec2-user@ip-172-31-3-121 ~]$ dig @127.0.0.1 -p 53 consul.service.consul. SRV

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.37.rc1.45.amzn1 <<>> @127.0.0.1 -p 53 consul.service.consul. SRV
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36319
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 5

;; QUESTION SECTION:
;consul.service.consul.     IN  SRV

;; ANSWER SECTION:
consul.service.consul.  0   IN  SRV 1 1 8300 i-0fab9fd7.node.usw2.consul.
consul.service.consul.  0   IN  SRV 1 1 8300 i-699d84ae.node.usw2.consul.
consul.service.consul.  0   IN  SRV 1 1 8300 i-0cab9fd4.node.usw2.consul.

;; ADDITIONAL SECTION:
i-0fab9fd7.node.usw2.consul. 0  IN  A   172.31.2.106
i-699d84ae.node.usw2.consul. 0  IN  A   172.31.3.65
i-0cab9fd4.node.usw2.consul. 0  IN  A   172.31.2.105
i-0dab9fd5.node.usw2.consul. 0  IN  A   172.31.2.104
i-6a9d84ad.node.usw2.consul. 0  IN  A   172.31.3.64

;; Query time: 2 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed May  4 20:07:11 2016
;; MSG SIZE  rcvd: 458
jhovell commented 8 years ago

@vgv becoming more convinced this is a problem isolated to https://github.com/researchgate/archaius-consul/ ... looks like setClient() method is being re-called inside an exception block and a new client is being created on port 8300. I think what triggers this is if a node doesn't happen to be available but was still cached in DNS, such as during a rolling update of hardware. I'll take this up with the archaius-consul folks... thanks for helping me debug!