Adding Connector to Preexisting Group Gives Unknown Host Exception

chenchik commented 7 years ago

I have been experimenting with scaling workers and seeing how it improves performance or CPU usage per container/pod in OpenShift.

I'm stuck on a particular issue right now, though: I am unable to add another HDFS connector to a preexisting group of workers. One worker always ends up with a log that looks like an infinite series of unknown host exceptions:

[2017-07-18 08:11:54,707] INFO SinkConnectorConfig values:
    connector.class = io.confluent.connect.hdfs.HdfsSinkConnector
    key.converter = class org.apache.kafka.connect.storage.StringConverter
    name = 07-19-1146am
    tasks.max = 5
    topics = [p-5]
    transforms = null
    value.converter = class org.apache.kafka.connect.storage.StringConverter
 (org.apache.kafka.connect.runtime.SinkConnectorConfig)
[2017-07-18 08:11:54,708] ERROR Unexpected error during connector task reconfiguration:  (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
[2017-07-18 08:11:54,708] ERROR Task reconfiguration for 07-19-1146am failed unexpectedly, this connector will not be properly reconfigured unless manually triggered. (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
[2017-07-18 08:11:54,720] ERROR IO error forwarding REST request:  (org.apache.kafka.connect.runtime.rest.RestServer)
java.net.UnknownHostException: connect47
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at java.net.Socket.connect(Socket.java:538)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
    at sun.net.www.http.HttpClient.New(HttpClient.java:308)
    at sun.net.www.http.HttpClient.New(HttpClient.java:326)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
    at sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1283)
    at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1258)
    at org.apache.kafka.connect.runtime.rest.RestServer.httpRequest(RestServer.java:216)
    at org.apache.kafka.connect.runtime.distributed.DistributedHerder$18.run(DistributedHerder.java:992)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[2017-07-18 08:11:54,724] ERROR Request to leader to reconfigure connector tasks failed (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.connect.runtime.rest.errors.ConnectRestException: IO Error trying to forward REST request: connect47
    at org.apache.kafka.connect.runtime.rest.RestServer.httpRequest(RestServer.java:242)
    at org.apache.kafka.connect.runtime.distributed.DistributedHerder$18.run(DistributedHerder.java:992)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.UnknownHostException: connect47
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at java.net.Socket.connect(Socket.java:538)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
    at sun.net.www.http.HttpClient.New(HttpClient.java:308)
    at sun.net.www.http.HttpClient.New(HttpClient.java:326)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
    at sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1283)
    at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1258)
    at org.apache.kafka.connect.runtime.rest.RestServer.httpRequest(RestServer.java:216)
    ... 6 more
[2017-07-18 08:11:54,728] ERROR Failed to reconfigure connector's tasks, retrying after backoff: (org.apache.kafka.connect.runtime.distributed.DistributedHerder)

If I remove and add another connector on just one worker, everything works fine. Also, if I start out with one worker, add a connector, and then scale the workers up to the value of tasks.max, they all work great together. But if I have a group of workers (2 or more) already working together and I try to add a connector, these unknown host exceptions pop up on one of the workers and prevent the entire group of workers from getting anything done.

I'm using an updated build of the HDFS connector that includes a recent fix for an equality bug from their GitHub repository. I took the jar file from the Jenkins build.

Here is the issue I'm referencing:

https://github.com/confluentinc/kafka-connect-hdfs/issues/132

Jenkins build:

https://jenkins.confluent.io/job/kafka-connect-hdfs-pr/99/io.confluent$kafka-connect-hdfs/

This is probably also related to the fact that when I make anything except a GET request to a group of pods/containers, most of the time I get this kind of response:

{
    "error_code": 500,
    "message": "IO Error trying to forward REST request: connect47"
}

Both of these issues were occurring all the time before I upgraded my hdfs connector as well. How can I prevent this unknown host exception from happening?

michyliao commented 5 years ago

Hi, were you able to figure this out?

OneCricketeer commented 5 years ago

I think the problem here is related to setting rest.advertised.host.name and/or rest.advertised.port in Connect.

The advertised address needs to be resolvable from outside that container, and any newly joining workers need to be able to resolve that name as well.
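
One way to check this is to test, from inside each worker pod, whether the other workers' advertised host names resolve. Below is a minimal sketch in Python using only the standard library. The helper names can_resolve and unresolvable are made up for illustration, the host list is a placeholder taken from the log above, and 8083 is assumed as the REST port (the Connect default). It only tests name resolution, not whether the REST endpoint is actually reachable.

```python
import socket


def can_resolve(host: str, port: int = 8083) -> bool:
    """Return True if `host` resolves to at least one address from this machine.

    Hypothetical helper for illustration; 8083 is assumed as the REST port.
    """
    try:
        return len(socket.getaddrinfo(host, port)) > 0
    except socket.gaierror:
        # Name lookup failed; the Java side would report UnknownHostException.
        return False


def unresolvable(hosts, port: int = 8083):
    """Return the subset of `hosts` that cannot be resolved from this machine."""
    return [h for h in hosts if not can_resolve(h, port)]


if __name__ == "__main__":
    # Placeholder list: replace with the advertised names of your own workers.
    advertised_hosts = ["connect47"]
    for name in unresolvable(advertised_hosts):
        print(f"cannot resolve advertised host: {name}")
```

If a name such as connect47 shows up as unresolvable from another pod, that worker is probably advertising its container hostname, and its advertised host name should be set to something the other pods can look up (for example a service name or pod IP).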

confluentinc / cp-docker-images

Adding Connector to Preexisting Group Gives Unknown Host Exception #296