apache / dubbo

The java implementation of Apache Dubbo. An RPC and microservice framework.
https://dubbo.apache.org/
Apache License 2.0
40.4k stars 26.41k forks source link

dubbo consumer cannot stop reconnect to the old provider after provider changed ip in docker #5953

Closed majjfootman closed 4 years ago

majjfootman commented 4 years ago

Environment

Steps to reproduce this issue

  1. Start module A and module B, A depends on B, the register is nacos. Now A invoke B's dubbo interface correct.
  2. Restart module B, because in k8s and module B’s IP address is changed.
  3. Now A invoke B correct, but A begin reconnect to the old provider B and can not stop. The logs as below:

Pls. provide [GitHub address] to reproduce this issue.

What do you expected from the above steps?

stop reconnect to the old provider

What actually happens?

Actual Result is cannot stop reconnect to the old provider

If there is an exception, please attach the exception trace:

[ERROR] 2020-03-31 22:16:19.303 [] [] [dubbo-client-idleCheck-thread-1] ReconnectTimerTask.doTask[51]: [DUBBO] Fail to connect to HeaderExchangeClient [channel=org.apache.dubbo.remoting.transport.netty4.NettyClient [10.42.1.119:0 -> /10.42.1.115:20881]], dubbo version: 2.7.4.1, current host: 10.42.1.119
org.apache.dubbo.remoting.RemotingException: client(url: dubbo://10.42.1.115:20881/com.alibaba.cloud.dubbo.service.DubboMetadataService?anyhost=true&application=mall-open&bind.ip=10.42.1.115&bind.port=20881&check=false&codec=dubbo&deprecated=false&dubbo=2.0.2&dynamic=true&generic=true&group=mall-goods&heartbeat=60000&interface=com.alibaba.cloud.dubbo.service.DubboMetadataService&lazy=false&methods=getAllServiceKeys,getServiceRestMetadata,getExportedURLs,getAllExportedURLs&pid=6&qos.enable=false&register.ip=10.42.1.119&release=2.7.4.1&remote.application=mall-goods&revision=2.2.0.RELEASE&side=consumer&sticky=false&timeout=30000&timestamp=1585642448627&version=1.0.0) failed to connect to server /10.42.1.115:20881 client-side timeout 3000ms (elapsed: 3000ms) from netty client 10.42.1.119 using dubbo version 2.7.4.1
at org.apache.dubbo.remoting.transport.netty4.NettyClient.doConnect(NettyClient.java:166)
at org.apache.dubbo.remoting.transport.AbstractClient.connect(AbstractClient.java:190)
at org.apache.dubbo.remoting.transport.AbstractClient.reconnect(AbstractClient.java:246)
at org.apache.dubbo.remoting.exchange.support.header.HeaderExchangeClient.reconnect(HeaderExchangeClient.java:155)
at org.apache.dubbo.remoting.exchange.support.header.ReconnectTimerTask.doTask(ReconnectTimerTask.java:49)
at org.apache.dubbo.remoting.exchange.support.header.AbstractTimerTask.run(AbstractTimerTask.java:87)
at org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:648)
at org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:727)
at org.apache.dubbo.common.timer.HashedWheelTimer$Worker.run(HashedWheelTimer.java:449)
at java.lang.Thread.run(Thread.java:748)
[ERROR] 2020-03-31 22:17:23.303 [] [] [dubbo-client-idleCheck-thread-1] ReconnectTimerTask.doTask[51]: [DUBBO] Fail to connect to HeaderExchangeClient [channel=org.apache.dubbo.remoting.transport.netty4.NettyClient [10.42.1.119:0 -> /10.42.1.115:20881]], dubbo version: 2.7.4.1, current host: 10.42.1.119
org.apache.dubbo.remoting.RemotingException: client(url: dubbo://10.42.1.115:20881/com.alibaba.cloud.dubbo.service.DubboMetadataService?anyhost=true&application=mall-open&bind.ip=10.42.1.115&bind.port=20881&check=false&codec=dubbo&deprecated=false&dubbo=2.0.2&dynamic=true&generic=true&group=mall-goods&heartbeat=60000&interface=com.alibaba.cloud.dubbo.service.DubboMetadataService&lazy=false&methods=getAllServiceKeys,getServiceRestMetadata,getExportedURLs,getAllExportedURLs&pid=6&qos.enable=false&register.ip=10.42.1.119&release=2.7.4.1&remote.application=mall-goods&revision=2.2.0.RELEASE&side=consumer&sticky=false&timeout=30000&timestamp=1585642448627&version=1.0.0) failed to connect to server /10.42.1.115:20881 client-side timeout 3000ms (elapsed: 3000ms) from netty client 10.42.1.119 using dubbo version 2.7.4.1
at org.apache.dubbo.remoting.transport.netty4.NettyClient.doConnect(NettyClient.java:166)
at org.apache.dubbo.remoting.transport.AbstractClient.connect(AbstractClient.java:190)
at org.apache.dubbo.remoting.transport.AbstractClient.reconnect(AbstractClient.java:246)
at org.apache.dubbo.remoting.exchange.support.header.HeaderExchangeClient.reconnect(HeaderExchangeClient.java:155)
at org.apache.dubbo.remoting.exchange.support.header.ReconnectTimerTask.doTask(ReconnectTimerTask.java:49)
at org.apache.dubbo.remoting.exchange.support.header.AbstractTimerTask.run(AbstractTimerTask.java:87)
at org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:648)
at org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:727)
at org.apache.dubbo.common.timer.HashedWheelTimer$Worker.run(HashedWheelTimer.java:449)
at java.lang.Thread.run(Thread.java:748)
[ERROR] 2020-03-31 22:18:27.303 [] [] [dubbo-client-idleCheck-thread-1] ReconnectTimerTask.doTask[51]: [DUBBO] Fail to connect to HeaderExchangeClient [channel=org.apache.dubbo.remoting.transport.netty4.NettyClient [10.42.1.119:0 -> /10.42.1.115:20881]], dubbo version: 2.7.4.1, current host: 10.42.1.119
org.apache.dubbo.remoting.RemotingException: client(url: dubbo://10.42.1.115:20881/com.alibaba.cloud.dubbo.service.DubboMetadataService?anyhost=true&application=mall-open&bind.ip=10.42.1.115&bind.port=20881&check=false&codec=dubbo&deprecated=false&dubbo=2.0.2&dynamic=true&generic=true&group=mall-goods&heartbeat=60000&interface=com.alibaba.cloud.dubbo.service.DubboMetadataService&lazy=false&methods=getAllServiceKeys,getServiceRestMetadata,getExportedURLs,getAllExportedURLs&pid=6&qos.enable=false&register.ip=10.42.1.119&release=2.7.4.1&remote.application=mall-goods&revision=2.2.0.RELEASE&side=consumer&sticky=false&timeout=30000&timestamp=1585642448627&version=1.0.0) failed to connect to server /10.42.1.115:20881 client-side timeout 3000ms (elapsed: 3000ms) from netty client 10.42.1.119 using dubbo version 2.7.4.1
at org.apache.dubbo.remoting.transport.netty4.NettyClient.doConnect(NettyClient.java:166)
at org.apache.dubbo.remoting.transport.AbstractClient.connect(AbstractClient.java:190)
at org.apache.dubbo.remoting.transport.AbstractClient.reconnect(AbstractClient.java:246)
at org.apache.dubbo.remoting.exchange.support.header.HeaderExchangeClient.reconnect(HeaderExchangeClient.java:155)
at org.apache.dubbo.remoting.exchange.support.header.ReconnectTimerTask.doTask(ReconnectTimerTask.java:49)
at org.apache.dubbo.remoting.exchange.support.header.AbstractTimerTask.run(AbstractTimerTask.java:87)
at org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:648)
at org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:727)
at org.apache.dubbo.common.timer.HashedWheelTimer$Worker.run(HashedWheelTimer.java:449)
at java.lang.Thread.run(Thread.java:748)
AlbumenJ commented 4 years ago

Perhaps you can check your nacos properties of live times. It seems as if the old version provider url is still in nacos.

majjfootman commented 4 years ago

After check my nacos client and server properties,I am using the default properties configuration, and the client as blow:

dubbo:
  scan:
    base-packages: xxxxxxxx
  consumer:
    check: false
    timeout: 30000
  protocol:
    name: dubbo
    port: 20881
  registry:
    address: nacos://xxxxxxx:8848
  cloud:
    subscribed-services: xxxxxxxxx

spring:
  application:
    name: xxxxxxxx
  profiles:
    active: dev
  cloud:
    nacos:
      config:
        server-addr: xxxxxxxx:8848
      discovery:
        server-addr: xxxxxxx:8848
majjfootman commented 4 years ago

When the reconnect happens, I had tried to restart the nacos server, but the reconnect stills can not stop. So my dubbo consumer continues trying to reconnect the old provider ip for several hours, and the exception logs is continuing output. Is it normally

AlbumenJ commented 4 years ago

What I said nacos properties is in you nacos server. Can you reproduce this issue and check nacos on its web management (http://xxxxxx:8848/nacos/) ? You can notice the Healthy Instance Count in ServiceManagement - Service List.

majjfootman commented 4 years ago

I had checked the nacos web pannel, the old provider was not show in the service list any more, and the Healthy Instance Count is '1' which is the new provider

AlbumenJ commented 4 years ago

What about the Subscribers and Listening Query sections are correct or not?

AlbumenJ commented 4 years ago

Just now I try to reproduce this issue in my computer, I found the module A will still try to reconnect the old version B before this new version's url notified by nacos and old version healthy checkout timeout. And this may takes about two minutes before stopping reconnecting to the old version provider. After that everything returns to normal. @majjfootman Did module A still try to reconnect old provider B after serveral minutes ?

majjfootman commented 4 years ago

Moudule A still try to reconnect old provider B after serveral minutes, even after serveral hours. May be my config is not correct, or the nacos server version is not compatibility? My nacos server version is 1.1.3, and I am using the spring-cloud-alibaba 2.2.0.RELEASE. Can you share your module's version or configuration, Thank you very much

AlbumenJ commented 4 years ago

I am just dirctly use dubbo without spring-cloud-alibaba.

        <!-- dubbo -->
        <dependency>
            <groupId>org.apache.dubbo</groupId>
            <artifactId>dubbo-spring-boot-starter</artifactId>
            <version>${dubbo-version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.dubbo</groupId>
            <artifactId>dubbo-registry-nacos</artifactId>
            <version>${dubbo-version}</version>
            <exclusions>
                <exclusion>
                    <artifactId>log4j</artifactId>
                    <groupId>log4j</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>dubbo-common</artifactId>
                    <groupId>org.apache.dubbo</groupId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>com.alibaba.nacos</groupId>
            <artifactId>nacos-client</artifactId>
            <version>${nacos-version}</version>
        </dependency>

properties is below

dubbo:
  scan:
    base-packages: com.xxxxxx
  protocol:
    name: dubbo
    port: -1
  registry:
    address: nacos://127.0.0.1:8848
  application:
    name: xxxxxx
  consumer:
    check: false
AlbumenJ commented 4 years ago

Just now I notice your error message, the main error is client-side timeout. Perhaps this is the main reason because in general if a server shutdown the client reconnect it will come out Connection refused owing to closed port. But in k8s if a pod is totally closed, the ip cannot be connect no longer and the result may changed while the request is loopback in the k8s network or even turned to outside network. This maybe the issue of doClient function in NettyClient.

majjfootman commented 4 years ago

Thank you for your help. I run my application at my local environment and reproduce, the connection exception type changes but still continue reconnect. Then I use your dirctly maven dependency instand of spring cloud alibaba, it was not reconnecting any more. I will try this in k8s later. May be the continues reconnection is the issue of spring cloud alibaba.

wghdir commented 4 years ago

的确是重启provider会这样,我在consumer上调用几次,发现就在不停的连接,注册中心返回的数据查了下,没有原来的ip数据。我的docker用的是swarm网络,容器重启后ip会变,计算机名不变。