Closed: lcg72 closed this issue 4 years ago
When the last provider shuts down, AbstractSpringCloudRegistry notifies Dubbo via EMPTY_PROTOCOL to close the channel, but the URL carries a parameter named "category" with the value "configurators,routers,providers". When org.apache.dubbo.registry.integration.RegistryDirectory#notify handles this notification it validates the category parameter and only accepts a single value, "configurators", "routers", or "providers" (default "providers"). So I removed the "category" parameter when generating the empty URL in com.alibaba.cloud.dubbo.registry.AbstractSpringCloudRegistry.
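A minimal sketch of the idea (not the actual AbstractSpringCloudRegistry code; buildEmptyURL is a hypothetical helper, and the string literals stand in for Dubbo's EMPTY_PROTOCOL and CATEGORY_KEY constants):

import org.apache.dubbo.common.URL;

// Sketch only: build the empty-protocol notification URL without the combined
// "category" value, so RegistryDirectory#notify does not reject it.
public final class EmptyUrlSketch {

    static URL buildEmptyURL(URL subscribedURL) {
        return subscribedURL
                .setProtocol("empty")          // EMPTY_PROTOCOL
                .removeParameter("category");  // drop "configurators,routers,providers"
    }
}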
When the last provider shuts down, the channel is closed; if a new provider registers at that point, an exception is thrown saying the channel has already been closed. So I removed repository.initializeMetadata(serviceName); the symptom in issue #753 no longer appears. I suggest rolling back the patch @wangzihaogithub proposed for issue #753, because some of those cleanup operations only need to run when the last provider shuts down.
Resubmitted the code with the following changes:
1. Registered a listener for metadata-service change events. Only when the metadata service is gone can the client be considered dead; only then do we clean up the client information and notify Dubbo to close the client connection.
2. Improved the initializeMetadata method: if it fails, the "initialized" flag is not changed. This way initializeMetadata can be entered again on the next ordinary service-instance change, while still being guaranteed to succeed only once (calling it more than once would create multiple proxies, which must be avoided; see the sketch after this list).
3. Fixed the bug when generating emptyURLs so the notification actually reaches Dubbo.
4. When ordinary Dubbo service instances change, initializeMetadata only needs to be called when instances exist; no instances means the remote service is dead and initializeMetadata is not needed, so the call site was moved.
5. In removeMetadataAndInitializedService, fixed the cleanup of subscribedDubboMetadataServiceURLs so stale metadata-service information is actually removed.
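A minimal sketch of the guard described in item 2, assuming a hypothetical repository with an initialized-service set (the field and method names are illustrative, not the actual DubboServiceMetadataRepository code):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: mark a service as initialized only after the metadata call
// succeeds, so a failed attempt can be retried on the next instance change while
// a successful one is never repeated (avoiding duplicate proxies).
public class MetadataInitSketch {
    private final Set<String> initializedServices = ConcurrentHashMap.newKeySet();

    public void initializeMetadata(String serviceName) {
        if (initializedServices.contains(serviceName)) {
            return; // already initialized successfully, do not create another proxy
        }
        try {
            subscribeMetadata(serviceName);       // hypothetical remote metadata call
            initializedServices.add(serviceName); // flag is set only on success
        } catch (RuntimeException e) {
            // leave the flag unset so a later service-instance event can retry
        }
    }

    private void subscribeMetadata(String serviceName) {
        // placeholder for the actual metadata subscription
    }
}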
After testing, the main problems fixed are:
1. When the last remote service dies, Dubbo no longer tries to connect to the remote metadata service.
2. When the remote service is restarted repeatedly, or the number of remote services, their IP addresses, or the Dubbo ports change, Dubbo no longer keeps extra ExchangeClient instances (verified by debugging in each of these scenarios).
From the description, this fix of yours looks correct.
A patch jar is provided for testing. Usage scenario: Version:
I'll take a look tonight and verify it; I'm at work during the day.
You can try calling this fairly critical method: org.apache.dubbo.common.Resetable.reset(URL url).
The other methods never change the old client that is already in memory, because they all create new clients; only this method can change the reconnect destination address.
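For reference, a minimal sketch of what such a call would look like if a reference to the client could be reached (obtaining that reference this way is hypothetical; as discussed below, the ReferenceCountExchangeClient is not normally reachable from the outer layers):

import org.apache.dubbo.common.Resetable;
import org.apache.dubbo.common.URL;

// Hypothetical: given a handle to the underlying client, reset(URL)
// would point its reconnect target at the new provider address.
public class ResetSketch {
    static void repointClient(Resetable client, String newHost, int newPort) {
        URL newUrl = URL.valueOf("dubbo://" + newHost + ":" + newPort);
        client.reset(newUrl); // changes the reconnection destination
    }
}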
You can add this and try:
dubbo.protocol.port: ${random.int[25000,65000]}
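For context, ${random.int[25000,65000]} is Spring Boot's RandomValuePropertySource syntax, so the provider binds to a different Dubbo port on every restart; that simulates the address change you get when a container comes back with a new IP.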
The occasional connection to the old address is caused by a timing window and is sporadic: the ExchangeClient reconnect interval is 60 seconds, but the ExchangeClient is closed with a 10-second delay. If a reconnect attempt falls inside that 10-second window, a connection error occurs. But it happens at most once; it no longer keeps reconnecting every 60 seconds as before.
The parameter that controls the delayed ExchangeClient shutdown is dubbo.service.shutdown.wait, in milliseconds, default 10000.
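As a rough worked example under the defaults: the close is delayed by 10 s while reconnect checks run every 60 s, so a reconnect attempt lands inside the close window in roughly 10/60, about 17% of shutdowns, producing at most one spurious "Connection refused" before the client is fully closed.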
The org.apache.dubbo.common.Resetable.reset(URL url) you mentioned is an interface. Concretely it means ExchangeClient, whose implementation class is ReferenceCountExchangeClient:
public interface ExchangeClient extends Client, ExchangeChannel
public interface Client extends Endpoint, Channel, Resetable, IdleSensible
The furthest we can reach is passing messages down through the various proxies to the DubboInvoker class, which drives the ExchangeClient's actions, but that class does not expose a way to call reset on the ExchangeClient. In other words, the ReferenceCountExchangeClient object is not reachable from the outer layers.
@wangzihaogithub In your test, does the connection exception keep recurring every minute, or does it happen only once? Your reset approach would not help here: in a Docker container environment the IPs change frequently, and under an autoscaler the number of microservices can grow or shrink at any time. When the number of remote microservices shrinks, the corresponding ExchangeClient in this service must be destroyed completely and cannot be reused, so even if reset were callable, it would not apply. I have analyzed ReferenceCountExchangeClient carefully: it is shared and holds a reference counter. One remote Dubbo provider produces exactly one ReferenceCountExchangeClient locally, shared by the metadata service and all n ordinary services. When a metadata service or an ordinary service is destroyed, the counter is decremented, and when it reaches 0 the client is closed. So what we have to guarantee is that the counter never gets out of sync: when all services are destroyed it must reach 0, so that the ReferenceCountExchangeClient can actually be destroyed.
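A minimal sketch of the counting behavior described above (a conceptual illustration, not the actual ReferenceCountExchangeClient source):

import java.util.concurrent.atomic.AtomicInteger;

// Conceptual model: one shared client per remote provider; each subscriber
// (metadata service or ordinary service) increments the counter, each
// destruction decrements it, and the underlying connection is closed only
// when the count reaches zero.
public class RefCountedClientSketch {
    private final AtomicInteger referenceCount = new AtomicInteger(0);

    public void retain() {
        referenceCount.incrementAndGet();
    }

    public void release() {
        if (referenceCount.decrementAndGet() <= 0) {
            closeUnderlyingConnection();
        }
    }

    private void closeUnderlyingConnection() {
        // placeholder: close the real ExchangeClient here
    }
}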
I drew a diagram of ReferenceCountExchangeClient:
It keeps happening continuously; it was still there after waiting 5 minutes.
Did you apply it as the patch jar or by overriding at the code level? Maybe the patch did not take effect? My tests here show no problem; the error only appears while debugging, when the connection is not closed in time, and once it is closed the error is gone.
You can try adding my two classes directly to your test service code (the package names must stay the same) instead of using the patch jar; the patch loading mechanism may differ across operating systems.
It should have nothing to do with those parameters. Try creating the two packages in your local code and copying the two classes in to test.
Let me provide a test project containing two applications, ServiceA (provider) and ServiceB (consumer); both are already patched.
Code: https://github.com/lcg72/testAlibaba
You can remove the patch and run the same test cases to compare. To remove the patch, delete the two patch classes.
The test cases are attached.
I'll try it tomorrow. You're awesome.
The provider cannot recover. I modified your test case and described the situation; please try again: https://github.com/lcg72/testAlibaba/pull/1 (lcg72/testAlibaba#1)
I merged the changes you made to the test project, and my tests show no problem: the provider re-registers normally after a restart, and the consumer can access it normally. Can you describe your environment? Mine is macOS 10.15.4, JDK 8, Nacos 1.1.4.
Thanks a lot for the test, @wangzihaogithub; I reproduced your failure on Windows 10. The problem is IDEA on Windows running in Run mode: if you stop or restart the application directly, it is killed immediately, similar to kill -9. That is, IDEA does not send a kill signal to the application and wait for it to shut itself down; it kills the process outright, so Dubbo's graceful shutdown never runs. Strangely, IDEA on macOS does not behave this way.
On Windows 10, IDEA triggers the ShutdownHook on stop/restart in Debug mode but not in Run mode, while on macOS IDEA triggers the ShutdownHook in either mode. I don't know whether that is an IDEA bug.
The correct way to test on Windows 10 is to run the application with java -jar in a terminal and stop it with Ctrl+C; then the Dubbo application can shut down gracefully (and notify the registry).
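A quick way to check whether your stop method actually triggers JVM shutdown hooks (a small standalone check, not part of the patch):

// Start this, then stop the process the same way you stop your service;
// if "shutdown hook fired" is never printed, Dubbo's graceful shutdown
// cannot run either.
public class ShutdownHookCheck {
    public static void main(String[] args) throws InterruptedException {
        Runtime.getRuntime().addShutdownHook(
                new Thread(() -> System.out.println("shutdown hook fired")));
        System.out.println("running, stop me now...");
        Thread.sleep(Long.MAX_VALUE);
    }
}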
I tried kill -9 on ServiceA under macOS (Linux should behave the same). After a few errors connecting to the old port, ServiceB stopped connecting to ServiceA's old port (as long as Nacos service discovery still lists ServiceA the connection errors occur; once ServiceA is gone from Nacos, ServiceB no longer tries to connect to it). Starting ServiceA again at that point, ServiceB's calls work normally. So stopping from IDEA's Run mode on Windows 10 is not even equivalent to kill -9.
Environment: Nacos 1.2.1 + SCA 2.2.1 + k8s
The first startup and calls work fine; after the provider is redeployed, the pod IP changes, but the consumer keeps calling the old IP, so the service cannot be reached.
@lcg72 The later rewrite does not handle kill -9, which is exactly the k8s case.
You can use the class I posted (IGDubboRegistryInvokerRebuildListener.txt) to work around the problem for now:
@Bean
public IGDubboRegistryInvokerRebuildListener dubboRegistryInvokerRebuildListener() {
    return new IGDubboRegistryInvokerRebuildListener();
}
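A minimal sketch of where the bean declaration can live, assuming the IGDubboRegistryInvokerRebuildListener class from the attached IGDubboRegistryInvokerRebuildListener.txt has been copied into the consumer project (the configuration class name is arbitrary):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Sketch: register the workaround listener so Spring instantiates it at startup.
@Configuration
public class DubboInvokerRebuildConfig {

    @Bean
    public IGDubboRegistryInvokerRebuildListener dubboRegistryInvokerRebuildListener() {
        return new IGDubboRegistryInvokerRebuildListener();
    }
}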
Hello, the two methods o.removeMetadata(serviceName) and o.removeInitializedService(serviceName) no longer exist in SCA 2.2.1; I only see a single removeMetadataAndInitializedService method and don't know how to use it. Could you update your approach? Thanks @wangzihaogithub
Yes, that works; the logic is the same.
@wangzihaogithub Could you post an updated version? In SCA 2.2.1, removeMetadataAndInitializedService takes two parameters, a serviceName and a url, and I don't know how to obtain that url. I'm not very familiar with this part of the source; could you provide it? Thanks.
I still run into this problem in k8s.
I'd like to ask how to actually apply the code here; do I have to roll back to version 2.2.0 first? @wangzihaogithub
@lcg72 In the k8s environment Dubbo's graceful shutdown should be in effect, but version 2.2.1 still does not remove the offline service from the service list.
Exactly, same here. I set up graceful shutdown, it is a normal kill -15, and the shutdown hook is triggered, but it still connects to the old service. This issue should be reopened; could you take a look, @mercyblitz?
Hi, I'm running into this problem too. Could you explain how that url parameter should be passed? @wangzihaogithub
My versions are spring-cloud-alibaba-dependencies:2.2.0.RELEASE and spring-cloud-dependencies:Hoxton.RELEASE, and this problem still occurs. How should it be solved?
2021-08-23 11:13:29.435 INFO 13008 --- [eCheck-thread-1] o.a.d.r.e.s.header.ReconnectTimerTask : [DUBBO] Initial connection to HeaderExchangeClient [channel=org.apache.dubbo.remoting.transport.netty4.NettyClient [192.168.13.1:0 -> /192.168.88.100:20900]], dubbo version: 2.7.4.1, current host: 192.168.13.1
2021-08-23 11:13:31.440 ERROR 13008 --- [eCheck-thread-1] o.a.d.r.e.s.header.ReconnectTimerTask : [DUBBO] Fail to connect to HeaderExchangeClient [channel=org.apache.dubbo.remoting.transport.netty4.NettyClient [192.168.13.1:0 -> /192.168.88.100:20900]], dubbo version: 2.7.4.1, current host: 192.168.13.1
org.apache.dubbo.remoting.RemotingException: client(url: dubbo://192.168.88.100:20900/com.alibaba.cloud.dubbo.service.DubboMetadataService?anyhost=false&application=jingxun-campus-service&bind.ip=192.168.88.100&bind.port=20900&check=false&codec=dubbo&deprecated=false&dubbo=2.0.2&dynamic=true&generic=true&group=jingxun-common-account&heartbeat=60000&interface=com.alibaba.cloud.dubbo.service.DubboMetadataService&lazy=false&methods=getAllServiceKeys,getServiceRestMetadata,getExportedURLs,getAllExportedURLs&pid=13008&qos.enable=false&register.ip=192.168.13.1&release=2.7.4.1&remote.application=jingxun-common-account&revision=2.2.0.RELEASE&side=consumer&sticky=false&timeout=2000&timestamp=1629685852742&version=1.0.0) failed to connect to server /192.168.88.100:20900, error message is:Connection refused: no further information: /192.168.88.100:20900
at org.apache.dubbo.remoting.transport.netty4.NettyClient.doConnect(NettyClient.java:161)
at org.apache.dubbo.remoting.transport.AbstractClient.connect(AbstractClient.java:190)
at org.apache.dubbo.remoting.transport.AbstractClient.reconnect(AbstractClient.java:246)
at org.apache.dubbo.remoting.exchange.support.header.HeaderExchangeClient.reconnect(HeaderExchangeClient.java:155)
at org.apache.dubbo.remoting.exchange.support.header.ReconnectTimerTask.doTask(ReconnectTimerTask.java:49)
at org.apache.dubbo.remoting.exchange.support.header.AbstractTimerTask.run(AbstractTimerTask.java:87)
at org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:648)
at org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:727)
at org.apache.dubbo.common.timer.HashedWheelTimer$Worker.run(HashedWheelTimer.java:449)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: no further information: /192.168.88.100:20900
Caused by: java.net.ConnectException: Connection refused: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Which Component: Nacos Discovery, Dubbo
Describe the bug
Nacos is used as the service registry. When the provider goes offline (a service-instance change event), the Dubbo reconnect task keeps running every minute. The cause may be that spring-cloud-alibaba does not pass the service-instance change notification on to Dubbo, or that Dubbo does not handle it.
To Reproduce
Steps to reproduce the behavior:
With the registry of the provider and the consumer configured as Nacos:
1. Provider parameters: dubbo.protocol.name=dubbo, dubbo.protocol.port=28801, dubbo.scan.base-packages=xx.xx.xx, dubbo.registry.address=spring-cloud://localhost
2. Consumer parameters: dubbo.consumer.check=false, dubbo.registry.address=spring-cloud://localhost
With the registry of the provider and the consumer configured as ZooKeeper:
1. Provider parameters: dubbo.protocol.name=dubbo, dubbo.protocol.port=28801, dubbo.scan.base-packages=xx.xx.xx, dubbo.registry.address=zookeeper://127.0.0.1:2181
2. Consumer parameters: dubbo.consumer.check=false, dubbo.registry.address=zookeeper://127.0.0.1:2181
The scenario above simulates a Docker container environment: in Docker, microservices are restarted frequently, and after a restart the IP address changes (a port change is used here to simulate the address change; the principle is the same).
Conclusion:
1. With ZooKeeper as the service registry there is no problem.
2. With Nacos as the service registry, when the provider goes offline (a service-instance change event) the channel is not handled, so the reconnect task keeps running. The cause may be that spring-cloud-alibaba does not pass the service-instance change notification on to Dubbo, or that Dubbo does not handle it.
Expected behavior
Nacos as the registry should behave as correctly as ZooKeeper does.
Screenshots
Additional context: CentOS 7.5, Java 8, Version 2.2.0.RELEASE