apache / dubbo

The Java implementation of Apache Dubbo, an RPC and microservice framework.
https://dubbo.apache.org/
Apache License 2.0

curator_version 4.0.1 has a defect #8689

Closed zrlw closed 2 years ago

zrlw commented 3 years ago

Environment

Steps to reproduce this issue

Dubbo 3.0 builds on GitHub keep timing out. Looking at the build-and-test-3.yml workflow, the unit-test timeout-minutes values are 40 (with Integration Tests) and 50 (without Integration Tests). I simply changed both to 60 minutes, which helped a lot, but builds still fail. The logs are full of this: [Curator-ConnectionStateManager-0] WARN org.apache.curator.framework.state.ConnectionStateManager - Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 63515. Adjusted session timeout ms: 60000 I confirmed this is a defect that was only fixed in Curator 4.1.0, while Dubbo uses 4.0.1: https://issues.apache.org/jira/browse/CURATOR-405

zrlw commented 3 years ago

A quick look at the master branch shows it also uses 4.0.1.

By the way, the following logic in the AbstractClusterInvokerTest test class is flawed:

        Double d = (Math.random() * (max - min + 1) + min);
        int runs = d.intValue();
        Assertions.assertTrue(runs > min); 

Math.random() can return 0, so the assertion here should be >=. It happened to return 0 during a GitHub build, which made the build fail.
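A minimal sketch of the suggested correction (the same snippet with the assertion relaxed; this is the proposed fix, not merged code):

    Double d = (Math.random() * (max - min + 1) + min);
    int runs = d.intValue();
    // Math.random() can return 0.0, so runs can equal min exactly;
    // assert >= instead of > to avoid a spurious failure.
    Assertions.assertTrue(runs >= min);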

pinxiong commented 3 years ago

Environment

  • Dubbo version: 3.0

Steps to reproduce this issue

Dubbo 3.0 builds on GitHub keep timing out. Looking at the build-and-test-3.yml workflow, the unit-test timeout-minutes values are 40 (with Integration Tests) and 50 (without Integration Tests). I simply changed both to 60 minutes, which helped a lot, but builds still fail. The logs are full of this: [Curator-ConnectionStateManager-0] WARN org.apache.curator.framework.state.ConnectionStateManager - Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 63515. Adjusted session timeout ms: 60000 I confirmed this is a defect that was only fixed in Curator 4.1.0, while Dubbo uses 4.0.1: https://issues.apache.org/jira/browse/CURATOR-405

Thanks for the information! I have upgraded Curator to 4.1.0, and this kind of error has been greatly reduced. See #8686

However, there are still problems in the unit tests that make them take longer; I'm trying to fix them.

If you are interested in this issue, you are welcome to join us!

zrlw commented 3 years ago

Environment

  • Dubbo version: 3.0

Steps to reproduce this issue

Dubbo 3.0 builds on GitHub keep timing out. Looking at the build-and-test-3.yml workflow, the unit-test timeout-minutes values are 40 (with Integration Tests) and 50 (without Integration Tests). I simply changed both to 60 minutes, which helped a lot, but builds still fail. The logs are full of this: [Curator-ConnectionStateManager-0] WARN org.apache.curator.framework.state.ConnectionStateManager - Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 63515. Adjusted session timeout ms: 60000 I confirmed this is a defect that was only fixed in Curator 4.1.0, while Dubbo uses 4.0.1: https://issues.apache.org/jira/browse/CURATOR-405

Thanks for the information! I have upgraded Curator to 4.1.0, and this kind of error has been greatly reduced. See #8686

However, there are still problems in the unit tests that make them take longer; I'm trying to fix them.

If you are interested in this issue, you are welcome to join us!

I see you split GenericServiceTest into two unit tests, which mostly removes the trigger for the IllegalState "Failed to check the status" exception, but it doesn't eliminate the root cause: when the refreshInvoker method of ServiceDiscoveryRegistryDirectory on the 3.0 branch was optimized, isAvailable() was overlooked. I opened #8701; please review whether that fix is appropriate.

zrlw commented 3 years ago

Curator 4.1.0 also has quite a few defects. Checking issues.apache.org, fixes continued all the way up to the current 5.2.0 release, but bumping the major version carries compatibility risk, so I suggest testing thoroughly before considering version 5. https://issues.apache.org/jira/browse/CURATOR-534?jql=project%20%3D%20CURATOR%20AND%20fixVersion%20%3D%204.2.0 https://issues.apache.org/jira/browse/CURATOR-534?jql=project%20%3D%20CURATOR%20AND%20fixVersion%20%3D%204.3.0 https://issues.apache.org/jira/browse/CURATOR-534?jql=project%20%3D%20CURATOR%20AND%20fixVersion%20%3D%205.0.0 https://issues.apache.org/jira/browse/CURATOR-534?jql=project%20%3D%20CURATOR%20AND%20fixVersion%20%3D%205.1.0 https://issues.apache.org/jira/browse/CURATOR-534?jql=project%20%3D%20CURATOR%20AND%20fixVersion%20%3D%205.2.0

pinxiong commented 3 years ago

Curator 4.1.0 also has quite a few defects. Checking issues.apache.org, fixes continued all the way up to the current 5.2.0 release, but bumping the major version carries compatibility risk, so I suggest testing thoroughly before considering version 5. https://issues.apache.org/jira/browse/CURATOR-534?jql=project%20%3D%20CURATOR%20AND%20fixVersion%20%3D%204.2.0 https://issues.apache.org/jira/browse/CURATOR-534?jql=project%20%3D%20CURATOR%20AND%20fixVersion%20%3D%204.3.0 https://issues.apache.org/jira/browse/CURATOR-534?jql=project%20%3D%20CURATOR%20AND%20fixVersion%20%3D%205.0.0 https://issues.apache.org/jira/browse/CURATOR-534?jql=project%20%3D%20CURATOR%20AND%20fixVersion%20%3D%205.1.0 https://issues.apache.org/jira/browse/CURATOR-534?jql=project%20%3D%20CURATOR%20AND%20fixVersion%20%3D%205.2.0

I tried upgrading to the latest version, 5.2.0, but ran into problems with undefined classes. For compatibility, I only upgraded to 4.1.0.

zrlw commented 3 years ago

The slow unit tests may still be related to Curator or ZooKeeper. For example, starting at 23:44 the logs throw ClosedChannelException non-stop. (Update 2021-10-12: the zk sessions behind these "Ignoring exception during shutdown input" messages were created by test classes that had already finished, but a defect in destroy left those sessions unclosed: #9003)

2021-09-06T23:44:55.1937413Z [06/09/21 23:44:54:076 UTC] Curator-ConnectionStateManager-0  WARN curator.CuratorZookeeperClient:  [DUBBO] Curator zookeeper connection of session 1000032d4730000 timed out. connection timeout value is 3000, session expire timeout value is 60000, dubbo version: 3.0.3-SNAPSHOT, current host: 10.1.0.234
2021-09-06T23:44:55.1939039Z 23:44:54.125 [main] DEBUG org.apache.curator.RetryLoop - Retrying operation
2021-09-06T23:44:55.1940191Z 23:44:55.075 [DubboSaveMetadataReport-thread-1] DEBUG org.apache.curator.RetryLoop - Retrying operation
2021-09-06T23:44:55.1941767Z 23:44:55.118 [SessionTracker] INFO org.apache.zookeeper.server.SessionTrackerImpl - SessionTrackerImpl exited loop!
2021-09-06T23:44:55.1943523Z 23:44:55.118 [SessionTracker] INFO org.apache.zookeeper.server.SessionTrackerImpl - SessionTrackerImpl exited loop!
2021-09-06T23:44:55.1945224Z 23:44:55.119 [SessionTracker] INFO org.apache.zookeeper.server.SessionTrackerImpl - SessionTrackerImpl exited loop!
2021-09-06T23:44:55.1946907Z 23:44:55.119 [SessionTracker] INFO org.apache.zookeeper.server.SessionTrackerImpl - SessionTrackerImpl exited loop!
2021-09-06T23:44:55.5686930Z 23:44:55.566 [main-SendThread(MiningMadness.com:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server MiningMadness.com/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2021-09-06T23:44:55.9022911Z 23:44:55.899 [main-SendThread(MiningMadness.com:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server MiningMadness.com/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2021-09-06T23:44:55.9025432Z 23:44:55.900 [main-SendThread(MiningMadness.com:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server MiningMadness.com/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2021-09-06T23:44:56.5741582Z 23:44:56.569 [main-SendThread(MiningMadness.com:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket error occurred: MiningMadness.com/127.0.0.1:2181: Connection refused: no further information
2021-09-06T23:44:56.5747927Z 23:44:56.569 [main-SendThread(MiningMadness.com:2181)] DEBUG org.apache.zookeeper.ClientCnxnSocketNIO - Ignoring exception during shutdown input
2021-09-06T23:44:56.5749630Z java.nio.channels.ClosedChannelException: null
2021-09-06T23:44:56.5751142Z    at java.base/sun.nio.ch.SocketChannelImpl.ensureOpen(SocketChannelImpl.java:160)
2021-09-06T23:44:56.5752932Z    at java.base/sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:913)
2021-09-06T23:44:56.5754570Z    at java.base/sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:398)
2021-09-06T23:44:56.5756327Z    at org.apache.zookeeper.ClientCnxnSocketNIO.cleanup(ClientCnxnSocketNIO.java:200)
2021-09-06T23:44:56.5757960Z    at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1250)
2021-09-06T23:44:56.5759224Z    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1174)

And it keeps being thrown until 23:50:

2021-09-06T23:50:23.1053550Z 23:50:23.103 [main-SendThread(MiningMadness.com:2181)] DEBUG org.apache.zookeeper.ClientCnxnSocketNIO - Ignoring exception during shutdown input
2021-09-06T23:50:23.1055909Z java.nio.channels.ClosedChannelException: null
2021-09-06T23:50:23.1057451Z    at java.base/sun.nio.ch.SocketChannelImpl.ensureOpen(SocketChannelImpl.java:160)
2021-09-06T23:50:23.1060347Z    at java.base/sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:913)
2021-09-06T23:50:23.1062142Z    at java.base/sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:398)
2021-09-06T23:50:23.1063975Z    at org.apache.zookeeper.ClientCnxnSocketNIO.cleanup(ClientCnxnSocketNIO.java:200)
2021-09-06T23:50:23.1067166Z    at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1250)
2021-09-06T23:50:23.1068609Z    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1174)
2021-09-06T23:50:23.1070339Z 23:50:23.103 [main-SendThread(MiningMadness.com:2181)] DEBUG org.apache.zookeeper.ClientCnxnSocketNIO - Ignoring exception during shutdown output
2021-09-06T23:50:23.1071930Z java.nio.channels.ClosedChannelException: null
2021-09-06T23:50:23.1073323Z    at java.base/sun.nio.ch.SocketChannelImpl.ensureOpen(SocketChannelImpl.java:160)
2021-09-06T23:50:23.1074973Z    at java.base/sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:930)
2021-09-06T23:50:23.1076560Z    at java.base/sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:406)
2021-09-06T23:50:23.1078240Z    at org.apache.zookeeper.ClientCnxnSocketNIO.cleanup(ClientCnxnSocketNIO.java:207)
2021-09-06T23:50:23.1079850Z    at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1250)
2021-09-06T23:50:23.1081110Z    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1174)
2021-09-06T23:50:23.2764195Z [06/09/21 23:50:23:271 UTC] DubboMetadataReportRetryTimer-thread-1  INFO support.AbstractMetadataReport$MetadataReportRetry:  [DUBBO] start to retry task for metadata report. retry times:25, dubbo version: 3.0.3-SNAPSHOT, current host: 10.1.0.234

Full log: https://pipelines.actions.githubusercontent.com/edZSyu50aYGyPRvHKTBXednMbERgfjJXUY03o7vWNa7z6fodei/_apis/pipelines/1/runs/62/signedlogcontent/32?urlExpires=2021-09-07T04%3A42%3A44.6534861Z&urlSigningMethod=HMACV1&urlSignature=Gzxk%2BCKAQpxTrwnbh2QlkSIO01Ul%2B7ad2S0%2FD%2FeCXOE%3D

Key point: the following appears at high frequency: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for xxx/x.x.x.x:20880 at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) Look familiar? https://github.com/apache/dubbo/issues/3510

zrlw commented 3 years ago

Curator 4.0 has a hard dependency on ZooKeeper 3.5.x, but Dubbo 3 still uses 3.4.13.

zrlw commented 3 years ago

I suggest creating a test branch and upgrading both zk and Curator; the ZooKeeper matching Curator 5.2.0 should be 3.6.3.

pinxiong commented 3 years ago

Curator 4.0 has a hard dependency on ZooKeeper 3.5.x, but Dubbo 3 still uses 3.4.13.

I suggest creating a test branch and upgrading both zk and Curator; the ZooKeeper matching Curator 5.2.0 should be 3.6.3.

There are still a lot of users on the old 3.4.13 version; the community has considered this situation and also supports the higher versions.

You can see dubbo-remoting-zookeeper-curator5

Dubbo can select the right version depending on the deployment environment.

zrlw commented 3 years ago

I found another slow spot (DubboBootstrapMultiInstanceTest). (Note: my initial judgment is that this is caused by parallel testing; I opened a separate issue, #8725.) Starting at 23:18, org.apache.dubbo.remoting.transport.netty4.NettyClient.doConnect(NettyClient.java:171):

2021-09-06T23:18:01.6130806Z org.apache.dubbo.remoting.RemotingException: client(url: dubbo://10.1.0.234:2001/org.apache.dubbo.metadata.MetadataService?codec=dubbo&connect.lazy.initial.state=true&dubbo=2.0.2&group=provider-app&heartbeat=60000&port=2001&protocol=dubbo&release=3.0.3-SNAPSHOT&send.reconnect=true&side=consumer&timeout=5000&version=1.0.0) failed to connect to server /10.1.0.234:2001, error message is:Connection refused: no further information: /10.1.0.234:2001
2021-09-06T23:18:01.6140508Z    at org.apache.dubbo.remoting.transport.netty4.NettyClient.doConnect(NettyClient.java:171)
2021-09-06T23:18:01.6143201Z    at org.apache.dubbo.remoting.transport.AbstractClient.connect(AbstractClient.java:202)
2021-09-06T23:18:01.6145717Z    at org.apache.dubbo.remoting.transport.AbstractClient.reconnect(AbstractClient.java:258)
2021-09-06T23:18:01.6155334Z    at org.apache.dubbo.remoting.exchange.support.header.HeaderExchangeClient.reconnect(HeaderExchangeClient.java:166)
2021-09-06T23:18:01.6164494Z    at org.apache.dubbo.remoting.exchange.support.header.ReconnectTimerTask.doTask(ReconnectTimerTask.java:49)
2021-09-06T23:18:01.6167756Z    at org.apache.dubbo.remoting.exchange.support.header.AbstractTimerTask.run(AbstractTimerTask.java:87)
2021-09-06T23:18:01.6170356Z    at org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:648)
2021-09-06T23:18:01.6172596Z    at org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:727)
2021-09-06T23:18:01.6174716Z    at org.apache.dubbo.common.timer.HashedWheelTimer$Worker.run(HashedWheelTimer.java:449)
2021-09-06T23:18:01.6175998Z    at java.base/java.lang.Thread.run(Thread.java:829)
2021-09-06T23:18:01.6177247Z Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: no further information: /10.1.0.234:2001
2021-09-06T23:18:01.6178649Z Caused by: java.net.ConnectException: Connection refused: no further information
2021-09-06T23:18:01.6179878Z    at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
2021-09-06T23:18:01.6181385Z    at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
2021-09-06T23:18:01.6183388Z    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
2021-09-06T23:18:01.6185632Z    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
2021-09-06T23:18:01.6187597Z    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:707)
2021-09-06T23:18:01.6189735Z    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
2021-09-06T23:18:01.6191923Z    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
2021-09-06T23:18:01.6193307Z    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
2021-09-06T23:18:01.6194818Z    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
2021-09-06T23:18:01.6196422Z    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
2021-09-06T23:18:01.6198282Z    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
2021-09-06T23:18:01.6199537Z    at java.base/java.lang.Thread.run(Thread.java:829)

The log shows the HashedWheelTimer worker thread kept calling expireTimeouts until 23:34, for 16 minutes straight:

2021-09-06T23:34:33.6234093Z org.apache.dubbo.remoting.RemotingException: client(url: dubbo://10.1.0.234:2001/org.apache.dubbo.metadata.MetadataService?codec=dubbo&connect.lazy.initial.state=true&dubbo=2.0.2&group=provider-app&heartbeat=60000&port=2001&protocol=dubbo&release=3.0.3-SNAPSHOT&send.reconnect=true&side=consumer&timeout=5000&version=1.0.0) failed to connect to server /10.1.0.234:2001, error message is:Connection refused: no further information: /10.1.0.234:2001
2021-09-06T23:34:33.6238443Z    at org.apache.dubbo.remoting.transport.netty4.NettyClient.doConnect(NettyClient.java:171)
2021-09-06T23:34:33.6241063Z    at org.apache.dubbo.remoting.transport.AbstractClient.connect(AbstractClient.java:202)
2021-09-06T23:34:33.6245115Z    at org.apache.dubbo.remoting.transport.AbstractClient.reconnect(AbstractClient.java:258)
2021-09-06T23:34:33.6248407Z    at org.apache.dubbo.remoting.exchange.support.header.HeaderExchangeClient.reconnect(HeaderExchangeClient.java:166)
2021-09-06T23:34:33.6252119Z    at org.apache.dubbo.remoting.exchange.support.header.ReconnectTimerTask.doTask(ReconnectTimerTask.java:49)
2021-09-06T23:34:33.6255676Z    at org.apache.dubbo.remoting.exchange.support.header.AbstractTimerTask.run(AbstractTimerTask.java:87)
2021-09-06T23:34:33.6259548Z    at org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:648)
2021-09-06T23:34:33.6265744Z    at org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:727)
2021-09-06T23:34:33.6267624Z    at org.apache.dubbo.common.timer.HashedWheelTimer$Worker.run(HashedWheelTimer.java:449)
2021-09-06T23:34:33.6268785Z    at java.base/java.lang.Thread.run(Thread.java:829)
2021-09-06T23:34:33.6269927Z Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: no further information: /10.1.0.234:2001
2021-09-06T23:34:33.6271355Z Caused by: java.net.ConnectException: Connection refused: no further information
2021-09-06T23:34:33.6272490Z    at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
2021-09-06T23:34:33.6273861Z    at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
2021-09-06T23:34:33.6275648Z    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
2021-09-06T23:34:33.6277619Z    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
2021-09-06T23:34:33.6279377Z    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:707)
2021-09-06T23:34:33.6281298Z    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
2021-09-06T23:34:33.6283215Z    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
2021-09-06T23:34:33.6284730Z    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
2021-09-06T23:34:33.6286294Z    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
2021-09-06T23:34:33.6287930Z    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
2021-09-06T23:34:33.6289575Z    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
2021-09-06T23:34:33.6290896Z    at java.base/java.lang.Thread.run(Thread.java:829)

The reason ReconnectTimerTask ran for 16 minutes without being cancelled is most likely that providerBootstrap.start() threw an exception:

2021-09-06T23:17:04.6300294Z [06/09/21 23:17:04:530 UTC] main ERROR config.ServiceConfig:  [DUBBO] Failed register interface application mapping for service org.apache.dubbo.config.api.DemoService:2.0, dubbo version: 3.0.3-SNAPSHOT, current host: 10.1.0.234
2021-09-06T23:17:04.6301602Z java.lang.RuntimeException
2021-09-06T23:17:04.6303393Z    at org.apache.dubbo.registry.client.metadata.MetadataServiceNameMapping.map(MetadataServiceNameMapping.java:88)
2021-09-06T23:17:04.6306097Z    at org.apache.dubbo.config.ServiceConfig.lambda$exported$1(ServiceConfig.java:270)
2021-09-06T23:17:04.6307317Z    at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
2021-09-06T23:17:04.6308654Z    at org.apache.dubbo.config.ServiceConfig.exported(ServiceConfig.java:266)
2021-09-06T23:17:04.6310056Z    at org.apache.dubbo.config.ServiceConfig.doExport(ServiceConfig.java:371)
2021-09-06T23:17:04.6311444Z    at org.apache.dubbo.config.ServiceConfig.export(ServiceConfig.java:244)
2021-09-06T23:17:04.6318288Z    at org.apache.dubbo.config.bootstrap.DubboBootstrap.exportServices(DubboBootstrap.java:1476)
2021-09-06T23:17:04.6320287Z    at org.apache.dubbo.config.bootstrap.DubboBootstrap.doStart(DubboBootstrap.java:1190)
2021-09-06T23:17:04.6322235Z    at org.apache.dubbo.config.bootstrap.DubboBootstrap.start(DubboBootstrap.java:1166)
2021-09-06T23:17:04.6325718Z    at org.apache.dubbo.config.bootstrap.DubboBootstrapMultiInstanceTest.testMultiModuleApplication(DubboBootstrapMultiInstanceTest.java:200)

This RuntimeException is then silently swallowed by ServiceConfig's exported method. PS: the DubboBootstrapMultiInstanceTest class lacks assertions; with assertions added it probably would not pass. For example, in testMultiModuleApplication the call to sayName actually produces: org.apache.dubbo.remoting.RemotingException: Not found exported service: provider-app/org.apache.dubbo.metadata.MetadataService:1.0.0:2002 in [], may be version or group mismatch

zrlw commented 3 years ago

Curator 4.0 has a hard dependency on ZooKeeper 3.5.x, but the Dubbo 3 unit tests still use 3.4.13.

I suggest creating a test branch and upgrading both zk and Curator; the ZooKeeper matching Curator 5.2.0 should be 3.6.3.

There are still a lot of users on the old 3.4.13 version; the community has considered this situation and also supports the higher versions.

You can see dubbo-remoting-zookeeper-curator5

Dubbo can select the right version depending on the deployment environment.

Do not use a forever retry policy; it can loop infinitely for "a while": https://issues.apache.org/jira/browse/CURATOR-325

If zk 3.4.13 is used, the Curator version has to drop to 2.13.0 (which has a pile of defects). dubbo-dependencies-bom pins the zk and Curator version numbers, so how could the built artifact still pick a version automatically based on the environment? Amazing. What we're discussing here is the long unit-test time, and the zk in question is the version the unit tests use; I think you may have misread the context.

pinxiong commented 3 years ago

Curator 4.0 has a hard dependency on ZooKeeper 3.5.x, but Dubbo 3 still uses 3.4.13.

I suggest creating a test branch and upgrading both zk and Curator; the ZooKeeper matching Curator 5.2.0 should be 3.6.3.

There are still a lot of users on the old 3.4.13 version; the community has considered this situation and also supports the higher versions. You can see dubbo-remoting-zookeeper-curator5. Dubbo can select the right version depending on the deployment environment.

Do not use a forever retry policy; it can loop infinitely for "a while": https://issues.apache.org/jira/browse/CURATOR-325

Yes, Dubbo uses RetryNTimes rather than RetryForever.
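For reference, a minimal Curator sketch of a bounded retry policy; the connect string and retry counts below are illustrative only, not Dubbo's actual configuration:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.RetryNTimes;

    public class BoundedRetryExample {
        public static void main(String[] args) {
            // Retry at most 3 times, 1s apart, instead of RetryForever,
            // which can keep a client spinning indefinitely (see CURATOR-325).
            CuratorFramework client = CuratorFrameworkFactory.builder()
                    .connectString("127.0.0.1:2181")
                    .retryPolicy(new RetryNTimes(3, 1000))
                    .build();
            client.start();
            // ... use the client ...
            client.close();
        }
    }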

pinxiong commented 3 years ago

A quick look at the master branch shows it also uses 4.0.1.

By the way, the following logic in the AbstractClusterInvokerTest test class is flawed:

        Double d = (Math.random() * (max - min + 1) + min);
        int runs = d.intValue();
        Assertions.assertTrue(runs > min);

Math.random() can return 0, so the assertion here should be >=. It happened to return 0 during a GitHub build, which made the build fail.

Yes, you can open a new PR to fix this problem

zrlw commented 3 years ago

RetryNTimes

Isn't the retry count configured for the unit tests a bit high? The build logs show retries routinely taking 5-6 minutes.

pinxiong commented 3 years ago

Isn't the retry count configured for the unit tests a bit high? The build logs show retries routinely taking 5-6 minutes.

I don't know where in the code you are seeing the unit-test retry count.

pinxiong commented 3 years ago

Curator 4.0 has a hard dependency on ZooKeeper 3.5.x, but the Dubbo 3 unit tests still use 3.4.13.

I suggest creating a test branch and upgrading both zk and Curator; the ZooKeeper matching Curator 5.2.0 should be 3.6.3.

There are still a lot of users on the old 3.4.13 version; the community has considered this situation and also supports the higher versions. You can see dubbo-remoting-zookeeper-curator5. Dubbo can select the right version depending on the deployment environment.

Do not use a forever retry policy; it can loop infinitely for "a while": https://issues.apache.org/jira/browse/CURATOR-325

If zk 3.4.13 is used, the Curator version has to drop to 2.13.0 (which has a pile of defects). dubbo-dependencies-bom pins the zk and Curator version numbers, so how could the built artifact still pick a version automatically based on the environment? Amazing. What we're discussing here is the long unit-test time, and the zk in question is the version the unit tests use; I think you may have misread the context.

You can see ZookeeperTransporter

zrlw commented 3 years ago

Curator 4.0 has a hard dependency on ZooKeeper 3.5.x, but the Dubbo 3 unit tests still use 3.4.13.

I suggest creating a test branch and upgrading both zk and Curator; the ZooKeeper matching Curator 5.2.0 should be 3.6.3.

There are still a lot of users on the old 3.4.13 version; the community has considered this situation and also supports the higher versions. You can see dubbo-remoting-zookeeper-curator5. Dubbo can select the right version depending on the deployment environment.

Do not use a forever retry policy; it can loop infinitely for "a while": https://issues.apache.org/jira/browse/CURATOR-325 If zk 3.4.13 is used, the Curator version has to drop to 2.13.0 (which has a pile of defects). dubbo-dependencies-bom pins the zk and Curator version numbers, so how could the built artifact still pick a version automatically based on the environment? Amazing. What we're discussing here is the long unit-test time, and the zk in question is the version the unit tests use; I think you may have misread the context.

You can see ZookeeperTransporter

Impressive, I learned something. Can the zk used for the GitHub unit tests be a higher version? From what the build logs show, the symptoms are the same as the earlier incompatibility between the Curator client version and zk.

pinxiong commented 3 years ago

Impressive, I learned something. Can the zk used for the GitHub unit tests be a higher version? From what the build logs show, the symptoms are the same as the earlier incompatibility between the Curator client version and zk.

I'm trying to use a higher version now, but it isn't working yet.

I updated the dependencies in dubbo-config-spring as below; you can give it a try.

    <properties>
        <skip_maven_deploy>false</skip_maven_deploy>
        <spring-boot.version>2.3.1.RELEASE</spring-boot.version>
        <curator5_version>5.1.0</curator5_version>
        <zookeeper_version>3.6.0</zookeeper_version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.dubbo</groupId>
            <artifactId>dubbo-config-api</artifactId>
            <version>${project.parent.version}</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-beans</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-context</artifactId>
        </dependency>
        <dependency>
            <groupId>javax.servlet</groupId>
            <artifactId>javax.servlet-api</artifactId>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>com.alibaba.spring</groupId>
            <artifactId>spring-context-support</artifactId>
        </dependency>

        <!-- Testing Dependencies -->
        <dependency>
            <groupId>org.aspectj</groupId>
            <artifactId>aspectjweaver</artifactId>
            <version>1.9.5</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.dubbo</groupId>
            <artifactId>dubbo-monitor-default</artifactId>
            <version>${project.parent.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.dubbo</groupId>
            <artifactId>dubbo-rpc-dubbo</artifactId>
            <version>${project.parent.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.dubbo</groupId>
            <artifactId>dubbo-rpc-triple</artifactId>
            <version>${project.parent.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.dubbo</groupId>
            <artifactId>dubbo-rpc-injvm</artifactId>
            <version>${project.parent.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.dubbo</groupId>
            <artifactId>dubbo-rpc-rest</artifactId>
            <version>${project.parent.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.dubbo</groupId>
            <artifactId>dubbo-remoting-netty4</artifactId>
            <version>${project.parent.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.dubbo</groupId>
            <artifactId>dubbo-serialization-hessian2</artifactId>
            <version>${project.parent.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>javax.validation</groupId>
            <artifactId>validation-api</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.hibernate</groupId>
            <artifactId>hibernate-validator</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.glassfish</groupId>
            <artifactId>javax.el</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-tx</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.tomcat.embed</groupId>
            <artifactId>tomcat-embed-core</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <version>${spring-boot.version}</version>
            <scope>test</scope>
            <exclusions>
                <exclusion>
                    <artifactId>junit-vintage-engine</artifactId>
                    <groupId>org.junit.vintage</groupId>
                </exclusion>
            </exclusions>
        </dependency>

        <!-- Zookeeper dependencies for testing -->
        <dependency>
            <groupId>org.apache.dubbo</groupId>
            <artifactId>dubbo-registry-zookeeper</artifactId>
            <version>${project.parent.version}</version>
            <scope>test</scope>
            <exclusions>
                <exclusion>
                    <groupId>org.apache.dubbo</groupId>
                    <artifactId>dubbo-remoting-zookeeper</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.apache.dubbo</groupId>
            <artifactId>dubbo-metadata-report-zookeeper</artifactId>
            <version>${project.parent.version}</version>
            <scope>test</scope>
            <exclusions>
                <exclusion>
                    <groupId>com.google.guava</groupId>
                    <artifactId>guava</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.apache.dubbo</groupId>
                    <artifactId>dubbo-remoting-zookeeper</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.apache.dubbo</groupId>
            <artifactId>dubbo-remoting-zookeeper-curator5</artifactId>
            <version>${project.parent.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.curator</groupId>
            <artifactId>curator-framework</artifactId>
            <version>${curator5_version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.curator</groupId>
            <artifactId>curator-recipes</artifactId>
            <version>${curator5_version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.curator</groupId>
            <artifactId>curator-x-discovery</artifactId>
            <version>${curator5_version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.zookeeper</groupId>
            <artifactId>zookeeper</artifactId>
            <version>${zookeeper_version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.curator</groupId>
            <artifactId>curator-test</artifactId>
            <version>${curator5_version}</version>
            <scope>test</scope>
            <exclusions>
                <exclusion>
                    <groupId>com.google.guava</groupId>
                    <artifactId>guava</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.apache.zookeeper</groupId>
                    <artifactId>zookeeper</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <!-- Nacos dependencies for testing -->
        <dependency>
            <groupId>org.apache.dubbo</groupId>
            <artifactId>dubbo-registry-nacos</artifactId>
            <version>${project.parent.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.dubbo</groupId>
            <artifactId>dubbo-configcenter-nacos</artifactId>
            <version>${project.parent.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.alibaba.nacos</groupId>
            <artifactId>nacos-client</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>

pinxiong commented 3 years ago

I have run all of the dubbo-config-spring unit tests several times; the average elapsed time is at least 8 minutes. Worse, a large number of errors are still reported, as shown below:

java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:715)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:342)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1262)
21:54:55.371 [main-SendThread(127.0.0.1:2181)] DEBUG org.apache.zookeeper.ClientCnxnSocketNIO - Ignoring exception during shutdown input
java.nio.channels.ClosedChannelException: null
    at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:778)
    at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:399)
    at org.apache.zookeeper.ClientCnxnSocketNIO.cleanup(ClientCnxnSocketNIO.java:191)
    at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1362)
    at org.apache.zookeeper.ClientCnxn$SendThread.cleanAndNotifyState(ClientCnxn.java:1303)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1281)
21:54:55.371 [main-SendThread(127.0.0.1:2181)] DEBUG org.apache.zookeeper.ClientCnxnSocketNIO - Ignoring exception during shutdown output
java.nio.channels.ClosedChannelException: null
    at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:795)
    at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:407)
    at org.apache.zookeeper.ClientCnxnSocketNIO.cleanup(ClientCnxnSocketNIO.java:196)
    at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1362)
    at org.apache.zookeeper.ClientCnxn$SendThread.cleanAndNotifyState(ClientCnxn.java:1303)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1281)

zrlw commented 3 years ago

That's strange, then. One notable difference from before, though: previously it fell over right at startup, whereas the Dubbo unit tests here repeat the failure N times.

zrlw commented 3 years ago

I wonder whether this is related to parallel testing. Many test classes call DubboBootstrap.reset() in AfterEach, but reset releases global resources, so running reset during parallel testing can affect other test methods running concurrently. I'll try changing forkMode to always in the parent pom.xml of my own fork and run a few builds to compare. Started here: https://github.com/zrlw/dubbo/actions/runs/1213871776

zrlw commented 3 years ago

  1. Pulled the latest 3.0 code and changed forkMode in the pom to always (note that forkMode is deprecated; the recommended settings are now forkCount and reuseForks).
  2. Modified the dubbo-config-spring pom, overwriting the relevant parts with the content you provided above. A local build behaves the same: KeeperErrorCode = ConnectionLoss and "Ignoring exception during shutdown input" still flood the screen, so we may really have hit a big one. Correction: I misread the screen earlier. The GitHub build log with forkMode=always in the pom does not contain those messages; the build fails with an assertion error while building dubbo-rpc-dubbo (not sure whether the always forkMode caused that) and never even reaches the dubbo-config-spring unit tests. The KeeperErrorCode = ConnectionLoss and "Ignoring exception during shutdown input" messages appear in the GitHub build log with forkMode=once.

zrlw commented 3 years ago

My understanding of forkMode was wrong: forkMode=once corresponds to forkCount=1 && reuseForks=true (one new process is created and reused for every test), while forkMode=always corresponds to forkCount=1 && reuseForks=false (a new process is created for each test). From the log, under forkMode=once, the DubboSaveMetadataReport thread was still working while AfterEach's reset was closing the zk connection:

<== the main thread begins the AfterEach reset
2021-09-06T23:17:04.3783873Z [06/09/21 23:17:04:369 UTC] main  INFO support.AbstractRegistryFactory:  [DUBBO] Close all registries [org.apache.dubbo.registry.client.ServiceDiscoveryRegistry@6faf0232, zookeeper://127.0.0.1:34385/org.apache.dubbo.registry.RegistryService?application=consumer-app&dubbo=2.0.2&environment=product&file.cache=false&interface=org.apache.dubbo.registry.RegistryService&pid=9244&release=3.0.3-SNAPSHOT], dubbo version: 3.0.3-SNAPSHOT, current host: 10.1.0.234  
2021-09-06T23:17:04.3788871Z [06/09/21 23:17:04:369 UTC] main  INFO zookeeper.ZookeeperRegistry:  [DUBBO] Destroy registry:zookeeper://127.0.0.1:34385/org.apache.dubbo.registry.RegistryService?application=consumer-app&dubbo=2.0.2&environment=product&file.cache=false&interface=org.apache.dubbo.registry.RegistryService&pid=9244&release=3.0.3-SNAPSHOT, dubbo version: 3.0.3-SNAPSHOT, current host: 10.1.0.234
<== DubboSaveMetadataReport is still working
2021-09-06T23:17:04.3798680Z [06/09/21 23:17:04:372 UTC] DubboSaveMetadataReport-thread-1 ERROR zookeeper.ZookeeperMetadataReport:  [DUBBO] Failed to put consumer metadata org.apache.dubbo.metadata.report.identifier.MetadataIdentifier@27cfcb4f;  org.apache.dubbo.common.url.component.URLParam$URLParamMap@c7a36a43, cause: Expected state [STARTED] was [STOPPED], dubbo version: 3.0.3-SNAPSHOT, current host: 10.1.0.234
2021-09-06T23:17:04.3801771Z java.lang.IllegalStateException: Expected state [STARTED] was [STOPPED]  
...
<== the main thread runs DubboBootstrap.start() for the next test method
2021-09-06T23:17:04.6300294Z [06/09/21 23:17:04:530 UTC] main ERROR config.ServiceConfig:  [DUBBO] Failed register interface application mapping for service org.apache.dubbo.config.api.DemoService:2.0, dubbo version: 3.0.3-SNAPSHOT, current host: 10.1.0.234
2021-09-06T23:17:04.6301602Z java.lang.RuntimeException   

If even once can't handle it, always certainly won't either.

pinxiong commented 3 years ago

My understanding of forkMode was wrong: forkMode=once corresponds to forkCount=1 && reuseForks=true (one new process is created and reused for every test), while forkMode=always corresponds to forkCount=1 && reuseForks=false (a new process is created for each test). From the log, under forkMode=once the AfterEach reset that closes the zk connection is actually running in parallel with the next test method; if even once can't handle it, always certainly won't either.

I have been looking at this problem too.

Dubbo currently supports only a single instance (multi-instance support is under development now); there would be chaos if we simply removed DubboBootstrap.reset() from AfterEach. But I think we can give it a try.

zrlw commented 3 years ago

The DubboSaveMetadataReport thread is still working while AfterEach runs DubboBootstrap.reset, which may break the reset's zk shutdown and in turn affect zk registration in subsequent tests.

zrlw commented 3 years ago

Is this the only way to handle it?

  1. In AfterEach, have DubboBootstrap.reset terminate all of Dubbo's background threads before releasing global resources.
  2. In BeforeEach, add a step that re-initializes Dubbo (see the sketch after this list).
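A minimal JUnit 5 sketch of what those two hooks could look like; note that terminating background threads inside reset() is the proposal above, not current Dubbo behavior, and the test class name is illustrative:

    import org.apache.dubbo.config.bootstrap.DubboBootstrap;
    import org.junit.jupiter.api.AfterEach;
    import org.junit.jupiter.api.BeforeEach;

    class SomeDubboConfigTest {

        @BeforeEach
        void reinitialize() {
            // re-initialize Dubbo before every test method
            DubboBootstrap.getInstance();
        }

        @AfterEach
        void tearDown() {
            // per the proposal: reset() would first stop background workers
            // such as DubboSaveMetadataReport, then release global resources
            DubboBootstrap.reset();
        }
    }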

pinxiong commented 3 years ago

Is this the only way to handle it?

  1. In AfterEach, have DubboBootstrap.reset terminate all of Dubbo's background threads before releasing global resources.
  2. In BeforeEach, add a step that re-initializes Dubbo.

This would take a lot of time, continuously creating and releasing resources.

zrlw commented 3 years ago

Then could we add pause and resume interfaces to classes like AbstractMetadataReport that do asynchronous work on background threads? DubboBootstrap.reset would call pause to temporarily stop those threads, and DubboBootstrap.start would call resume on those classes.
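A rough sketch of that idea, purely illustrative; the interface name and methods below are hypothetical and do not exist in Dubbo:

    // Hypothetical interface for components that run background threads;
    // e.g. AbstractMetadataReport could implement it.
    public interface Suspendable {

        // called from DubboBootstrap.reset(): temporarily stop async worker threads
        void pause();

        // called from DubboBootstrap.start(): restart the async worker threads
        void resume();
    }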

pinxiong commented 3 years ago

Then could we add pause and resume interfaces to classes like AbstractMetadataReport that do asynchronous work on background threads? DubboBootstrap.reset would call pause to temporarily stop those threads, and DubboBootstrap.start would call resume on those classes.

I think that would be a temporary solution; we should first work out how to manage the life cycle of resources and then come back to this problem.

One of the goals of multi-instance is to solve this problem. You can see

zrlw commented 3 years ago

Even an 80-minute timeout is no longer enough for "UT on windows-2019 without Integration Tests". Before the timeout, sometimes the log is flooded with Dubbo-framework-scheduler-thread-1 ERROR bootstrap.DubboBootstrap: [DUBBO] Refresh instance and metadata error ... Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss together with [main-SendThread(localhost:2181)] DEBUG org.apache.zookeeper.ClientCnxnSocketNIO - Ignoring exception during shutdown input java.nio.channels.ClosedChannelException: null; at other times it is flooded with main WARN client.ServiceDiscoveryRegistry: [DUBBO] Cannot find app mapping for service org.apache.dubbo.config.api.DemoService, will not migrate. java.lang.IllegalStateException: the metadata report was not initialized.

zrlw commented 3 years ago

The KeeperErrorCode = ConnectionLoss errors should again be caused by the conflict between the main thread closing zk in the AfterEach method and the DubboSaveMetadataReport thread, for example:

2021-09-06T23:44:55.1594610Z 23:44:53.968 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - shutting down
2021-09-06T23:44:55.1628141Z 23:44:54.072 [DubboSaveMetadataReport-thread-1] DEBUG org.apache.curator.RetryLoop - Retry-able exception received
2021-09-06T23:44:55.1630228Z org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /dubbo/metadata/org.apache.dubbo.config.spring.api.HelloService/demo/consumer

zrlw commented 3 years ago

To summarize the main problems found so far:

  1. Some test classes affect the results of subsequent test classes. After some test classes finish, the runtime environment is not restored, which affects the result of the next test class and can even make the next class's test methods throw exceptions. Take dubbo-config-api: running its JUnit tests on their own (mvn install first, then run them from the JUnit UI, which shows progress and which test method is failing more clearly than mvn test), ReferenceConfigTest affects the subsequent ServiceConfigTest, whose testServiceListener method throws an exception at the statement MockServiceListener mockServiceListener = (MockServiceListener) extensionLoader.getExtension("mock"); worse, it also makes the testDelayExport method hang outright, requiring the JUnit run to be terminated manually. This may be the root cause of why the UT build eventually blows past the timeout no matter how large we set it.

  2. Some test classes depend on other test classes to set up the environment first (honestly, is such a test class still a unit test?). For example, running DubboTelnetDecodeTest on its own gives: [ERROR] DubboTelnetDecodeTest.testDubboDecode:127 expected: <1> but was: <0> [ERROR] DubboTelnetDecodeTest.testDubboDubboDecoded:381 expected: <2> but was: <0> [ERROR] DubboTelnetDecodeTest.testDubboTelnetDecoded:452 expected: <2> but was: <1> [ERROR] DubboTelnetDecodeTest.testTelnetDubboDecoded:243 expected: <2> but was: <1>

  3. After each test method, calling DubboBootstrap.reset() or destroy() to release global resources (for example, closing the zk connection) conflicts with Dubbo daemon threads that are still running (such as DubboSaveMetadataReport), which may cause subsequent test methods to fail.

Solving problem 1 requires forkMode=always: a separate process per test class, with mvn test running each class in sequence to avoid environment pollution. Solving problem 2 requires forkMode=once: all test classes share one test process when mvn test runs them in sequence. So without modifying the test classes, problems 1 and 2 have no solution.

Problem 3 can be addressed by stopping the related Dubbo threads when releasing global resources, at the cost of a longer CI build; it's worth trying to see the effect.

pinxiong commented 3 years ago

Thanks for your summary! It's very important for the community to solve these issues.

For problems 1 and 2 above, I totally agree that we need to update the unit tests if we want to fix them. I also raised some of these problems at the community weekly meeting yesterday, and I can share the conclusions with you below.

Once all of that is done, I think the issues you raised above will improve a lot. Admittedly, problems 1 and 2 cannot be completely solved, so your summary will be very important at that point.

Are you interested in building the dubbo-mock module with us first?

zrlw commented 3 years ago

Looking forward to seeing 3.0.3 soon. If time permits next week, I can try building dubbo-mock.

zrlw commented 3 years ago

Looking back through the GitHub build logs, one thing stands out: the ubuntu builds rarely time out and mostly finish in about 40 minutes, while timeouts in the windows-2019 builds are getting more and more frequent. The timeouts are related to defects in the UT code itself and to the mismatch between AfterEach releasing global resources and Dubbo's resource life-cycle management, but why does ubuntu so rarely have problems? Update: from the build logs, the test-class execution order differs greatly between ubuntu and windows-2019. On ubuntu, mvn tests dubbo-config-api in the order MetadataServiceExporterTest, ServiceConfigTest, ProtocolConfigTest, and so on, whereas on Windows the order is AbstractConfigTest, AbstractMethodConfigTest, AbstractReferenceConfigTest, and so on (apparently alphabetical). "Default ordering used by this Maven plugin is filesystem, which means unit tests will execute in the order they are presented in the file system. Problem is that NTFS and Linux file systems do not provide the same ordering - NTFS provides alphabetical ordering of files. Unit tests should be independent of each other. The solution is to execute the tests in random order so that any dependencies between tests are found and can be removed." The dependencies among Dubbo's unit tests are the root cause of the build problems; latch.await hanging is only a symptom, and adding timeouts only fixes the hang, not the build failures. I suggest adding a runOrder setting in the pom and setting it to random, so we can quickly identify which test classes depend on each other and merge the dependent ones; that is probably better than adding a timeout mechanism.

<plugin>
   <groupId>org.apache.maven.plugins</groupId>
   <artifactId>maven-surefire-plugin</artifactId>
   <configuration>
      <runOrder>random</runOrder>
   </configuration>
</plugin>

PS: it's fairly certain that the code hanging the windows-2019 GitHub builds is countdownlatch.await(). Strictly speaking, unit-test code should never wait indefinitely; await should take a timeout so that if something goes wrong it doesn't hang the whole build job.
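A minimal sketch of the bounded-wait pattern; the timeout value and test name are illustrative:

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;
    import org.junit.jupiter.api.Assertions;
    import org.junit.jupiter.api.Test;

    class LatchTimeoutExample {

        @Test
        void waitsWithTimeout() throws InterruptedException {
            CountDownLatch latch = new CountDownLatch(1);
            // stand-in for the asynchronous work a real test would trigger
            new Thread(latch::countDown).start();
            // bounded wait: fail the test instead of hanging the whole build job
            Assertions.assertTrue(latch.await(30, TimeUnit.SECONDS),
                    "async work did not complete within 30s");
        }
    }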

pinxiong commented 3 years ago

Looking back through the GitHub build logs, one thing stands out: the ubuntu builds rarely time out and mostly finish in about 40 minutes, while timeouts in the windows-2019 builds are getting more and more frequent. The timeouts are related to defects in the UT code itself and to the mismatch between AfterEach releasing global resources and Dubbo's resource life-cycle management, but why does ubuntu so rarely have problems? We have now basically pinpointed the code causing the GitHub build timeouts as the countdownlatch.await() call. Strictly speaking, unit-test code should never wait indefinitely; await should take a timeout so that if something goes wrong it doesn't hang the whole build job. Why does the same code behave so differently on ubuntu and windows-2019; is there a difference in how the CountDownLatch mechanism is implemented on the two operating systems?

Maybe you're right! In any case, it's always correct to use a timeout with CountDownLatch#await.

I have now raised a new feature request to build the dubbo-mock module I mentioned above.

Would you like to contribute with me?

zrlw commented 3 years ago

Looking back through the GitHub build logs, one thing stands out: the ubuntu builds rarely time out and mostly finish in about 40 minutes, while timeouts in the windows-2019 builds are getting more and more frequent. The timeouts are related to defects in the UT code itself and to the mismatch between AfterEach releasing global resources and Dubbo's resource life-cycle management, but why does ubuntu so rarely have problems? We have now basically pinpointed the code causing the GitHub build timeouts as the countdownlatch.await() call. Strictly speaking, unit-test code should never wait indefinitely; await should take a timeout so that if something goes wrong it doesn't hang the whole build job. Why does the same code behave so differently on ubuntu and windows-2019; is there a difference in how the CountDownLatch mechanism is implemented on the two operating systems?

Maybe you're right! In any case, it's always correct to use a timeout with CountDownLatch#await.

I have now raised a new feature request to build the dubbo-mock module I mentioned above.

Would you like to contribute with me?

I'll give it a try.

zrlw commented 3 years ago

With runOrder set to alphabetical, the ubuntu build fails as well. I changed runOrder to random and removed --fail-fast from the build script, to get a first sense of roughly how many unit tests depend on each other. https://github.com/zrlw/dubbo/actions/runs/1225612496