apache / dubbo

The java implementation of Apache Dubbo. An RPC and microservice framework.
https://dubbo.apache.org/
Apache License 2.0

After all providers of a service go offline, the consumer does not release DubboInvoker, NettyClient, and related resources, but keeps trying to reconnect #9965

Closed saleson closed 2 years ago

saleson commented 2 years ago

Environment

Steps to reproduce this issue

1. Start the provider
2. Start the consumer
3. Make one RPC call
4. Stop the provider
5. Wait about 60 seconds

Pls. provide [GitHub address] to reproduce this issue.

Cause: the logic in ServiceDiscoveryRegistryDirectory.refreshInvoker() and RegistryDirectory.refreshInvoker() is wrong for this scenario (the order of operations is reversed).

            this.urlInvokerMap = newUrlInvokerMap;  // urlInvokerMap = newUrlInvokerMap = empty
            try {
                //call destroyAllInvokers()
                destroyUnusedInvokers(oldUrlInvokerMap, newUrlInvokerMap); // Close the unused Invoker
            } catch (Exception e) {
                logger.warn("destroyUnusedInvokers error. ", e);
            }

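destroyAllInvokers(), however, checks again whether urlInvokerMap is empty and then takes the invokers from urlInvokerMap in order to call destroyAll() on each of them. By that point urlInvokerMap is already empty, so the invokers are never destroyed.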

    protected void destroyAllInvokers() {
        Map<URL, Invoker<T>> localUrlInvokerMap = this.urlInvokerMap; // local reference
        if (!CollectionUtils.isEmptyMap(localUrlInvokerMap)) {
            for (Invoker<T> invoker : new ArrayList<>(localUrlInvokerMap.values())) {
                try {
                    invoker.destroyAll();
                } catch (Throwable t) {
                    logger.warn("Failed to destroy service " + serviceKey + " to provider " + invoker.getUrl(), t);
                }
            }
            localUrlInvokerMap.clear();
        }
        this.urlInvokerMap = null;
        this.cachedInvokerUrls = null;
        destroyInvokers();
    }
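
The ordering problem described above can be reproduced in isolation. The following is a minimal sketch, not Dubbo code: `RefreshOrderSketch`, `FakeInvoker`, `refreshBuggy`, and `refreshFixed` are hypothetical names, and the map is keyed by plain strings instead of `URL`. It assumes, as this report states, that the destroy step reads the field after the field has already been reassigned.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

/** Simplified model of the refresh order discussed above; not the real Dubbo classes. */
class RefreshOrderSketch {

    /** Hypothetical stand-in for an invoker; it only records whether destroy() was called. */
    static class FakeInvoker {
        boolean destroyed;

        void destroy() {
            destroyed = true;
        }
    }

    Map<String, FakeInvoker> urlInvokerMap = new HashMap<>();

    /** Reported order: reassign the field first, then destroy by reading the field. */
    void refreshBuggy(Map<String, FakeInvoker> newMap) {
        Map<String, FakeInvoker> oldMap = this.urlInvokerMap;
        this.urlInvokerMap = newMap;
        if (newMap.isEmpty()) {
            // Reads this.urlInvokerMap, which already points at the empty new map.
            destroyAllViaField();
        } else {
            destroyMissing(oldMap, newMap);
        }
    }

    /** Alternative order: always destroy from the captured reference to the old map. */
    void refreshFixed(Map<String, FakeInvoker> newMap) {
        Map<String, FakeInvoker> oldMap = this.urlInvokerMap;
        this.urlInvokerMap = newMap;
        destroyMissing(oldMap, newMap);
    }

    private void destroyAllViaField() {
        for (FakeInvoker invoker : new ArrayList<>(this.urlInvokerMap.values())) {
            invoker.destroy();
        }
    }

    /** Destroys every invoker whose key is present in oldMap but absent from newMap. */
    private static void destroyMissing(Map<String, FakeInvoker> oldMap,
                                       Map<String, FakeInvoker> newMap) {
        for (Map.Entry<String, FakeInvoker> entry : oldMap.entrySet()) {
            if (!newMap.containsKey(entry.getKey())) {
                entry.getValue().destroy();
            }
        }
    }
}
```

Calling `refreshBuggy` with an empty map leaves the previously registered `FakeInvoker` untouched, whereas `refreshFixed` destroys it. Whether the real fix should take this shape depends on the empty-address protection discussed in the comments below.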

Expected Behavior

The DubboInvoker is released.

Actual Behavior

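DubboInvoker, NettyClient, and related resources are not released. ReconnectTimerTask keeps checking the connection state and attempting to reconnect, and each attempt fails with an error (at 60-second intervals).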

If there is an exception, please attach the exception trace:

[24/04/22 21:18:37:430 CST] dubbo-client-idleCheck-thread-1 ERROR header.ReconnectTimerTask: [DUBBO] Fail to connect to HeaderExchangeClient [channel=org.apache.dubbo.remoting.transport.netty4.NettyClient [/192.168.1.7:57600 -> /192.168.1.7:20880]], dubbo version: 3.0.8-SNAPSHOT, current host: 192.168.1.7
org.apache.dubbo.remoting.RemotingException: client(url: dubbo://192.168.1.7:20880/org.apache.dubbo.springboot.demo.DemoService?anyhost=true&application=dubbo-springboot-demo-consumer&background=false&category=providers,configurators,routers&check=false&codec=dubbo&deprecated=false&dubbo=2.0.2&dynamic=true&generic=false&heartbeat=60000&interface=org.apache.dubbo.springboot.demo.DemoService&methods=sayHello,sayHelloAsync&pid=18417&qos.enable=false&register-mode=interface&release=3.0.8-SNAPSHOT&side=consumer&sticky=false) failed to connect to server /192.168.1.7:20880, error message is:Connection refused: /192.168.1.7:20880
    at org.apache.dubbo.remoting.transport.netty4.NettyClient.doConnect(NettyClient.java:192)
    at org.apache.dubbo.remoting.transport.AbstractClient.connect(AbstractClient.java:214)
    at org.apache.dubbo.remoting.transport.AbstractClient.reconnect(AbstractClient.java:268)
    at org.apache.dubbo.remoting.exchange.support.header.HeaderExchangeClient.reconnect(HeaderExchangeClient.java:171)
    at org.apache.dubbo.remoting.exchange.support.header.ReconnectTimerTask.doTask(ReconnectTimerTask.java:49)
    at org.apache.dubbo.remoting.exchange.support.header.AbstractTimerTask.run(AbstractTimerTask.java:87)
    at org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:651)
    at org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:730)
    at org.apache.dubbo.common.timer.HashedWheelTimer$Worker.run(HashedWheelTimer.java:452)
    at java.lang.Thread.run(Thread.java:750)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.1.7:20880
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:707)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:750)

AlbumenJ commented 2 years ago

If the registry pushes an empty address, the following branch is taken and the invokers are destroyed:

    private void refreshInvoker(List<URL> invokerUrls) {
        Assert.notNull(invokerUrls, "invokerUrls should not be null, use EMPTY url to clear current addresses.");
        this.originalUrls = invokerUrls;

        if (invokerUrls.size() == 1 && EMPTY_PROTOCOL.equals(invokerUrls.get(0).getProtocol())) {
            logger.warn("Received url with EMPTY protocol, will clear all available addresses.");
            this.forbidden = true; // Forbid to access
            routerChain.setInvokers(BitList.emptyList());
            destroyAllInvokers(); // Close all invokers
        }
    }

Even when the converted subscription result is empty, execution takes the early return here and nothing is destroyed:

            if (CollectionUtils.isEmptyMap(newUrlInvokerMap)) {
                logger.error(new IllegalStateException("Cannot create invokers from url address list (total " + invokerUrls.size() + ")"));
                return;
            }

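This most likely triggered the registry's empty-address protection. When the address list from the registry is empty, the previous non-empty result is used instead, which avoids jitter caused by the registry being unavailable. The protection can be turned off with the following configuration: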

RegistryConfig.java
    @Parameter(key = ENABLE_EMPTY_PROTECTION_KEY)
    public Boolean getEnableEmptyProtection() {
        return enableEmptyProtection;
    }

    public void setEnableEmptyProtection(Boolean enableEmptyProtection) {
        this.enableEmptyProtection = enableEmptyProtection;
    }
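
To make the effect of this switch concrete, here is a minimal sketch of the behaviour as described in this comment, not the actual Dubbo implementation. `EmptyProtectionSketch` and `onNotify` are hypothetical names, and addresses are modelled as plain strings; only `enableEmptyProtection` corresponds to the property shown above.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model of empty-address protection; names other than the flag are assumptions. */
class EmptyProtectionSketch {
    private final boolean enableEmptyProtection;
    private List<String> currentAddresses = new ArrayList<>();

    EmptyProtectionSketch(boolean enableEmptyProtection) {
        this.enableEmptyProtection = enableEmptyProtection;
    }

    /** Applies a pushed address list and returns the addresses in effect afterwards. */
    List<String> onNotify(List<String> pushed) {
        if (pushed.isEmpty() && enableEmptyProtection && !currentAddresses.isEmpty()) {
            // Protection on: ignore the empty push and keep the previous non-empty result.
            return currentAddresses;
        }
        // Protection off, or a non-empty push: the pushed list replaces the current one.
        currentAddresses = new ArrayList<>(pushed);
        return currentAddresses;
    }
}
```

With protection enabled, an empty push after a non-empty one leaves the old addresses in place, which matches the reconnect loop reported in this issue. With protection disabled, the list becomes empty and the invokers can be destroyed. In Dubbo itself the switch would be set through the `setEnableEmptyProtection(false)` setter shown above.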

saleson commented 2 years ago

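I think empty-address protection against unpredictable jitter is a great idea. I would still suggest some refinement so that a normal shutdown is not mistaken for jitter and ignored, and ignored permanently. The following is only my humble opinion: for example, a follow-up monitoring or detection task could be added, and if the registry stays healthy for a given time window while there really are no providers, the destroy logic runs. Alternatively, the decision could be based on the proportion by which the instance count drops.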
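
The first suggestion could look roughly like the sketch below. It is purely illustrative: `DelayedDestroyPolicy`, the grace period, and the `registryHealthy` flag are assumptions made for this example, and nothing like it is claimed to exist in Dubbo. The idea is to remember when the address list first became empty and to allow destruction only if it has stayed empty, with a healthy registry, for the whole grace period.

```java
/** Hypothetical policy: destroy only after the provider list has stayed empty long enough. */
class DelayedDestroyPolicy {
    private final long gracePeriodMillis;
    private long emptySinceMillis = -1; // -1 means no empty period is currently being tracked

    DelayedDestroyPolicy(long gracePeriodMillis) {
        this.gracePeriodMillis = gracePeriodMillis;
    }

    /** Called on each periodic check; returns true once invokers may be destroyed. */
    boolean shouldDestroy(int providerCount, boolean registryHealthy, long nowMillis) {
        if (providerCount > 0 || !registryHealthy) {
            // Providers are back, or the registry itself is suspect: treat it as jitter and reset.
            emptySinceMillis = -1;
            return false;
        }
        if (emptySinceMillis < 0) {
            emptySinceMillis = nowMillis; // first observation of a healthy but empty registry
        }
        return nowMillis - emptySinceMillis >= gracePeriodMillis;
    }
}
```

A periodic task could call `shouldDestroy` with the current provider count and a registry health signal. Passing the clock value in as a parameter keeps the sketch easy to test; how registry health would actually be determined is left open here.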

chickenlj commented 2 years ago

so that a normal shutdown is not mistaken for jitter and ignored, and ignored permanently.

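This kind of empty push only occurs in the extreme case where every address has gone offline, which should not happen in normal production cluster practice. If you really need this behaviour, you can simply turn off empty-push protection with the switch.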

saleson commented 2 years ago

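Turning off empty-push protection amounts to reducing fault tolerance. An empty push is very unlikely in production, but if one does happen it can cause an outage. I assume that is why Dubbo introduced empty-push protection in the first place. Besides, this scenario does occur in production: when a Dubbo interface is taken offline, for example, any consumer that still references it should run into this situation.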