alibaba / spring-cloud-alibaba

Spring Cloud Alibaba provides a one-stop solution for application development for the distributed solutions of Alibaba middleware.
https://sca.aliyun.com
Apache License 2.0

Configuring /actuator/health as the liveness probe causes abnormal pod restarts when the application's requests to Nacos fail #3535

Closed ZhXZhao closed 7 months ago

ZhXZhao commented 11 months ago


Which Component

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
            <version>2.3.12.RELEASE</version>
        </dependency>

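Describe the bug When a Kubernetes application uses /actuator/health as its liveness probe, a failed request from the application to Nacos causes the pod to be restarted abnormally, even though the application itself is healthy and can still accept external requests.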

To Reproduce Steps to reproduce the behavior:

  1. Start a simple Spring Cloud Alibaba application that includes the Nacos config, Nacos discovery, and actuator dependencies
        <dependency>
            <groupId>com.alibaba.cloud</groupId>
            <artifactId>spring-cloud-starter-alibaba-nacos-config</artifactId>
        </dependency>
        <dependency>
            <groupId>com.alibaba.cloud</groupId>
            <artifactId>spring-cloud-starter-alibaba-nacos-discovery</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
        </dependency>
  2. Visit http://localhost:8088/actuator/health
    {
    "status": "UP",
    "components": {
        "discoveryComposite": {
            "status": "UP",
            "components": {
                "discoveryClient": {
                    "status": "UP",
                    "details": {
                        "services": [
                            "provider"
                        ]
                    }
                }
            }
        },
        "diskSpace": {
            "status": "UP",
            "details": {
                "total": 499963174912,
                "free": 285202673664,
                "threshold": 10485760,
                "exists": true
            }
        },
        "nacosConfig": {
            "status": "UP"
        },
        "nacosDiscovery": {
            "status": "UP"
        },
        "ping": {
            "status": "UP"
        },
        "refreshScope": {
            "status": "UP"
        }
    }
    }
  3. Simulate a broken connection between the application and Nacos by adding a network whitelist policy, then visit http://localhost:8088/actuator/health again
    {
    "status": "DOWN",
    "components": {
        "discoveryComposite": {
            "status": "UP",
            "components": {
                "discoveryClient": {
                    "status": "UP",
                    "details": {
                        "services": []
                    }
                }
            }
        },
        "diskSpace": {
            "status": "UP",
            "details": {
                "total": 499963174912,
                "free": 284207394816,
                "threshold": 10485760,
                "exists": true
            }
        },
        "nacosConfig": {
            "status": "UP"
        },
        "nacosDiscovery": {
            "status": "DOWN"
        },
        "ping": {
            "status": "UP"
        },
        "refreshScope": {
            "status": "UP"
        }
    }
    }
  4. Actuator now reports the application's health status as "DOWN", but the application can in fact still serve external requests

Expected behavior When the connection between the SCA application and Nacos fails, the status of /actuator/health should not be "DOWN".

Additional context SCA Version 2.2.9.RELEASE

chickenlj commented 10 months ago

In general, the "Liveness" state should not be based on external checks, such as Health checks. If it did, a failing external system (a database, a Web API, an external cache) would trigger massive restarts and cascading failures across the platform.
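
To illustrate the point, here is a minimal sketch in plain Java of how a "worst status wins" aggregation behaves with the component statuses from the second response above. It is a simplified model, not Spring Boot's actual aggregator: the class name, the `aggregate` helper, and the choice of `ping` as the only in-process component are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of health aggregation; NOT Spring Boot's actual classes.
class HealthAggregationSketch {

    // The aggregated status is DOWN as soon as any included component reports DOWN.
    static String aggregate(Map<String, String> components, List<String> included) {
        for (String name : included) {
            if ("DOWN".equals(components.get(name))) {
                return "DOWN";
            }
        }
        return "UP";
    }

    public static void main(String[] args) {
        // Component statuses copied from the second /actuator/health response in this issue.
        Map<String, String> components = new LinkedHashMap<>();
        components.put("discoveryComposite", "UP");
        components.put("diskSpace", "UP");
        components.put("nacosConfig", "UP");
        components.put("nacosDiscovery", "DOWN");
        components.put("ping", "UP");
        components.put("refreshScope", "UP");

        // The full health view includes every component, so the Nacos outage leaks into it.
        List<String> all = new ArrayList<>(components.keySet());
        System.out.println("all components: " + aggregate(components, all));

        // A group limited to in-process state (hypothetical selection) ignores the outage.
        List<String> internalOnly = Collections.singletonList("ping");
        System.out.println("internal only: " + aggregate(components, internalOnly));
    }
}
```

With only in-process components in the group, an unreachable Nacos server no longer changes the result, which is the property a liveness check needs.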

chickenlj commented 10 months ago

I think there's no need to change the implementations of NacosConfigHealthIndicator and NacosDiscoveryHealthIndicator; they both work well for telling actuator/health the status of Nacos.

Spring Boot has provided Liveness and Readiness probes since 2.3.x, so users should always use /actuator/health/liveness and /actuator/health/readiness for container lifecycle checks instead of /actuator/health.
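
As a rough illustration of why the choice of endpoint matters, the sketch below models an HTTP liveness probe in plain Java. It is not kubelet or Spring Boot code. It relies on three assumptions about default behaviour: the health endpoint answers 503 for DOWN and 200 otherwise (only UP and DOWN are modelled), an httpGet probe treats status codes 200 to 399 as success, and the container is restarted after `failureThreshold` consecutive failures (3 by default). The class name, method names, and probe sequences are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of an HTTP liveness probe; NOT kubelet or Spring Boot code.
class LivenessProbeSketch {

    // Assumed default mapping: DOWN is served as 503, anything else as 200.
    static int httpStatusFor(String healthStatus) {
        return "DOWN".equals(healthStatus) ? 503 : 200;
    }

    // An httpGet probe counts any status code from 200 to 399 as a success.
    static boolean probeSucceeds(int httpStatus) {
        return httpStatus >= 200 && httpStatus < 400;
    }

    // Returns true once failureThreshold consecutive probes have failed.
    static boolean wouldRestart(List<String> observedStatuses, int failureThreshold) {
        int consecutiveFailures = 0;
        for (String status : observedStatuses) {
            if (probeSucceeds(httpStatusFor(status))) {
                consecutiveFailures = 0; // a success resets the failure streak
            } else if (++consecutiveFailures >= failureThreshold) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int failureThreshold = 3; // assumed Kubernetes default

        // Hypothetical probe results while Nacos is unreachable.
        List<String> fullHealth = Arrays.asList("UP", "DOWN", "DOWN", "DOWN");
        List<String> livenessGroup = Arrays.asList("UP", "UP", "UP", "UP");

        System.out.println("/actuator/health restart: "
                + wouldRestart(fullHealth, failureThreshold));
        System.out.println("/actuator/health/liveness restart: "
                + wouldRestart(livenessGroup, failureThreshold));
    }
}
```

In this model the probe against /actuator/health trips the threshold as soon as Nacos stays unreachable for three checks, while the probe against the liveness group does not, because that group does not depend on Nacos.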

yuluo-yx commented 10 months ago

Agree.

chickenlj commented 10 months ago

We need one article on sca.aliyun.com giving best practices for deploying Spring Cloud Alibaba on Kubernetes.

ZhXZhao commented 10 months ago

Thanks. Could you give some suggestions for users on Spring Boot versions lower than 2.3.x?

github-actions[bot] commented 8 months ago

This issue has been open 30 days with no activity. This will be closed in 7 days.

yuluo-yx commented 8 months ago

Hi @ZhXZhao, I have written up a best practice here. Do you have time to review it?

https://github.com/yuluo-yx/sca-k8s-demo/tree/openfeign

ZhXZhao commented 7 months ago

Okay, that's great