gardener / machine-controller-manager

Declarative way of managing machines for Kubernetes cluster
Apache License 2.0

Check instance reachable status in machine-controller-manager while checking new machine joining machine deployment #729

Open neo-liang-sap opened 2 years ago

neo-liang-sap commented 2 years ago

How to categorize this issue?
/area control-plane
/kind enhancement
/priority 3

What would you like to be added:

In AWS, an instance is sometimes running but not reachable. The AWS CLI has a command to check this reachability status:

aws ec2 describe-instance-status --instance-ids i-01e71990bfe658adc
{
    "InstanceStatuses": [
        {
            "AvailabilityZone": "eu-central-1a",
            "InstanceId": "i-01e71990bfe658adc",
            "InstanceState": {
                "Code": 16,
                "Name": "running"
            },
            "InstanceStatus": {
                "Details": [
                    {
                        "ImpairedSince": "2022-06-21T06:28:00+00:00",
                        "Name": "reachability",
                        "Status": "failed"
                    }
                ],
                "Status": "impaired"
            },
            "SystemStatus": {
                "Details": [
                    {
                        "Name": "reachability",
                        "Status": "passed"
                    }
                ],
                "Status": "ok"
            }
        }
    ]
}

This instance is running but not reachable: the instance status check reports reachability as failed, while the system status check passes.
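For illustration, here is a minimal Go sketch of how output like the above could be inspected for running instances with a failed reachability check. It only parses JSON text with the standard library and does not call AWS. The type and function names (describeOutput, unreachableRunningInstances, and so on) are hypothetical and not part of MCM or the AWS SDK; the sample is a trimmed copy of the output shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical types: only the fields needed here are modelled,
// named after the keys in the describe-instance-status output above.
type statusDetail struct {
	Name   string `json:"Name"`
	Status string `json:"Status"`
}

type statusSummary struct {
	Details []statusDetail `json:"Details"`
	Status  string         `json:"Status"`
}

type instanceStatus struct {
	InstanceID    string `json:"InstanceId"`
	InstanceState struct {
		Name string `json:"Name"`
	} `json:"InstanceState"`
	InstanceStatus statusSummary `json:"InstanceStatus"`
	SystemStatus   statusSummary `json:"SystemStatus"`
}

type describeOutput struct {
	InstanceStatuses []instanceStatus `json:"InstanceStatuses"`
}

// reachabilityFailed reports whether a status block contains a
// "reachability" detail whose status is "failed".
func reachabilityFailed(s statusSummary) bool {
	for _, d := range s.Details {
		if d.Name == "reachability" && d.Status == "failed" {
			return true
		}
	}
	return false
}

// unreachableRunningInstances returns the IDs of instances that are in the
// "running" state but have a failed instance or system reachability check.
func unreachableRunningInstances(raw []byte) ([]string, error) {
	var out describeOutput
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, fmt.Errorf("parsing describe-instance-status output: %w", err)
	}
	var ids []string
	for _, st := range out.InstanceStatuses {
		if st.InstanceState.Name != "running" {
			continue
		}
		if reachabilityFailed(st.InstanceStatus) || reachabilityFailed(st.SystemStatus) {
			ids = append(ids, st.InstanceID)
		}
	}
	return ids, nil
}

// sample is a trimmed copy of the CLI output from this issue.
const sample = `{
  "InstanceStatuses": [
    {
      "InstanceId": "i-01e71990bfe658adc",
      "InstanceState": {"Code": 16, "Name": "running"},
      "InstanceStatus": {
        "Details": [{"Name": "reachability", "Status": "failed"}],
        "Status": "impaired"
      },
      "SystemStatus": {
        "Details": [{"Name": "reachability", "Status": "passed"}],
        "Status": "ok"
      }
    }
  ]
}`

func main() {
	ids, err := unreachableRunningInstances([]byte(sample))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, id := range ids {
		fmt.Printf("%s is running but not reachable\n", id)
	}
}
```

A real check inside MCM would presumably query the provider API rather than parse CLI output; the sketch only shows which fields carry the signal.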

Would it be possible to add a check in MCM for whether the instance is reachable?

Why is this needed:

To better understand the process of a machine joining the cluster. For example, sometimes a machine is created, then deleted by MCM after 20 minutes, and another one is created in its place.

CC @dguendisch

gardener-robot commented 2 years ago

@neo-liang-sap Label area/todo does not exist.

himanshu-kun commented 2 years ago

Yes, we will work on adding such a feature. Some research is required first to see whether other providers also expose this kind of networking information for an instance directly.

himanshu-kun commented 1 year ago

Post Grooming discussion

We need to enhance the driver method GetMachineStatus to also perform checks such as the reachability check mentioned above, and enhance GetMachineStatusResponse to carry the result of the check. We should then update the error in the machine status to reflect that result, so that it propagates to the status of the higher-level controllers and is shown in the dashboard for the user to see.
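As a rough sketch of that idea, and not the actual MCM driver API, the response could carry an optional reachability result that is then turned into an error for the machine status. All type, field, and function names below (getMachineStatusResponse, reachabilityCheck, Reachability, statusError) are hypothetical stand-ins chosen for illustration only.

```go
package main

import "fmt"

// reachabilityCheck is a hypothetical result of a provider-side check.
type reachabilityCheck struct {
	Passed bool
	Detail string // provider-specific explanation, e.g. which check failed
}

// getMachineStatusResponse is a simplified, hypothetical stand-in for the
// driver's GetMachineStatusResponse. Reachability is the proposed addition;
// it is nil when the provider cannot report reachability.
type getMachineStatusResponse struct {
	ProviderID   string
	NodeName     string
	Reachability *reachabilityCheck
}

// statusError converts a failed reachability check into an error that a
// controller could record in the machine status. It returns nil when the
// check passed or when no reachability information is available.
func statusError(resp getMachineStatusResponse) error {
	if resp.Reachability == nil || resp.Reachability.Passed {
		return nil
	}
	return fmt.Errorf("machine %q is running but not reachable: %s",
		resp.NodeName, resp.Reachability.Detail)
}

func main() {
	resp := getMachineStatusResponse{
		ProviderID: "example-provider-id", // placeholder value
		NodeName:   "example-node",        // placeholder value
		Reachability: &reachabilityCheck{
			Passed: false,
			Detail: "instance reachability check failed",
		},
	}
	if err := statusError(resp); err != nil {
		fmt.Println(err)
	}
}
```

Keeping the reachability field optional would let providers that do not expose such information keep returning a response without it.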