gardener / autoscaler

Customised fork of cluster-autoscaler to support machine-controller-manager
Apache License 2.0
16 stars 25 forks source link

Incorrect `cloudprovider` instances returned when fetching instances for machine deployments #307

Closed aaronfern closed 4 months ago

aaronfern commented 5 months ago

What happened:

func findMatchingInstance(nodes []*v1.Node, machine *v1alpha1.Machine) cloudprovider.Instance {
    for _, node := range nodes {
        if machine.Labels["node"] == node.Name {
            return cloudprovider.Instance{Id: node.Spec.ProviderID}
        }
    }
    // No k8s node found , one of the following cases possible
    //  - MCM is unable to fulfill the request to create VM.
    //  - VM is being created
    //  - the VM is up but has not registered yet

    // Report instance with a special placeholder ID so that the autoscaler can track it as an unregistered node.
    // Report InstanceStatus only for `ResourceExhausted` errors
    return cloudprovider.Instance{
        Id:     placeholderInstanceIDForMachineObj(machine.Name),
        Status: checkAndGetResourceExhaustedInstanceStatus(machine),
    }
}

The intention of the above function is to return a corresponding cloudprovider.Instance for a given machine. We have the following cases

  1. For machines that have already stated running, we would like to return an Instance with the corresponding node's ProviderID as it's ID. The Instance object here does not need a status as the corresponding node is already up and running and CA is not expected to take any action.
    1. For machines that are in the process of starting (those that do not have a corresponding node yet), we would need to return an Instance object that contains a placeholder as it's ID, and more importantly with the current machine objects status. This status is important as CA can take specific actions based on the current status specially if there are errors.

There is a case here where an empty status is returned in cases when the error is not ResourceExhausted. This causes an Instance object to be returned with a nil status. As a consequence CA can not take any action if there are some other errors in the machine objects status

What you expected to happen: In all cases where a machine does not have a corresponding node created yet, the returned Instance object must contain the status present in the machine.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know:

Environment: