gardener / machine-controller-manager

Declarative way of managing machines for Kubernetes cluster

Orphan instance: InvalidInstanceAttributeValue: No device is currently mapped at /dev/sda1 #445

Closed ialidzhikov closed 4 years ago

ialidzhikov commented 4 years ago

What happened: We hit a case in which an EC2 instance is leaked and blocks the Shoot deletion (the instance blocks deletion of the subnet for the nodes and of the security group for the nodes).

What you expected to happen: No orphan EC2 instance is left behind and the Shoot deletion is not blocked.

How to reproduce it (as minimally and precisely as possible):

Expand the details to see all of the logs:

``` April 13th 2020, 11:52:01.822 FLAG: --leader-elect-resource-lock="endpoints" I April 13th 2020, 11:52:01.822 FLAG: --machine-creation-timeout="20m0s" I April 13th 2020, 11:52:01.822 FLAG: --machine-safety-apiserver-statuscheck-period="1m0s" I April 13th 2020, 11:52:01.822 FLAG: --v="3" I April 13th 2020, 11:52:01.822 FLAG: --concurrent-syncs="5" I April 13th 2020, 11:52:01.822 FLAG: --contention-profiling="false" I April 13th 2020, 11:52:01.822 FLAG: --kube-api-content-type="application/vnd.kubernetes.protobuf" I April 13th 2020, 11:52:01.822 FLAG: --leader-elect-retry-period="2s" I April 13th 2020, 11:52:01.822 FLAG: --log-flush-frequency="5s" I April 13th 2020, 11:52:01.822 FLAG: --machine-safety-apiserver-statuscheck-timeout="30s" I April 13th 2020, 11:52:01.822 FLAG: --safety-down="1" I April 13th 2020, 11:52:01.822 FLAG: --skip-headers="false" I April 13th 2020, 11:52:01.822 FLAG: --stderrthreshold="2" I April 13th 2020, 11:52:01.822 FLAG: --add-dir-header="false" I April 13th 2020, 11:52:01.822 FLAG: --cloud-provider="" I April 13th 2020, 11:52:01.822 FLAG: --leader-elect="true" I April 13th 2020, 11:52:01.822 FLAG: --leader-elect-renew-deadline="10s" I April 13th 2020, 11:52:01.822 FLAG: --logtostderr="true" I April 13th 2020, 11:52:01.822 FLAG: --node-conditions="KernelDeadlock,ReadonlyFilesystem,DiskPressure" I April 13th 2020, 11:52:01.822 FLAG: --profiling="true" I April 13th 2020, 11:52:01.822 FLAG: --safety-up="2" I April 13th 2020, 11:52:01.822 FLAG: --vmodule="" I April 13th 2020, 11:52:01.822 FLAG: --address="0.0.0.0" I April 13th 2020, 11:52:01.822 FLAG: --alsologtostderr="false" I April 13th 2020, 11:52:01.822 FLAG: --control-kubeconfig="inClusterConfig" I April 13th 2020, 11:52:01.822 FLAG: --kube-api-burst="30" I April 13th 2020, 11:52:01.822 FLAG: --kube-api-qps="20" I April 13th 2020, 11:52:01.822 FLAG: --leader-elect-lease-duration="15s" I April 13th 2020, 11:52:01.822 FLAG: --log-dir="" I April 13th 2020, 11:52:01.822 FLAG: --log-file="" I April 13th 2020, 11:52:01.822 FLAG: --log-file-max-size="1800" I April 13th 2020, 11:52:01.822 FLAG: --machine-max-evict-retries="2160" I April 13th 2020, 11:52:01.822 FLAG: --machine-pv-detach-timeout="2m0s" I April 13th 2020, 11:52:01.822 FLAG: --namespace="shoot--it--tmdtw-w19" I April 13th 2020, 11:52:01.822 FLAG: --skip-log-headers="false" I April 13th 2020, 11:52:01.822 FLAG: --target-kubeconfig="/var/lib/machine-controller-manager/kubeconfig" I April 13th 2020, 11:52:01.822 FLAG: --controller-start-interval="0s" I April 13th 2020, 11:52:01.822 FLAG: --log-backtrace-at=":0" I April 13th 2020, 11:52:01.822 FLAG: --machine-drain-timeout="2h0m0s" I April 13th 2020, 11:52:01.822 FLAG: --machine-health-timeout="10m0s" I April 13th 2020, 11:52:01.822 FLAG: --machine-safety-orphan-vms-period="30m0s" I April 13th 2020, 11:52:01.822 FLAG: --machine-safety-overshooting-period="1m0s" I April 13th 2020, 11:52:01.822 FLAG: --min-resync-period="12h0m0s" I April 13th 2020, 11:52:01.822 FLAG: --port="10258" I April 13th 2020, 11:52:01.823 Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work. W April 13th 2020, 11:52:01.921 attempting to acquire leader lease shoot--it--tmdtw-w19/machine-controller-manager... 
I April 13th 2020, 11:52:19.552 successfully acquired lease shoot--it--tmdtw-w19/machine-controller-manager I April 13th 2020, 11:52:19.552 Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"shoot--it--tmdtw-w19", Name:"machine-controller-manager", UID:"16ca5ca6-18e4-40f1-a9a5-5782b9a6586d", APIVersion:"v1", ResourceVersion:"2883256968", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' machine-controller-manager-558f7cc94c-cl2xq became leader I April 13th 2020, 11:52:20.577 Starting shared informers I April 13th 2020, 11:52:20.578 Starting reflector *v1alpha1.MachineDeployment (12h0m0s) from github.com/gardener/machine-controller-manager/pkg/client/informers/externalversions/factory.go:117 I April 13th 2020, 11:52:20.578 Listing and watching *v1alpha1.GCPMachineClass from github.com/gardener/machine-controller-manager/pkg/client/informers/externalversions/factory.go:117 I April 13th 2020, 11:52:20.578 Starting reflector *v1alpha1.AlicloudMachineClass (12h0m0s) from github.com/gardener/machine-controller-manager/pkg/client/informers/externalversions/factory.go:117 I April 13th 2020, 11:52:20.578 Starting reflector *v1alpha1.PacketMachineClass (12h0m0s) from github.com/gardener/machine-controller-manager/pkg/client/informers/externalversions/factory.go:117 I April 13th 2020, 11:52:20.578 Listing and watching *v1alpha1.PacketMachineClass from github.com/gardener/machine-controller-manager/pkg/client/informers/externalversions/factory.go:117 I April 13th 2020, 11:52:20.578 Listing and watching *v1alpha1.MachineDeployment from github.com/gardener/machine-controller-manager/pkg/client/informers/externalversions/factory.go:117 I April 13th 2020, 11:52:20.578 Listing and watching *v1alpha1.AlicloudMachineClass from github.com/gardener/machine-controller-manager/pkg/client/informers/externalversions/factory.go:117 I April 13th 2020, 11:52:20.578 Starting reflector *v1alpha1.GCPMachineClass (12h0m0s) from github.com/gardener/machine-controller-manager/pkg/client/informers/externalversions/factory.go:117 I April 13th 2020, 11:52:20.720 Starting reflector *v1.Node (12h0m0s) from k8s.io/client-go/informers/factory.go:134 I April 13th 2020, 11:52:20.720 Listing and watching *v1alpha1.AWSMachineClass from github.com/gardener/machine-controller-manager/pkg/client/informers/externalversions/factory.go:117 I April 13th 2020, 11:52:20.720 Starting reflector *v1alpha1.AWSMachineClass (12h0m0s) from github.com/gardener/machine-controller-manager/pkg/client/informers/externalversions/factory.go:117 I April 13th 2020, 11:52:20.720 Listing and watching *v1.Node from k8s.io/client-go/informers/factory.go:134 I April 13th 2020, 11:52:20.820 Starting reflector *v1alpha1.AzureMachineClass (12h0m0s) from github.com/gardener/machine-controller-manager/pkg/client/informers/externalversions/factory.go:117 I April 13th 2020, 11:52:20.820 Listing and watching *v1.PersistentVolumeClaim from k8s.io/client-go/informers/factory.go:134 I April 13th 2020, 11:52:20.820 Listing and watching *v1alpha1.OpenStackMachineClass from github.com/gardener/machine-controller-manager/pkg/client/informers/externalversions/factory.go:117 I April 13th 2020, 11:52:20.820 Starting reflector *v1alpha1.Machine (12h0m0s) from github.com/gardener/machine-controller-manager/pkg/client/informers/externalversions/factory.go:117 I April 13th 2020, 11:52:20.820 Listing and watching *v1alpha1.Machine from github.com/gardener/machine-controller-manager/pkg/client/informers/externalversions/factory.go:117 I April 13th 2020, 11:52:20.820 Starting 
reflector *v1.PersistentVolume (12h0m0s) from k8s.io/client-go/informers/factory.go:134 I April 13th 2020, 11:52:20.820 Starting reflector *v1.PersistentVolumeClaim (12h0m0s) from k8s.io/client-go/informers/factory.go:134 I April 13th 2020, 11:52:20.820 Listing and watching *v1alpha1.MachineSet from github.com/gardener/machine-controller-manager/pkg/client/informers/externalversions/factory.go:117 I April 13th 2020, 11:52:20.820 Listing and watching *v1.PersistentVolume from k8s.io/client-go/informers/factory.go:134 I April 13th 2020, 11:52:20.820 Listing and watching *v1alpha1.AzureMachineClass from github.com/gardener/machine-controller-manager/pkg/client/informers/externalversions/factory.go:117 I April 13th 2020, 11:52:20.820 Listing and watching *v1.Secret from k8s.io/client-go/informers/factory.go:134 I April 13th 2020, 11:52:20.820 Starting reflector *v1alpha1.MachineSet (12h0m0s) from github.com/gardener/machine-controller-manager/pkg/client/informers/externalversions/factory.go:117 I April 13th 2020, 11:52:20.820 Starting reflector *v1.Secret (12h0m0s) from k8s.io/client-go/informers/factory.go:134 I April 13th 2020, 11:52:20.820 Starting reflector *v1alpha1.OpenStackMachineClass (12h0m0s) from github.com/gardener/machine-controller-manager/pkg/client/informers/externalversions/factory.go:117 I April 13th 2020, 11:52:21.877 Starting machine-controller-manager I April 13th 2020, 11:52:21.920 Processing the machinedeployment "shoot--it--tmdtw-w19-worker-1-z1" (with replicas 2) I April 13th 2020, 11:53:54.681 Processing the machinedeployment "shoot--it--tmdtw-w19-worker-1-z1" (with replicas 3) I April 13th 2020, 11:53:54.705 Event(v1.ObjectReference{Kind:"MachineDeployment", Namespace:"shoot--it--tmdtw-w19", Name:"shoot--it--tmdtw-w19-worker-1-z1", UID:"9b86950e-15af-45ea-8fd8-dbaf1d6968e7", APIVersion:"machine.sapcloud.io/v1alpha1", ResourceVersion:"2883263232", FieldPath:""}): type: 'Normal' reason: 'ScalingMachineSet' Scaled up machine set shoot--it--tmdtw-w19-worker-1-z1-5b88647f77 to 3 I April 13th 2020, 11:53:54.764 Too few replicas for MachineSet shoot--it--tmdtw-w19-worker-1-z1-5b88647f77, need 3, creating 1 I April 13th 2020, 11:53:54.794 Processing the machinedeployment "shoot--it--tmdtw-w19-worker-1-z1" (with replicas 3) I April 13th 2020, 11:53:54.794 Controller shoot--it--tmdtw-w19-worker-1-z1-5b88647f77 created machine shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj I April 13th 2020, 11:53:54.820 Event(v1.ObjectReference{Kind:"MachineSet", Namespace:"shoot--it--tmdtw-w19", Name:"shoot--it--tmdtw-w19-worker-1-z1-5b88647f77", UID:"ccc83289-ef57-435a-abf5-81c43c66b495", APIVersion:"machine.sapcloud.io/v1alpha1", ResourceVersion:"2883263237", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created Machine: shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj I April 13th 2020, 11:53:54.843 Too few replicas for MachineSet shoot--it--tmdtw-w19-worker-1-z1-5b88647f77, need 3, creating 1 I April 13th 2020, 11:53:54.925 Processing the machinedeployment "shoot--it--tmdtw-w19-worker-1-z1" (with replicas 3) I April 13th 2020, 11:53:54.925 Event(v1.ObjectReference{Kind:"MachineSet", Namespace:"shoot--it--tmdtw-w19", Name:"shoot--it--tmdtw-w19-worker-1-z1-5b88647f77", UID:"ccc83289-ef57-435a-abf5-81c43c66b495", APIVersion:"machine.sapcloud.io/v1alpha1", ResourceVersion:"2883263247", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created Machine: shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-zkzkq I April 13th 2020, 11:53:54.925 Controller 
shoot--it--tmdtw-w19-worker-1-z1-5b88647f77 created machine shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-zkzkq I April 13th 2020, 11:53:54.948 Processing the machinedeployment "shoot--it--tmdtw-w19-worker-1-z1" (with replicas 3) I April 13th 2020, 11:53:55.043 Too many replicas for shoot--it--tmdtw-w19/shoot--it--tmdtw-w19-worker-1-z1-5b88647f77, need 3, deleting 1 I April 13th 2020, 11:53:55.043 Controller shoot--it--tmdtw-w19-worker-1-z1-5b88647f77 deleting machine shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj I April 13th 2020, 11:53:55.067 Event(v1.ObjectReference{Kind:"MachineSet", Namespace:"shoot--it--tmdtw-w19", Name:"shoot--it--tmdtw-w19-worker-1-z1-5b88647f77", UID:"ccc83289-ef57-435a-abf5-81c43c66b495", APIVersion:"machine.sapcloud.io/v1alpha1", ResourceVersion:"2883263247", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted machine: shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj I April 13th 2020, 11:53:55.087 Warning: Updated failed, retrying, error: "Operation cannot be fulfilled on machines.machine.sapcloud.io \"shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj\": StorageError: invalid object, Code: 4, Key: /registry/machine.sapcloud.io/machines/shoot--it--tmdtw-w19/shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 7f9f230a-5492-4056-94f6-29b563569d55, UID in object meta: " W April 13th 2020, 11:53:55.088 Creating machine "shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-zkzkq", please wait! I April 13th 2020, 11:53:55.089 Delete machine from machineset "shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj" I April 13th 2020, 11:53:55.120 Creating machine "shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj", please wait! I April 13th 2020, 11:53:55.220 Processing the machinedeployment "shoot--it--tmdtw-w19-worker-1-z1" (with replicas 3) I April 13th 2020, 11:53:55.266 Processing the machinedeployment "shoot--it--tmdtw-w19-worker-1-z1" (with replicas 3) I April 13th 2020, 11:53:56.396 Created/Adopted machine: "shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-zkzkq", MachineID: aws:///eu-west-1/i-020339c05fe741da4 I April 13th 2020, 11:53:56.721 Created/Adopted machine: "shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj", MachineID: aws:///eu-west-1/i-080641c8fcb74ec59 I April 13th 2020, 11:53:56.744 Machine "shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj" not found on APIServer anymore. 
Deleting created (orphan) VM I April 13th 2020, 11:53:57.019 Deletion failed for orphan machine "" with provider-ID "aws:///eu-west-1/i-080641c8fcb74ec59": InvalidInstanceAttributeValue: No device is currently mapped at /dev/sda1 E April 13th 2020, 11:53:57.019 u0009status code: 400, request id: 87b90f35-962b-4f6c-a18f-7f7fe5c841a1 - April 13th 2020, 11:53:57.019 u0009status code: 400, request id: 87b90f35-962b-4f6c-a18f-7f7fe5c841a1 - April 13th 2020, 11:53:57.019 InvalidInstanceAttributeValue: No device is currently mapped at /dev/sda1 E April 13th 2020, 11:55:37.027 Machine shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-zkzkq successfully joined the cluster I April 13th 2020, 11:55:37.141 Processing the machinedeployment "shoot--it--tmdtw-w19-worker-1-z1" (with replicas 3) I April 13th 2020, 11:55:37.186 Processing the machinedeployment "shoot--it--tmdtw-w19-worker-1-z1" (with replicas 3) I April 13th 2020, 12:02:21.947 Processing the machinedeployment "shoot--it--tmdtw-w19-worker-1-z1" (with replicas 3) I April 13th 2020, 12:08:53.558 Unexpected EOF during watch stream event decoding: unexpected EOF I April 13th 2020, 12:08:53.558 Unexpected EOF during watch stream event decoding: unexpected EOF I April 13th 2020, 12:08:53.558 Unexpected EOF during watch stream event decoding: unexpected EOF I April 13th 2020, 12:08:54.574 k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.PersistentVolumeClaim: Get https://kube-apiserver.shoot--it--tmdtw-w19/api/v1/persistentvolumeclaims?allowWatchBookmarks=trueu0026resourceVersion=10395u0026timeout=8m46su0026timeoutSeconds=526u0026watch=true: dial tcp 10.243.101.69:443: connect: connection refused E April 13th 2020, 12:08:54.574 k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: Get https://kube-apiserver.shoot--it--tmdtw-w19/api/v1/nodes?allowWatchBookmarks=trueu0026resourceVersion=14645u0026timeout=7m1su0026timeoutSeconds=421u0026watch=true: dial tcp 10.243.101.69:443: connect: connection refused E ```

The critical part of the logs seems to be:

```
April 13th 2020, 11:53:55.043   Too many replicas for  shoot--it--tmdtw-w19/shoot--it--tmdtw-w19-worker-1-z1-5b88647f77, need 3, deleting 1 I
April 13th 2020, 11:53:55.043   Controller shoot--it--tmdtw-w19-worker-1-z1-5b88647f77 deleting machine shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj   I
April 13th 2020, 11:53:55.067   Event(v1.ObjectReference{Kind:"MachineSet", Namespace:"shoot--it--tmdtw-w19", Name:"shoot--it--tmdtw-w19-worker-1-z1-5b88647f77", UID:"ccc83289-ef57-435a-abf5-81c43c66b495", APIVersion:"machine.sapcloud.io/v1alpha1", ResourceVersion:"2883263247", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted machine: shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj I
April 13th 2020, 11:53:55.087   Warning: Updated failed, retrying, error: "Operation cannot be fulfilled on machines.machine.sapcloud.io \"shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj\": StorageError: invalid object, Code: 4, Key: /registry/machine.sapcloud.io/machines/shoot--it--tmdtw-w19/shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 7f9f230a-5492-4056-94f6-29b563569d55, UID in object meta: "    W
April 13th 2020, 11:53:55.088   Creating machine "shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-zkzkq", please wait!  I
April 13th 2020, 11:53:55.089   Delete machine from machineset "shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj"  I
April 13th 2020, 11:53:55.120   Creating machine "shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj", please wait!  I
April 13th 2020, 11:53:55.220   Processing the machinedeployment "shoot--it--tmdtw-w19-worker-1-z1" (with replicas 3)   I
April 13th 2020, 11:53:55.266   Processing the machinedeployment "shoot--it--tmdtw-w19-worker-1-z1" (with replicas 3)   I
April 13th 2020, 11:53:56.396   Created/Adopted machine: "shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-zkzkq", MachineID: aws:///eu-west-1/i-020339c05fe741da4   I
April 13th 2020, 11:53:56.721   Created/Adopted machine: "shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj", MachineID: aws:///eu-west-1/i-080641c8fcb74ec59   I
April 13th 2020, 11:53:56.744   Machine "shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj" not found on APIServer anymore. Deleting created (orphan) VM    I
April 13th 2020, 11:53:57.019   Deletion failed for orphan machine "" with provider-ID "aws:///eu-west-1/i-080641c8fcb74ec59": InvalidInstanceAttributeValue: No device is currently mapped at /dev/sda1    E
```

The existing (orphan) instance has the id i-080641c8fcb74ec59 and the Name tag {"Key": "Name", "Value": "shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj"}.
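In case it helps with manual cleanup: a minimal sketch (AWS SDK for Go v1; the region `eu-west-1` and the Name tag value are taken from the logs above, everything else is illustrative) of how such an orphan instance can be located by its Name tag before terminating it by hand:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// Region and tag value match the leaked instance reported in this issue.
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-west-1")}))
	svc := ec2.New(sess)

	// Look up all instances that still carry the machine's Name tag.
	out, err := svc.DescribeInstances(&ec2.DescribeInstancesInput{
		Filters: []*ec2.Filter{
			{
				Name:   aws.String("tag:Name"),
				Values: []*string{aws.String("shoot--it--tmdtw-w19-worker-1-z1-5b88647f77-wtmpj")},
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	for _, r := range out.Reservations {
		for _, i := range r.Instances {
			fmt.Printf("%s (%s)\n", aws.StringValue(i.InstanceId), aws.StringValue(i.State.Name))
		}
	}
}
```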

Anything else we need to know:

Environment: machine-controller-manager version: v0.27.0

prashanth26 commented 4 years ago

Hi @ialidzhikov,

Thanks for this info. I think this is caused by this call: https://github.com/gardener/machine-controller-manager/blob/master/pkg/driver/driver_aws.go#L288. This call was introduced to fix a bug so that all volumes attached to a machine are deleted by default. However, in this case the disk may never have been attached properly. We will look into this.
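For illustration only, a rough sketch of the direction such a fix could take (this is not the actual driver code; it calls the AWS SDK for Go v1 directly and the helper name `deleteOrphanVM` is made up): if setting `DeleteOnTermination` on `/dev/sda1` fails with `InvalidInstanceAttributeValue` because no device is mapped yet, the cleanup could skip that step and still terminate the instance instead of aborting:

```go
package driver

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// deleteOrphanVM is a hypothetical helper showing how the orphan-VM cleanup
// could tolerate the "No device is currently mapped at /dev/sda1" error.
func deleteOrphanVM(svc *ec2.EC2, instanceID string) error {
	// Best effort: mark the root volume to be deleted together with the instance.
	_, err := svc.ModifyInstanceAttribute(&ec2.ModifyInstanceAttributeInput{
		InstanceId: aws.String(instanceID),
		BlockDeviceMappings: []*ec2.InstanceBlockDeviceMappingSpecification{
			{
				DeviceName: aws.String("/dev/sda1"),
				Ebs: &ec2.EbsInstanceBlockDeviceSpecification{
					DeleteOnTermination: aws.Bool(true),
				},
			},
		},
	})
	if err != nil {
		// If no device is mapped at /dev/sda1 (e.g. the instance is still pending),
		// don't abort the cleanup; skip this step and terminate anyway.
		if aerr, ok := err.(awserr.Error); !ok || aerr.Code() != "InvalidInstanceAttributeValue" {
			return err
		}
	}

	_, err = svc.TerminateInstances(&ec2.TerminateInstancesInput{
		InstanceIds: []*string{aws.String(instanceID)},
	})
	return err
}
```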

prashanth26 commented 4 years ago

We can continue the discussion of how best to fix this here. Also, I am not 100% sure how to reproduce this, but I think it should be reproducible on Shoot deletions.

CC: @hardikdr @ggaurav10 @amshuman-kr