equinixmetal-archive / csi-packet

Kubernetes CSI driver for Equinix Metal, formerly Packet
Apache License 2.0

volume mount failures due to "... 422 Instance is already attached to this volume" #66

Closed zeman412 closed 4 years ago

zeman412 commented 4 years ago

After installing the CSI driver, I was able to create the volume claim, and it worked fine the first time. However, the problem occurs when I delete the deployment and then recreate it so that it mounts the existing volume claim.

root@ewr1-controller:~/packet_taurus# kubectl get pvc
No resources found.
root@ewr1-controller:~/packet_taurus# kubectl create -f pvc-mysql.yaml 
persistentvolumeclaim/mysql-volumeclaim created
root@ewr1-controller:~/packet_taurus# kubectl get pvc
NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS          AGE
mysql-volumeclaim   Bound    pvc-6a577e83-ce99-4862-9947-5a33509464f2   10Gi       RWO            csi-packet-standard   6s
root@ewr1-controller:~/packet_taurus# kubectl create -f mysql.yaml 
service/magento-mysql created
deployment.apps/magento-mysql created
root@ewr1-controller:~/packet_taurus# kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
magento-mysql-fbd8dbc6-rsxfp   1/1     Running   0          64s
root@ewr1-controller:~/packet_taurus# kubectl delete -f mysql.yaml 
service "magento-mysql" deleted
deployment.apps "magento-mysql" deleted
root@ewr1-controller:~/packet_taurus# kubectl get pvc
NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS          AGE
mysql-volumeclaim   Bound    pvc-6a577e83-ce99-4862-9947-5a33509464f2   10Gi       RWO            csi-packet-standard   3m33s
root@ewr1-controller:~/packet_taurus# kubectl create -f mysql.yaml 
service/magento-mysql created
deployment.apps/magento-mysql created
root@ewr1-controller:~/packet_taurus# kubectl get pods
NAME                           READY   STATUS              RESTARTS   AGE
magento-mysql-fbd8dbc6-kll6f   0/1     ContainerCreating   0          80s
root@ewr1-controller:~/packet_taurus# 

The MySQL deployment was running fine; then I deleted the deployment and recreated it, and as shown above, MySQL fails to mount the PVC this time.

$ kubectl describe pod magento-mysql-fbd8dbc6-kll6f
.
.
.
Events:
  Type     Reason              Age                   From                                 Message
  ----     ------              ----                  ----                                 -------
  Normal   Scheduled           8m34s                 default-scheduler                    Successfully assigned default/magento-mysql-fbd8dbc6-kll6f to ewr1-t1.small.x86-worker-1
  Warning  FailedMount         119s (x3 over 6m31s)  kubelet, ewr1-t1.small.x86-worker-1  Unable to mount volumes for pod "magento-mysql-fbd8dbc6-kll6f_default(d8127740-869e-45a8-b6c2-18f76401aa56)": timeout expired waiting for volumes to attach or mount for pod "default"/"magento-mysql-fbd8dbc6-kll6f". list of unmounted volumes=[mysql-persistent-storage]. list of unattached volumes=[mysql-persistent-storage mysql-config-volume default-token-blnv7]
  Warning  FailedAttachVolume  18s (x12 over 8m32s)  attachdetach-controller              AttachVolume.Attach failed for volume "pvc-6a577e83-ce99-4862-9947-5a33509464f2" : rpc error: code = Unknown desc = error attempting to attach cfdeaa23-f9b3-4ec0-a324-29185b1546dd to 167e1997-0802-449e-97e0-fbbf4a6d7e2b, POST https://api.packet.net/storage/cfdeaa23-f9b3-4ec0-a324-29185b1546dd/attachments: 422 Instance is already attached to this volume

I followed the README for the Packet CSI driver and created the PVC exactly as in the provided example.
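For context on the 422 above: the storage API rejects an attach request when the volume already has an attachment, so a controller that blindly POSTs a new attachment on every reconcile will hit this error whenever the old attachment lingers. Below is a minimal, self-contained sketch of how an attach call can be made idempotent against such an API. The `StorageAPI` class is a toy in-memory stand-in, and `attach_idempotent` is a hypothetical helper, not the driver's actual code:

```python
class AlreadyAttachedError(Exception):
    """Stands in for the API's '422 Instance is already attached to this volume'."""


class StorageAPI:
    """Toy in-memory stand-in for the Packet storage API (hypothetical)."""

    def __init__(self):
        self.attachments = {}  # volume_id -> instance_id

    def list_attachments(self, volume_id):
        inst = self.attachments.get(volume_id)
        return [inst] if inst else []

    def attach(self, volume_id, instance_id):
        # The real API returns 422 if any attachment already exists.
        if volume_id in self.attachments:
            raise AlreadyAttachedError("422 Instance is already attached to this volume")
        self.attachments[volume_id] = instance_id
        return instance_id


def attach_idempotent(api, volume_id, instance_id):
    """Check existing attachments before POSTing a new one."""
    existing = api.list_attachments(volume_id)
    if instance_id in existing:
        return instance_id  # already attached to the target node: treat as success
    if existing:
        # Attached to a different node: surface the conflict instead of a raw 422.
        raise AlreadyAttachedError(f"volume {volume_id} is attached elsewhere: {existing}")
    return api.attach(volume_id, instance_id)
```

With this guard, a repeated attach to the same node succeeds instead of surfacing a 422, while an attach to a different node still fails with a clear conflict.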

deitch commented 4 years ago

Thanks @zeman412 for reporting it.

We will try to recreate it. At first blush, it sounds like an implementation bug rather than a documentation issue. Either way, we will track it down and update here.

deitch commented 4 years ago

@zeman412 can you post your pvc and pod or deployment yaml?

deitch commented 4 years ago

@zeman412 I was not able to recreate it. I followed your steps:

  1. Create the pvc
  2. Create the deployment
  3. Wait for success
  4. Delete the deployment
  5. Recreate the deployment

But it succeeded in reattaching. Do you have nodes up and running that we can see the failure?

zeman412 commented 4 years ago

@deitch yes, those are the steps that trigger the problem. I just reproduced the error with the nginx deployment example included in csi-packet/deploy/demo/demo-deployment.yaml. I also observed that the problem may not occur on the first attempt, but if I repeat the procedure a second or third time, the error occurs. Here are the details:

  1. pvc-nginx.yaml

     kind: PersistentVolumeClaim
     apiVersion: v1
     metadata:
       name: podpvc
     spec:
       accessModes:
       - ReadWriteOnce
       storageClassName: csi-packet-standard
       resources:
         requests:
           storage: 1Gi

  2. nginx.yaml

     kind: Deployment
     apiVersion: apps/v1
     metadata:
       labels:
         run: nginx
       name: nginx
     spec:
       replicas: 1
       selector:
         matchLabels:
           run: nginx
       strategy:
         rollingUpdate:
           maxSurge: 1
           maxUnavailable: 1
         type: RollingUpdate
       template:
         metadata:
           name: web-server
           labels:
             run: nginx
         spec:
           # nodeSelector:
           #   kubernetes.io/hostname: "10.88.52.141"
           containers:
           - image: nginx
             name: nginx
             volumeMounts:
             - mountPath: /var/lib/www/html
               name: mypvc
           volumes:
           - name: mypvc
             persistentVolumeClaim:
               claimName: podpvc
               readOnly: false
Then created pvc and deployment:
  1. kubectl create -f pvc-nginx.yaml

    persistentvolumeclaim/podpvc created

  2. kubectl create -f nginx.yaml

    deployment.apps/nginx created

  3. Wait for success:
    kubectl get pods
    NAME                     READY   STATUS    RESTARTS   AGE
    nginx-7f545b55b8-slj2r   1/1     Running   0          59s 
  4. Delete the deployment:
    kubectl delete -f nginx.yaml 
    deployment.apps "nginx" deleted
    # kubectl get pods
    No resources found.
  5. Recreate the deployment:

     5.1: First attempt:
    # kubectl create -f nginx.yaml 
    deployment.apps/nginx created
    root@ewr1-controller:~/csi-packet# kubectl get pods
    NAME                     READY   STATUS    RESTARTS   AGE
    nginx-7f545b55b8-mms5f   1/1     Running   0          41s

    5.2: Second attempt:

    # kubectl delete -f nginx.yaml 
    deployment.apps "nginx" deleted
    # kubectl get pods
    No resources found.

    Now recreate the deployment:

    # kubectl create -f nginx.yaml 
    deployment.apps/nginx created
    # kubectl get pods
    NAME                     READY   STATUS              RESTARTS   AGE
    nginx-7f545b55b8-qwqmr   0/1     ContainerCreating   0          17m

    It got stuck there:

    # kubectl describe pods nginx-7f545b55b8-qwqmr
    Name:           nginx-7f545b55b8-qwqmr
    Namespace:      default
    Priority:       0
    Node:           ewr1-m2.xlarge.x86-worker-5/10.99.142.13
    Start Time:     Mon, 25 Nov 2019 15:27:12 +0000
    Labels:         pod-template-hash=7f545b55b8
                run=nginx
    Annotations:    <none>
    Status:         Pending
    IP:             
    Controlled By:  ReplicaSet/nginx-7f545b55b8
    Containers:
    nginx:
    Container ID:   
    Image:          nginx
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/www/html from mypvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-8gj6m (ro)
    Conditions:
    Type              Status
    Initialized       True 
    Ready             False 
    ContainersReady   False 
    PodScheduled      True 
    Volumes:
    mypvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  podpvc
    ReadOnly:   false
    default-token-8gj6m:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-8gj6m
    Optional:    false
    QoS Class:       BestEffort
    Node-Selectors:  <none>
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
    Events:
    Type     Reason              Age                 From                                  Message
    ----     ------              ----                ----                                  -------
    Normal   Scheduled           18m                 default-scheduler                     Successfully assigned default/nginx-7f545b55b8-qwqmr to ewr1-m2.xlarge.x86-worker-5
    Warning  FailedAttachVolume  18m                 attachdetach-controller               Multi-Attach error for volume "pvc-16b22d0e-2151-4f81-905b-a8163552d2eb" Volume is already exclusively attached to one node and can't be attached to another
    Warning  FailedMount         64s (x8 over 16m)   kubelet, ewr1-m2.xlarge.x86-worker-5  Unable to mount volumes for pod "nginx-7f545b55b8-qwqmr_default(53a82355-4763-4318-b4bd-68c2ab7a8990)": timeout expired waiting for volumes to attach or mount for pod "default"/"nginx-7f545b55b8-qwqmr". list of unmounted volumes=[mypvc]. list of unattached volumes=[mypvc default-token-8gj6m]
    Warning  FailedAttachVolume  26s (x17 over 18m)  attachdetach-controller               AttachVolume.Attach failed for volume "pvc-16b22d0e-2151-4f81-905b-a8163552d2eb" : rpc error: code = Unknown desc = error attempting to attach b1b3f63d-526a-4615-8a90-b109ecfff21f to f7c652f0-30d3-471b-89b7-5380c4d6bf27, POST https://api.packet.net/storage/b1b3f63d-526a-4615-8a90-b109ecfff21f/attachments: 422 Instance is already attached to this volume

    We are working on performance tuning and automating the whole process; thus, I had to delete and recreate the deployment repeatedly.
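The Multi-Attach error followed by the 422 above suggests the stale attachment from the deleted pod's node was never cleaned up before the new pod was scheduled onto a different node. A hedged sketch of the recovery a controller could attempt in that situation: detach any attachment on a non-target node before attaching. The `StorageAPI` stand-in and `attach_with_stale_cleanup` helper are hypothetical illustrations, not the driver's actual code:

```python
class StorageAPI:
    """Toy stand-in for the Packet storage API: one attachment per volume."""

    def __init__(self, attached_to=None):
        self.attached_to = attached_to  # instance_id or None

    def list_attachments(self, volume_id):
        return [self.attached_to] if self.attached_to else []

    def detach(self, volume_id, instance_id):
        if self.attached_to == instance_id:
            self.attached_to = None

    def attach(self, volume_id, instance_id):
        # Mimic the 422 when the volume is attached to a different node.
        if self.attached_to and self.attached_to != instance_id:
            raise RuntimeError("422 Instance is already attached to this volume")
        self.attached_to = instance_id
        return instance_id


def attach_with_stale_cleanup(api, volume_id, target):
    """Detach any attachment left on another node, then attach to `target`."""
    for inst in api.list_attachments(volume_id):
        if inst != target:
            api.detach(volume_id, inst)  # clean up the stale attachment first
    return api.attach(volume_id, target)
```

In a real controller this detach would only be safe once the old node is known to have logged out of its iscsi session, which is exactly the ordering problem discussed later in this thread.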

deitch commented 4 years ago

Well, if you are using the official demos, I cannot claim anything is off with your manifests. :-)

I can give it a shot at recreating. But I see you have clusters up. Can I connect to them and look around? If so, send me a DM on Packet Slack (I am deitcher on there). If not, I can keep trying to recreate. Let me know.

deitch commented 4 years ago

Ah, I got it. Now I can see it. Interesting, will figure this one out.

zeman412 commented 4 years ago

@deitch sorry for the late reply, I moved to a different project and didn't get a chance to check this out. We tore down the k8s cluster, but I will get back to this task soon, and I would love to hear if there is an update or fix for this issue. I also had an issue with detaching volumes and deleting storage (I see there is a new issue opened for this). If I remember correctly, it also asks for manual verification to delete storage from the UI, which is inconvenient for automating the whole deployment process.

deitch commented 4 years ago

No problem.

We have two updates going in for the issue. The first relates to internals of the CSI itself. That will go through as soon as our CI (actually CD) is fixed. We are having some issues with cross-building the arm64 images.

The second relates to how the Packet API and its backing storage release volumes after the host iscsi is logged out. That is a bit thorny, but we will have something on that soon.
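Given that ordering (the API only releases the volume after the host's iscsi logout), a controller may need to wait for the API to report the volume as released before re-attaching it elsewhere. A small sketch of such a polling wait, assuming a `list_attachments` callable; the helper name and signature are hypothetical, not the driver's actual code:

```python
import time


def wait_until_released(list_attachments, volume_id,
                        timeout=60.0, interval=1.0, sleep=time.sleep):
    """Poll until the API reports no attachments for the volume, or time out.

    Returns True once the volume is released, False if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not list_attachments(volume_id):
            return True  # the backing storage has released the volume
        sleep(interval)
    return False
```

A real controller would likely add jitter or exponential backoff, but the key point is not to issue the new attach until the old attachment disappears from the API.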

If I remember correctly, it also asks for manual verification to delete storage from the UI, which is inconvenient for automating the whole deployment process.

As far as I know, that is just for UI deletion. API-driven ones do not involve any manual process.

zeman412 commented 4 years ago

Sounds good, looking forward to the updates.

deitch commented 4 years ago

There are two updates in flight. One, #77, helps with part of this; another, which will come right after #77 is in, will handle the delete issue.

deitch commented 4 years ago

Should be fixed in #79