equinixmetal-archive / csi-packet

Kubernetes CSI driver for Equinix Metal, formerly Packet
Apache License 2.0

volume mount failures due to "... 422 Instance is already attached to this volume" #66

Closed zeman412 closed 4 years ago

zeman412 commented 4 years ago

After installing the CSI driver, I was able to create the volume claim, and it worked fine the first time. However, the problem occurs when I delete the deployment and then recreate it so that it mounts the existing volume claim.

root@ewr1-controller:~/packet_taurus# kubectl get pvc
No resources found.
root@ewr1-controller:~/packet_taurus# kubectl create -f pvc-mysql.yaml 
persistentvolumeclaim/mysql-volumeclaim created
root@ewr1-controller:~/packet_taurus# kubectl get pvc
NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS          AGE
mysql-volumeclaim   Bound    pvc-6a577e83-ce99-4862-9947-5a33509464f2   10Gi       RWO            csi-packet-standard   6s
root@ewr1-controller:~/packet_taurus# kubectl create -f mysql.yaml 
service/magento-mysql created
deployment.apps/magento-mysql created
root@ewr1-controller:~/packet_taurus# kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
magento-mysql-fbd8dbc6-rsxfp   1/1     Running   0          64s
root@ewr1-controller:~/packet_taurus# kubectl delete -f mysql.yaml 
service "magento-mysql" deleted
deployment.apps "magento-mysql" deleted
root@ewr1-controller:~/packet_taurus# kubectl get pvc
NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS          AGE
mysql-volumeclaim   Bound    pvc-6a577e83-ce99-4862-9947-5a33509464f2   10Gi       RWO            csi-packet-standard   3m33s
root@ewr1-controller:~/packet_taurus# kubectl create -f mysql.yaml 
service/magento-mysql created
deployment.apps/magento-mysql created
root@ewr1-controller:~/packet_taurus# kubectl get pods
NAME                           READY   STATUS              RESTARTS   AGE
magento-mysql-fbd8dbc6-kll6f   0/1     ContainerCreating   0          80s
root@ewr1-controller:~/packet_taurus# 

The MySQL deployment was running fine; then I deleted the deployment and recreated it, and as shown above, MySQL fails to mount the PVC this time.

$ kubectl describe pod magento-mysql-fbd8dbc6-kll6f
.
.
.
Events:
  Type     Reason              Age                   From                                 Message
  ----     ------              ----                  ----                                 -------
  Normal   Scheduled           8m34s                 default-scheduler                    Successfully assigned default/magento-mysql-fbd8dbc6-kll6f to ewr1-t1.small.x86-worker-1
  Warning  FailedMount         119s (x3 over 6m31s)  kubelet, ewr1-t1.small.x86-worker-1  Unable to mount volumes for pod "magento-mysql-fbd8dbc6-kll6f_default(d8127740-869e-45a8-b6c2-18f76401aa56)": timeout expired waiting for volumes to attach or mount for pod "default"/"magento-mysql-fbd8dbc6-kll6f". list of unmounted volumes=[mysql-persistent-storage]. list of unattached volumes=[mysql-persistent-storage mysql-config-volume default-token-blnv7]
  Warning  FailedAttachVolume  18s (x12 over 8m32s)  attachdetach-controller              AttachVolume.Attach failed for volume "pvc-6a577e83-ce99-4862-9947-5a33509464f2" : rpc error: code = Unknown desc = error attempting to attach cfdeaa23-f9b3-4ec0-a324-29185b1546dd to 167e1997-0802-449e-97e0-fbbf4a6d7e2b, POST https://api.packet.net/storage/cfdeaa23-f9b3-4ec0-a324-29185b1546dd/attachments: 422 Instance is already attached to this volume

I followed the README for the Packet CSI driver and created the PVC exactly as in the provided example.
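For context on the 422 above: the storage API rejects an attach request when the volume already has an attachment, so a controller that blindly POSTs a new attachment on every reconcile will hit this error whenever the old attachment lingers. Below is a minimal, self-contained sketch of how an attach call can be made idempotent against such an API. The `StorageAPI` class is a toy in-memory stand-in, and `attach_idempotent` is a hypothetical helper, not the driver's actual code:

```python
class AlreadyAttachedError(Exception):
    """Stands in for the API's '422 Instance is already attached to this volume'."""


class StorageAPI:
    """Toy in-memory stand-in for the Packet storage API (hypothetical)."""

    def __init__(self):
        self.attachments = {}  # volume_id -> instance_id

    def list_attachments(self, volume_id):
        inst = self.attachments.get(volume_id)
        return [inst] if inst else []

    def attach(self, volume_id, instance_id):
        # The real API returns 422 if any attachment already exists.
        if volume_id in self.attachments:
            raise AlreadyAttachedError("422 Instance is already attached to this volume")
        self.attachments[volume_id] = instance_id
        return instance_id


def attach_idempotent(api, volume_id, instance_id):
    """Check existing attachments before POSTing a new one."""
    existing = api.list_attachments(volume_id)
    if instance_id in existing:
        return instance_id  # already attached to the target node: treat as success
    if existing:
        # Attached to a different node: surface the conflict instead of a raw 422.
        raise AlreadyAttachedError(f"volume {volume_id} is attached elsewhere: {existing}")
    return api.attach(volume_id, instance_id)
```

With this guard, a repeated attach to the same node succeeds instead of surfacing a 422, while an attach to a different node still fails with a clear conflict.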

deitch commented 4 years ago

Thanks @zeman412 for reporting it.

We will try to recreate it. At first blush, it sounds like an implementation bug rather than a documentation issue. Either way, we will track it down and update here.

deitch commented 4 years ago

@zeman412 can you post your pvc and pod or deployment yaml?

deitch commented 4 years ago

@zeman412 I was not able to recreate it. I followed your steps:

  1. Create the pvc
  2. Create the deployment
  3. Wait for success
  4. Delete the deployment
  5. Recreate the deployment

But it succeeded in reattaching. Do you have nodes up and running that we can see the failure?

zeman412 commented 4 years ago

@deitch yes, those are the steps that trigger the problem. I just reproduced the error with the nginx deployment example included in csi-packet/deploy/demo/demo-deployment.yaml. I also observed that the problem may not occur on the first attempt, but if I repeat the procedure a second or third time, the error occurs. Here are the details:

  1. pvc-nginx.yaml

     kind: PersistentVolumeClaim
     apiVersion: v1
     metadata:
       name: podpvc
     spec:
       accessModes:
       - ReadWriteOnce
       storageClassName: csi-packet-standard
       resources:
         requests:
           storage: 1Gi

  2. nginx.yaml

     kind: Deployment
     apiVersion: apps/v1
     metadata:
       labels:
         run: nginx
       name: nginx
     spec:
       replicas: 1
       selector:
         matchLabels:
           run: nginx
       strategy:
         rollingUpdate:
           maxSurge: 1
           maxUnavailable: 1
         type: RollingUpdate
       template:
         metadata:
           name: web-server
           labels:
             run: nginx
         spec:
           # nodeSelector:
           #   kubernetes.io/hostname: "10.88.52.141"
           containers:
           - image: nginx
             name: nginx
             volumeMounts:
             - mountPath: /var/lib/www/html
               name: mypvc
           volumes:
           - name: mypvc
             persistentVolumeClaim:
               claimName: podpvc
               readOnly: false
Then created pvc and deployment:
  1. kubectl create -f pvc-nginx.yaml

    persistentvolumeclaim/podpvc created

  2. kubectl create -f nginx.yaml

    deployment.apps/nginx created

  3. Wait for success:
    kubectl get pods
    NAME                     READY   STATUS    RESTARTS   AGE
    nginx-7f545b55b8-slj2r   1/1     Running   0          59s 
  4. Delete the deployment:
    kubectl delete -f nginx.yaml 
    deployment.apps "nginx" deleted
    # kubectl get pods
    No resources found.
  5. Recreate the deployment:

     5.1: First attempt:
    # kubectl create -f nginx.yaml 
    deployment.apps/nginx created
    root@ewr1-controller:~/csi-packet# kubectl get pods
    NAME                     READY   STATUS    RESTARTS   AGE
    nginx-7f545b55b8-mms5f   1/1     Running   0          41s

    5.2: Second attempt:

    # kubectl delete -f nginx.yaml 
    deployment.apps "nginx" deleted
    # kubectl get pods
    No resources found.

    Now recreate the deployment:

    # kubectl create -f nginx.yaml 
    deployment.apps/nginx created
    # kubectl get pods
    NAME                     READY   STATUS              RESTARTS   AGE
    nginx-7f545b55b8-qwqmr   0/1     ContainerCreating   0          17m

    It got stuck there:

    # kubectl describe pods nginx-7f545b55b8-qwqmr
    Name:           nginx-7f545b55b8-qwqmr
    Namespace:      default
    Priority:       0
    Node:           ewr1-m2.xlarge.x86-worker-5/10.99.142.13
    Start Time:     Mon, 25 Nov 2019 15:27:12 +0000
    Labels:         pod-template-hash=7f545b55b8
                run=nginx
    Annotations:    <none>
    Status:         Pending
    IP:             
    Controlled By:  ReplicaSet/nginx-7f545b55b8
    Containers:
    nginx:
    Container ID:   
    Image:          nginx
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/www/html from mypvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-8gj6m (ro)
    Conditions:
    Type              Status
    Initialized       True 
    Ready             False 
    ContainersReady   False 
    PodScheduled      True 
    Volumes:
    mypvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  podpvc
    ReadOnly:   false
    default-token-8gj6m:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-8gj6m
    Optional:    false
    QoS Class:       BestEffort
    Node-Selectors:  <none>
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
    Events:
    Type     Reason              Age                 From                                  Message
    ----     ------              ----                ----                                  -------
    Normal   Scheduled           18m                 default-scheduler                     Successfully assigned default/nginx-7f545b55b8-qwqmr to ewr1-m2.xlarge.x86-worker-5
    Warning  FailedAttachVolume  18m                 attachdetach-controller               Multi-Attach error for volume "pvc-16b22d0e-2151-4f81-905b-a8163552d2eb" Volume is already exclusively attached to one node and can't be attached to another
    Warning  FailedMount         64s (x8 over 16m)   kubelet, ewr1-m2.xlarge.x86-worker-5  Unable to mount volumes for pod "nginx-7f545b55b8-qwqmr_default(53a82355-4763-4318-b4bd-68c2ab7a8990)": timeout expired waiting for volumes to attach or mount for pod "default"/"nginx-7f545b55b8-qwqmr". list of unmounted volumes=[mypvc]. list of unattached volumes=[mypvc default-token-8gj6m]
    Warning  FailedAttachVolume  26s (x17 over 18m)  attachdetach-controller               AttachVolume.Attach failed for volume "pvc-16b22d0e-2151-4f81-905b-a8163552d2eb" : rpc error: code = Unknown desc = error attempting to attach b1b3f63d-526a-4615-8a90-b109ecfff21f to f7c652f0-30d3-471b-89b7-5380c4d6bf27, POST https://api.packet.net/storage/b1b3f63d-526a-4615-8a90-b109ecfff21f/attachments: 422 Instance is already attached to this volume

    We are working on performance tuning and automating the whole process; thus, I had to delete and recreate the deployment repeatedly.
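The Multi-Attach error followed by the 422 above suggests the stale attachment from the deleted pod's node was never cleaned up before the new pod was scheduled onto a different node. A hedged sketch of the recovery a controller could attempt in that situation: detach any attachment on a non-target node before attaching. The `StorageAPI` stand-in and `attach_with_stale_cleanup` helper are hypothetical illustrations, not the driver's actual code:

```python
class StorageAPI:
    """Toy stand-in for the Packet storage API: one attachment per volume."""

    def __init__(self, attached_to=None):
        self.attached_to = attached_to  # instance_id or None

    def list_attachments(self, volume_id):
        return [self.attached_to] if self.attached_to else []

    def detach(self, volume_id, instance_id):
        if self.attached_to == instance_id:
            self.attached_to = None

    def attach(self, volume_id, instance_id):
        # Mimic the 422 when the volume is attached to a different node.
        if self.attached_to and self.attached_to != instance_id:
            raise RuntimeError("422 Instance is already attached to this volume")
        self.attached_to = instance_id
        return instance_id


def attach_with_stale_cleanup(api, volume_id, target):
    """Detach any attachment left on another node, then attach to `target`."""
    for inst in api.list_attachments(volume_id):
        if inst != target:
            api.detach(volume_id, inst)  # clean up the stale attachment first
    return api.attach(volume_id, target)
```

In a real controller this detach would only be safe once the old node is known to have logged out of its iscsi session, which is exactly the ordering problem discussed later in this thread.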

deitch commented 4 years ago

Well, if you are using the official demos, I cannot claim anything is off with your manifests. :-)

I can give it a shot at recreating. But I see you have clusters up. Can I connect to them and look around? If so, send me a DM on Packet Slack (I am deitcher on there). If not, I can keep trying to recreate. Let me know.

deitch commented 4 years ago

Ah, I got it. Now I can see it. Interesting, will figure this one out.

zeman412 commented 4 years ago

@deitch sorry for the late reply, I moved to a different project and didn't get a chance to check this out. We tore down the k8s cluster, but I will get back to this task soon, and I would love to hear if there is an update or fix for this issue. I also had an issue with detaching volumes and deleting storage (I see there is a new issue opened for this). If I remember correctly, it also asks for manual verification to delete storage from the UI, which is inconvenient for automating the whole deployment process.

deitch commented 4 years ago

No problem.

We have two updates going in for the issue. The first relates to internals of the CSI itself. That will go through as soon as our CI (actually CD) is fixed. We are having some issues with cross-building the arm64 images.

The second relates to how the Packet API and its backing storage release volumes after the host iscsi is logged out. That is a bit thorny, but we will have something on that soon.
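Given that ordering (the API only releases the volume after the host's iscsi logout), a controller may need to wait for the API to report the volume as released before re-attaching it elsewhere. A small sketch of such a polling wait, assuming a `list_attachments` callable; the helper name and signature are hypothetical, not the driver's actual code:

```python
import time


def wait_until_released(list_attachments, volume_id,
                        timeout=60.0, interval=1.0, sleep=time.sleep):
    """Poll until the API reports no attachments for the volume, or time out.

    Returns True once the volume is released, False if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not list_attachments(volume_id):
            return True  # the backing storage has released the volume
        sleep(interval)
    return False
```

A real controller would likely add jitter or exponential backoff, but the key point is not to issue the new attach until the old attachment disappears from the API.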

If I remember correctly, it also asks for manual verification to delete storage from the UI, which is inconvenient for automating the whole deployment process.

As far as I know, that is just for UI deletion. API-driven ones do not involve any manual process.

zeman412 commented 4 years ago

Sounds good, looking forward to the updates.

deitch commented 4 years ago

There are two updates in flight. One, #77, helps with part of this; another, which will come right after #77 is in, will handle the delete issue.

deitch commented 4 years ago

Should be fixed in #79