digitalocean / csi-digitalocean

A Container Storage Interface (CSI) Driver for DigitalOcean Block Storage
Apache License 2.0

Can not mount volume #242

Closed Paxa closed 4 years ago

Paxa commented 4 years ago

All PVCs work fine, except for 1 pod

K8s v1.16.2

describe pod:

  Warning  FailedMount  31s (x2 over 5m)  kubelet, myapp-abcd  Unable to attach or mount volumes: unmounted volumes=[postgres-persistent-storage3], unattached volumes=[postgres-persistent-storage3 default-token-27frr]: timed out waiting for the condition

From syslog:

cat /var/log/syslog | grep postgres-dcb5fbfd-dcx5x

Dec 11 17:26:29 garuda-staging-l03r docker[1929]: E1211 17:26:29.464844    1955 kubelet.go:1682] Unable to attach or mount volumes for pod "postgres-dcb5fbfd-pc95m_ga-monitor(97c8036c-4e56-446b-bcf7-c70616950216)": unmounted volumes=[postgres-persistent-storage3], unattached volumes=[default-token-27frr postgres-persistent-storage3]: timed out waiting for the condition; skipping pod
Dec 11 17:26:29 garuda-staging-l03r docker[1929]: E1211 17:26:29.464908    1955 pod_workers.go:191] Error syncing pod 97c8036c-4e56-446b-bcf7-c70616950216 ("postgres-dcb5fbfd-pc95m_ga-monitor(97c8036c-4e56-446b-bcf7-c70616950216)"), skipping: unmounted volumes=[postgres-persistent-storage3], unattached volumes=[default-token-27frr postgres-persistent-storage3]: timed out waiting for the condition
Dec 11 17:27:45 garuda-staging-l03r docker[1929]: I1211 17:27:45.527433    1955 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-a34c150a-70da-4daf-93c0-708f53c749c0" (UniqueName: "kubernetes.io/csi/dobs.csi.digitalocean.com^3cb2ce70-1a04-11ea-8090-0a58ac14a16c") pod "postgres-dcb5fbfd-pc95m" (UID: "97c8036c-4e56-446b-bcf7-c70616950216")
Dec 11 17:27:45 garuda-staging-l03r docker[1929]: E1211 17:27:45.532232    1955 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/csi/dobs.csi.digitalocean.com^3cb2ce70-1a04-11ea-8090-0a58ac14a16c\"" failed. No retries permitted until 2019-12-11 17:29:47.532186007 +0000 UTC m=+626.528731112 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"pvc-a34c150a-70da-4daf-93c0-708f53c749c0\" (UniqueName: \"kubernetes.io/csi/dobs.csi.digitalocean.com^3cb2ce70-1a04-11ea-8090-0a58ac14a16c\") pod \"postgres-dcb5fbfd-pc95m\" (UID: \"97c8036c-4e56-446b-bcf7-c70616950216\") "
Dec 11 17:28:46 garuda-staging-l03r docker[1929]: E1211 17:28:46.462539    1955 kubelet.go:1682] Unable to attach or mount volumes for pod "postgres-dcb5fbfd-pc95m_ga-monitor(97c8036c-4e56-446b-bcf7-c70616950216)": unmounted volumes=[postgres-persistent-storage3], unattached volumes=[postgres-persistent-storage3 default-token-27frr]: timed out waiting for the condition; skipping pod
Dec 11 17:28:46 garuda-staging-l03r docker[1929]: E1211 17:28:46.462597    1955 pod_workers.go:191] Error syncing pod 97c8036c-4e56-446b-bcf7-c70616950216 ("postgres-dcb5fbfd-pc95m_ga-monitor(97c8036c-4e56-446b-bcf7-c70616950216)"), skipping: unmounted volumes=[postgres-persistent-storage3], unattached volumes=[postgres-persistent-storage3 default-token-27frr]: timed out waiting for the condition
Dec 11 17:29:47 garuda-staging-l03r docker[1929]: I1211 17:29:47.540951    1955 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-a34c150a-70da-4daf-93c0-708f53c749c0" (UniqueName: "kubernetes.io/csi/dobs.csi.digitalocean.com^3cb2ce70-1a04-11ea-8090-0a58ac14a16c") pod "postgres-dcb5fbfd-pc95m" (UID: "97c8036c-4e56-446b-bcf7-c70616950216")
Dec 11 17:29:47 garuda-staging-l03r docker[1929]: E1211 17:29:47.546452    1955 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/csi/dobs.csi.digitalocean.com^3cb2ce70-1a04-11ea-8090-0a58ac14a16c\"" failed. No retries permitted until 2019-12-11 17:31:49.546422245 +0000 UTC m=+748.542967320 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"pvc-a34c150a-70da-4daf-93c0-708f53c749c0\" (UniqueName: \"kubernetes.io/csi/dobs.csi.digitalocean.com^3cb2ce70-1a04-11ea-8090-0a58ac14a16c\") pod \"postgres-dcb5fbfd-pc95m\" (UID: \"97c8036c-4e56-446b-bcf7-c70616950216\") "
...

Tried restarting the CSI pod, kubelet, and Docker - no luck
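
The kubelet error says the volume is not attached according to node status; as a rough cross-check (using the names from the logs above), the attach state can be compared on the API side:

# Is there a VolumeAttachment for the PV, and is it marked as attached?
kubectl get volumeattachments | grep pvc-a34c150a-70da-4daf-93c0-708f53c749c0

# What the node itself reports as attached (this is what the kubelet compares against)
kubectl get node garuda-staging-l03r -o jsonpath='{.status.volumesAttached}'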

timoreimann commented 4 years ago

👋

Which version of CSI are you running?

Also, are you able to reconstruct what kind of operations / actions led to this state? Specifically, did you possibly force-delete a Kubernetes object at any point?

timoreimann commented 4 years ago

A common error is that a volume is already used by another pod running on a different node. DigitalOcean Block Storage supports RWO only, so any pod trying to reference a PVC already attached on a different node is bound to hang indefinitely.
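
A quick way to check for that situation (a sketch; substitute the actual claim name) is to see which node the volume is attached to and which pods still reference the claim:

# Each attachment is listed with the node it is bound to
kubectl get volumeattachments

# "Used By" in the output shows the pods currently mounting the claim
kubectl describe pvc <pvc-name>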

Paxa commented 4 years ago

I attached 2 volumes to the pod (to move the database to a bigger volume), and after copying the content I updated the deployment to use only the new PVC. The pod then took too long to terminate, so I force-deleted it.

My cluster has only 1 node

Images: quay.io/k8scsi/csi-node-driver-registrar:v1.2.0 and digitalocean/do-csi-plugin:v1.1.2

I tried attaching the volume to the VM manually via the DO dashboard, and it worked

Paxa commented 4 years ago

Having the same issue again; the pod cannot terminate:

csi driver logs

time="2019-12-14T17:55:29Z" level=info msg="target path is already unmounted" method=node_unpublish_volume node_id=170430134 region=sgp1 target_path="/var/lib/kubelet/pods/ae47eda8-5c18-4572-ba2f-3a1bc8f2b277/volumes/kubernetes.io~csi/pvc-2864e37b-5029-48e1-a86e-ec3b7819104d/mount" version=v1.1.2 volume_id=17e306bd-1912-11ea-bc02-0a58ac14a1f9
time="2019-12-14T17:55:29Z" level=info msg="unmounting volume is finished" method=node_unpublish_volume node_id=170430134 region=sgp1 target_path="/var/lib/kubelet/pods/ae47eda8-5c18-4572-ba2f-3a1bc8f2b277/volumes/kubernetes.io~csi/pvc-2864e37b-5029-48e1-a86e-ec3b7819104d/mount" version=v1.1.2 volume_id=17e306bd-1912-11ea-bc02-0a58ac14a1f9
time="2019-12-14T17:57:31Z" level=info msg="node unpublish volume called" method=node_unpublish_volume node_id=170430134 region=sgp1 target_path="/var/lib/kubelet/pods/ae47eda8-5c18-4572-ba2f-3a1bc8f2b277/volumes/kubernetes.io~csi/pvc-2864e37b-5029-48e1-a86e-ec3b7819104d/mount" version=v1.1.2 volume_id=17e306bd-1912-11ea-bc02-0a58ac14a1f9
time="2019-12-14T17:57:31Z" level=info msg="checking if target is mounted" args="[-o TARGET,PROPAGATION,FSTYPE,OPTIONS -M /var/lib/kubelet/pods/ae47eda8-5c18-4572-ba2f-3a1bc8f2b277/volumes/kubernetes.io~csi/pvc-2864e37b-5029-48e1-a86e-ec3b7819104d/mount -J]" cmd=findmnt node_id=170430134 region=sgp1 version=v1.1.2
time="2019-12-14T17:57:31Z" level=info msg="target path is already unmounted" method=node_unpublish_volume node_id=170430134 region=sgp1 target_path="/var/lib/kubelet/pods/ae47eda8-5c18-4572-ba2f-3a1bc8f2b277/volumes/kubernetes.io~csi/pvc-2864e37b-5029-48e1-a86e-ec3b7819104d/mount" version=v1.1.2 volume_id=17e306bd-1912-11ea-bc02-0a58ac14a1f9
time="2019-12-14T17:57:31Z" level=info msg="unmounting volume is finished" method=node_unpublish_volume node_id=170430134 region=sgp1 target_path="/var/lib/kubelet/pods/ae47eda8-5c18-4572-ba2f-3a1bc8f2b277/volumes/kubernetes.io~csi/pvc-2864e37b-5029-48e1-a86e-ec3b7819104d/mount" version=v1.1.2 volume_id=17e306bd-1912-11ea-bc02-0a58ac14a1f9
time="2019-12-14T17:59:33Z" level=info msg="node unpublish volume called" method=node_unpublish_volume node_id=170430134 region=sgp1 target_path="/var/lib/kubelet/pods/ae47eda8-5c18-4572-ba2f-3a1bc8f2b277/volumes/kubernetes.io~csi/pvc-2864e37b-5029-48e1-a86e-ec3b7819104d/mount" version=v1.1.2 volume_id=17e306bd-1912-11ea-bc02-0a58ac14a1f9
time="2019-12-14T17:59:33Z" level=info msg="checking if target is mounted" args="[-o TARGET,PROPAGATION,FSTYPE,OPTIONS -M /var/lib/kubelet/pods/ae47eda8-5c18-4572-ba2f-3a1bc8f2b277/volumes/kubernetes.io~csi/pvc-2864e37b-5029-48e1-a86e-ec3b7819104d/mount -J]" cmd=findmnt node_id=170430134 region=sgp1 version=v1.1.2
time="2019-12-14T17:59:33Z" level=info msg="target path is already unmounted" method=node_unpublish_volume node_id=170430134 region=sgp1 target_path="/var/lib/kubelet/pods/ae47eda8-5c18-4572-ba2f-3a1bc8f2b277/volumes/kubernetes.io~csi/pvc-2864e37b-5029-48e1-a86e-ec3b7819104d/mount" version=v1.1.2 volume_id=17e306bd-1912-11ea-bc02-0a58ac14a1f9
time="2019-12-14T17:59:33Z" level=info msg="unmounting volume is finished" method=node_unpublish_volume node_id=170430134 region=sgp1 target_path="/var/lib/kubelet/pods/ae47eda8-5c18-4572-ba2f-3a1bc8f2b277/volumes/kubernetes.io~csi/pvc-2864e37b-5029-48e1-a86e-ec3b7819104d/mount" version=v1.1.2 volume_id=17e306bd-1912-11ea-bc02-0a58ac14a1f9

describe pod:

Events:
  Type     Reason     Age   From                          Message
  ----     ------     ----  ----                          -------
  Normal   Killing    12m   kubelet, garuda-staging-l03r  Stopping container rabbitmq
  Warning  Unhealthy  12m   kubelet, garuda-staging-l03r  Readiness probe failed:
  Warning  Unhealthy  12m   kubelet, garuda-staging-l03r  Readiness probe failed: cannot exec in a stopped state: unknown

After restarting the CSI pod, it generates the same logs
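
The findmnt check the driver logs under args= can be reproduced directly on the node; empty output means the target path really is unmounted, so node_unpublish_volume keeps reporting success:

findmnt -o TARGET,PROPAGATION,FSTYPE,OPTIONS -J -M /var/lib/kubelet/pods/ae47eda8-5c18-4572-ba2f-3a1bc8f2b277/volumes/kubernetes.io~csi/pvc-2864e37b-5029-48e1-a86e-ec3b7819104d/mount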

timoreimann commented 4 years ago

I'd guess the volume unmount request happens repeatedly as part of a larger control flow that's failing.

Could you share the logs from the sidecars as well as the driver controller part? (If this is a DOKS cluster, then you can also mail or Slack-DM me your cluster ID.)
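
Something along these lines should capture them (the pod and container layout below is an assumption based on the stock manifest; check kubectl -n kube-system get pods | grep csi for the exact names):

# Controller side: driver plus provisioner/attacher sidecars
kubectl -n kube-system logs csi-do-controller-0 --all-containers

# Node side: driver plus registrar sidecar on the affected node
kubectl -n kube-system logs <csi-do-node-pod> --all-containers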

timoreimann commented 4 years ago

@Paxa I discovered that our CSI driver fails to return a proper error code in one case, which I could also observe in your issue. A PR with a fix has been filed and is going to ship with the next release of our driver.

I'm not entirely sure, though, whether that's the sole reason for your issues; let's continue to investigate.

tobinski commented 4 years ago

We have a similar error. If we move pods to a different node, we often see Unable to attach or mount volumes.

timoreimann commented 4 years ago

@tobinski would you mind sharing the same information as https://github.com/digitalocean/csi-digitalocean/issues/242#issuecomment-565755964? Thanks.

Paxa commented 4 years ago

Do you have images built from master? I could try them and see if the problem is solved

timoreimann commented 4 years ago

@Paxa unfortunately, what I thought to be the issue turned out to be an ambiguous interpretation of what the CSI specification expects, so the presumed fix should not be related to what was observed on your cluster. (For posterity: #246 reverted my change again, as it's not clearly needed.)

If you could respond to the email I sent you a couple of days ago, that'd be great as it'd allow us to further investigate the issue in your case.

timoreimann commented 4 years ago

@tobinski's issue seems slightly different: it looks like the problem that #221 addressed, which is going to be included in our next CSI driver release.

timoreimann commented 4 years ago

Issues were resolved externally. Closing.