digitalocean / csi-digitalocean

A Container Storage Interface (CSI) Driver for DigitalOcean Block Storage
Apache License 2.0

csi-do-controller-0 CrashLoopBackOff: couldn't get metadata: Get "http://169.254.169.254/metadata/v1.json" #328

Open max3903 opened 4 years ago

max3903 commented 4 years ago

What did you do? (required. The issue will be closed when not provided.)

I followed the documentation to add the do-block-storage plugin:

I added the secret successfully and ran:

kubectl apply -f https://raw.githubusercontent.com/digitalocean/csi-digitalocean/master/deploy/kubernetes/releases/csi-digitalocean-v1.3.0.yaml

It failed on a snapshot-related CRD:

CustomResourceDefinition.apiextensions.k8s.io "volumesnapshots.snapshot.storage.k8s.io" is invalid: spec.version: Invalid value: "v1alpha1": must match the first version in spec.versions

I moved on (I believe it is fixed by #322) and tried to create a PVC.

What did you expect to happen?

I was expecting the PV to be created.

Configuration (MUST fill this out):

https://gist.github.com/max3903/acb18527be1138a33d77f3eaaddb89a8

secret.yaml:

apiVersion: v1
kind: Secret
metadata:
  name: digitalocean
  namespace: kube-system
stringData:
  access-token: "3e8[...]ec5"

pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jenkins-data
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: do-block-storage

CSI version: 1.3.0

Kubernetes version: 1.17

Cloud provider/framework version: OKD 4.5

max3903 commented 4 years ago

Other information:

I am using OKD 4.5 on Fedora CoreOS 31.

The pod csi-do-controller-0 remains in status CrashLoopBackOff.

4 out of 5 containers are in state Running but have this error message in the log:

connection.go:170] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock

The last one, csi-do-plugin (digitalocean/do-csi-plugin:v1.3.0), remains in state Waiting and the log says:

couldn't get metadata: Get "http://169.254.169.254/metadata/v1.json": dial tcp 169.254.169.254:80: connect: connection refused (are you running on DigitalOcean droplets?)
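
For readers retracing this step, the log of a single container in a multi-container pod can be read with `kubectl logs` and the `-c` flag. A minimal sketch, assuming the pod, namespace, and container names quoted in this issue; the function name `plugin_logs` and the default of 20 lines are illustrative choices, not part of the driver:

```shell
# Print the last N log lines (default 20) of the csi-do-plugin container
# in the controller pod. Names are taken from this issue; adjust as needed.
plugin_logs() {
  kubectl logs csi-do-controller-0 -n kube-system -c csi-do-plugin \
    --tail="${1:-20}"
}
```

Calling `plugin_logs 50` would request the last 50 lines. On OKD, `oc` accepts the same arguments.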

On the worker, the csi.sock is not in:

/var/lib/csi/sockets/pluginproxy/csi.sock

but in

/var/lib/kubelet/plugins/dobs.csi.digitalocean.com/csi.sock

timoreimann commented 4 years ago

Hi @max3903

the error

couldn't get metadata: Get "http://169.254.169.254/metadata/v1.json": dial tcp 169.254.169.254:80: connect: connection refused (are you running on DigitalOcean droplets?)

is odd because it usually means that you are not running on DigitalOcean infrastructure (as the error indicates). However, I do see a DO region label on one of your Nodes. Can you confirm that you are indeed running on droplets? Can you connect to the metadata endpoint from your nodes?
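
One way to answer the connectivity question is to probe the endpoint with curl from a shell on the node. This is a rough sketch: the default URL is the one from the error message, while the function name `check_metadata` and the 5-second timeout are arbitrary assumptions:

```shell
# Report whether the metadata endpoint answers from the current
# network namespace. Returns non-zero when it does not.
check_metadata() {
  url="${1:-http://169.254.169.254/metadata/v1.json}"
  if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
    echo "reachable: $url"
  else
    echo "unreachable: $url"
    return 1
  fi
}
```

Running `check_metadata` on the host and again inside a pod shows whether the endpoint is reachable from one network namespace but not the other.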

What might also be good to know: did you try to apply the manifests on a cluster that had a previous version of the CSI driver installed already, or was this a first-time CSI installation attempt?

max3903 commented 4 years ago

Hello @timoreimann

Yes I am running on droplets built from a custom image: Fedora CoreOS 31 for Digital Ocean from https://getfedora.org/en/coreos/download?tab=cloud_operators&stream=stable

Yes, I can connect to the metadata endpoint from the 3 masters and 2 workers. That is actually how each droplet gets its hostname during the installation: see https://github.com/coreos/fedora-coreos-tracker/issues/538

Yes, I tried to apply the manifests multiple times using different versions/URLs. I tried 0.3.0 first, then latest, and finally 1.3.0.

max3903 commented 4 years ago

So I ran:

oc delete -f https://raw.githubusercontent.com/digitalocean/csi-digitalocean/master/deploy/kubernetes/releases/csi-digitalocean-v0.3.0.yaml

oc apply -f https://raw.githubusercontent.com/digitalocean/csi-digitalocean/master/deploy/kubernetes/releases/csi-digitalocean-v1.3.0.yaml

I don't know if it helps but the container created from the DaemonSet is working fine on the same node.

Only the one created from the StatefulSet is crashing...

timoreimann commented 4 years ago

@max3903 CSI driver in version 0.3.0 definitely does not support Kubernetes 1.17. (See also our support matrix.) If you installed that first, the subsequent 1.3.0 installation most likely failed because of unsupported (and broken) left-overs from 0.3.0.

Can you try to install v1.3.0 from a clean slate, i.e., on a 1.17 cluster that does not come with any other (older) CSI driver versions installed beforehand?

max3903 commented 4 years ago

Even after running:

oc delete -f https://raw.githubusercontent.com/digitalocean/csi-digitalocean/master/deploy/kubernetes/releases/csi-digitalocean-v0.3.0.yaml

?

max3903 commented 4 years ago

@timoreimann Installing the cluster was a pretty painful process I would like to avoid.

I removed all the csi* images from all the masters and workers:

podman image rm docker.io/digitalocean/do-csi-plugin:v1.3.0
podman image rm docker.io/digitalocean/do-csi-plugin:dev
podman image rm quay.io/k8scsi/csi-node-driver-registrar:v1.1.0
podman image rm quay.io/k8scsi/csi-resizer:v0.3.0
podman image rm quay.io/k8scsi/csi-snapshotter:v1.2.2
podman image rm quay.io/k8scsi/csi-provisioner:v1.4.0
podman image rm quay.io/k8scsi/csi-attacher:v2.0.0

and installed the correct version (1.3.0). I still get the same error.

Which left-overs am I missing?

timoreimann commented 4 years ago

Check for any snapshot-related CRDs that might be remaining (kubectl get crd) and delete them.
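
A possible way to narrow the CRD list down to the snapshot ones is to filter the output of `kubectl get crd -o name` on the `snapshot.storage.k8s.io` group seen in the error message above. This is only a sketch; the helper name `snapshot_crds` is made up for illustration:

```shell
# Read CRD names on stdin (one per line, as printed by
# `kubectl get crd -o name`) and keep only those in the
# snapshot.storage.k8s.io group. Prints nothing if none match.
snapshot_crds() {
  grep 'snapshot\.storage\.k8s\.io$' || true
}
```

Usage would be `kubectl get crd -o name | snapshot_crds`. After reviewing the list, each printed name can be passed to `kubectl delete`.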

max3903 commented 4 years ago

@timoreimann

I deleted them.

No errors when running:

kubectl apply -f https://raw.githubusercontent.com/digitalocean/csi-digitalocean/master/deploy/kubernetes/releases/csi-digitalocean-v1.3.0.yaml

Still the same behavior on the controller, i.e., the pod csi-do-controller-0 remains in status CrashLoopBackOff.

4 out of 5 containers are in state Running but have this error message in the log:

connection.go:170] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock

The last one, csi-do-plugin (digitalocean/do-csi-plugin:v1.3.0), remains in state Waiting and the log says:

couldn't get metadata: Get "http://169.254.169.254/metadata/v1.json": dial tcp 169.254.169.254:80: connect: connection refused (are you running on DigitalOcean droplets?)

If I replace the args at https://github.com/digitalocean/csi-digitalocean/blob/master/deploy/kubernetes/releases/csi-digitalocean-v1.3.0.yaml#L194 with:

          args:
            - "--version"

I get this message in the logs of the container:

latest - 59e354368961c4688243fc083c94b963c276e5b4 (clean)

I tried to run the container on the worker:

$ podman run digitalocean/do-csi-plugin:v1.3.0 \
    --endpoint=unix:///var/lib/csi/sockets/pluginproxy/csi.sock \
    --url=https://api.digitalocean.com/ \
    --token=3e8****ec5
time="2020-06-18T00:05:44Z" level=info msg="removing socket" host_id=196466821 region=sfo3 socket=/var/lib/csi/sockets/pluginproxy/csi.sock version=latest
2020/06/18 00:05:44 failed to listen: listen unix /var/lib/csi/sockets/pluginproxy/csi.sock: bind: no such file or directory

I also tried to use curl to create a volume through the API from the same node and it worked:

curl -X POST -H "Content-Type: application/json" \
    -H "Authorization: Bearer 3e8***ec5" \
    -d '{"size_gigabytes":10, "name": "example", "description": "Block store for examples", "region": "sfo3", "filesystem_type": "ext4", "filesystem_label": "example"}' \
    "https://api.digitalocean.com/v2/volumes"

The container from the same image on the same node from the DaemonSet is still working fine:

time="2020-06-17T23:00:21Z" level=info msg="removing socket" host_id=196466821 region=sfo3 socket=/csi/csi.sock version=latest
time="2020-06-17T23:00:21Z" level=info msg="starting server" grpc_addr=/csi/csi.sock host_id=196466821 http_addr= region=sfo3 version=latest
time="2020-06-17T23:00:22Z" level=info msg="get plugin info called" host_id=196466821 method=get_plugin_info region=sfo3 response="name:\"dobs.csi.digitalocean.com\" vendor_version:\"latest\" " version=latest
time="2020-06-17T23:00:23Z" level=info msg="node get info called" host_id=196466821 method=node_get_info region=sfo3 version=latest

FYI, all droplets are Fedora CoreOS 31 in SFO3 with this workaround to set the hostname: https://github.com/coreos/fedora-coreos-tracker/issues/538

lucab commented 4 years ago

The "couldn't get metadata" error is likely a red herring caused by the manual podman run, whose network namespace setup differs from what the k8s manifest produces.

max3903 commented 4 years ago

@timoreimann With @lucab's and @dustymabe's help, I got it working by adding:

      hostNetwork: true
      securityContext:
        privileged: true

in https://github.com/digitalocean/csi-digitalocean/blob/release-1.3/deploy/kubernetes/releases/csi-digitalocean-v1.3.0.yaml#L142

timoreimann commented 4 years ago

@max3903 glad you figured it out. 🎉
Do I understand correctly that you needed to add the hostNetwork / privileged fields to the Controller service? (We do have it set on the Node service in the manifest.)

FWIW, the manifest you referenced (and had to amend) is what we use for our end-to-end tests as-is: we deploy it into a DOKS cluster and run the upstream e2e tests against it. I'm confused why it didn't work for you -- wondering if there's perhaps something specific about OKD (or DOKS) that explains the difference in behavior?

max3903 commented 4 years ago

@timoreimann Yes on the controller.

@dustymabe mentioned that OpenShift has stricter security settings than base Kubernetes.

dustymabe commented 4 years ago

> @dustymabe mentioned that openshift has stricter security settings than base kubernetes.

Typically that is the case. Unfortunately I don't have enough expertise to know what those extra security defaults are or if that's the cause of the issues here. I just know enough to bring up that it could be the cause.

dustymabe commented 4 years ago

> @timoreimann With @lucab and @dustymabe help, I got it working by adding:
>
>       hostNetwork: true
>       securityContext:
>         privileged: true
>
> in https://github.com/digitalocean/csi-digitalocean/blob/release-1.3/deploy/kubernetes/releases/csi-digitalocean-v1.3.0.yaml#L142

This seems to be working for me with just the hostNetwork: true change. I don't think privileged: true is needed.

timoreimann commented 4 years ago

Right, privileged mode should be needed on the Node service only to allow mount propagation. I don't think we have it set on our Controller service manifest.

timoreimann commented 4 years ago

If you'd like to submit a quick PR to document the need to run on host network in OKD (and perhaps leave a commented out hostNetwork: true field in the manifest), I'd be happy to review that.

dustymabe commented 4 years ago

Thanks @timoreimann. Do you think it would make sense to do it by default instead of having it commented out?

timoreimann commented 4 years ago

@dustymabe the only platform I'm aware of at this point that requires host networking to be enabled on the Controller service seems to be OKD, so I'm more inclined to keep it commented out for now. If someone could find out more specific reasons why it's needed in OKD, though, we could better judge whether other platforms / systems may be affected as well.

dustymabe commented 4 years ago

I changed the csi-do-plugin container within the pod to just sleep so I could exec in there and poke around.

/ # ip -4 -o a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
3: eth0    inet 10.129.0.61/23 brd 10.129.1.255 scope global eth0\       valid_lft forever preferred_lft forever
/ # busybox wget http://169.254.169.254/metadata/v1.json
Connecting to 169.254.169.254 (169.254.169.254:80)
wget: can't connect to remote host (169.254.169.254): Connection refused

It might be worth noting that OKD uses OVN networking: https://docs.openshift.com/container-platform/4.5/networking/ovn_kubernetes_network_provider/about-ovn-kubernetes.html. Unfortunately I don't know much about the networking side so I'm a bit limited in understanding this.

As a temporary workaround, this patch command should work for users:

PATCH='
spec:
  template:
    spec:
      hostNetwork: true'
oc patch statefulset/csi-do-controller -n kube-system --type merge -p "$PATCH"
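
To confirm that the patch was recorded on the StatefulSet, the field can be read back with a jsonpath query. This is a hedged sketch: the resource name and namespace are those used in the patch command above, and the helper name `controller_uses_host_network` is invented for illustration:

```shell
# Succeeds (exit status 0) only if the controller StatefulSet's pod
# template has hostNetwork set to true. An unset field prints nothing.
controller_uses_host_network() {
  value=$(kubectl get statefulset csi-do-controller -n kube-system \
    -o jsonpath='{.spec.template.spec.hostNetwork}')
  [ "$value" = "true" ]
}
```

For example: `controller_uses_host_network && echo "hostNetwork enabled"`. This only checks the StatefulSet spec; whether the pod has already been recreated with the new setting needs a separate look at the pod itself.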

Can we change the title of this to csi-do-controller-0 CrashLoopBackOff: couldn't get metadata: Get "http://169.254.169.254/metadata/v1.json" so others can find it more easily?

max3903 commented 4 years ago

@dustymabe Done!

grumps commented 4 years ago

👋 So I've run into this issue as well using K3s on DO. I was finally able to get things running with hostNetwork: true. I'm using the default network driver, flannel, but it does use containerd as the runtime.

dustymabe commented 1 year ago

I can confirm that the workaround in https://github.com/digitalocean/csi-digitalocean/issues/328#issuecomment-666513011 still works for me today.