cormachogan / vsphere-csi-helmchart

Helm Chart for vSphere CSI Driver

Issues with startup: vsphere-csi-controller 5/6 Error - csi-resizer crashing #6

Closed: TJM closed this issue 4 years ago

TJM commented 4 years ago

I tried to install this and got...

[tmcneely@DVA-C02ZG12HLVDV vsphere-csi-helmchart] (⎈ |admin@den3test:vsphere)$ k get po
NAME                                      READY   STATUS              RESTARTS   AGE
vsphere-cpi-4cfzk                         0/1     ContainerCreating   0          3m7s
vsphere-cpi-d5vl5                         0/1     ContainerCreating   0          3m7s
vsphere-cpi-xl8bj                         0/1     ContainerCreating   0          3m7s
vsphere-csi-controller-55d85c9cd4-5vmqc   5/6     Error               5          3m7s
vsphere-csi-node-6f5dd                    3/3     Running             0          3m7s
vsphere-csi-node-874h8                    3/3     Running             0          3m7s
vsphere-csi-node-fcjsc                    3/3     Running             0          3m7s
vsphere-csi-node-fhvxm                    3/3     Running             0          3m7s
vsphere-csi-node-s577v                    3/3     Running             0          3m7s
vsphere-csi-node-t7nrw                    3/3     Running             0          3m7s

The vsphere-cpi pods are stuck on:

Events:
  Type     Reason       Age                  From                                Message
  ----     ------       ----                 ----                                -------
  Normal   Scheduled    2m58s                default-scheduler                   Successfully assigned vsphere/vsphere-cpi-4cfzk to den3l6kubem02.davita.corp
  Warning  FailedMount  55s                  kubelet, den3l6kubem02.davita.corp  Unable to attach or mount volumes: unmounted volumes=[vsphere-config-volume], unattached volumes=[vsphere-config-volume cloud-controller-manager-token-rqf9n]: timed out waiting for the condition
  Warning  FailedMount  50s (x9 over 2m58s)  kubelet, den3l6kubem02.davita.corp  MountVolume.SetUp failed for volume "vsphere-config-volume" : configmap "cloud-config" not found

Apparently we need a ConfigMap named cloud-config? Is that something missing from the helm chart? My guess is that it has roughly the same contents as secret/vsphere-config-secret.


The csi-controller is stuck in CrashLoopBackOff on csi-resizer:

[tmcneely@DVA-C02ZG12HLVDV vsphere-csi-helmchart] (⎈ |admin@den3test:vsphere)$ k logs vsphere-csi-controller-55d85c9cd4-5vmqc csi-resizer
I0911 00:04:17.062052       1 main.go:61] Version : v0.3.0-0-g150071d
I0911 00:04:17.063050       1 connection.go:151] Connecting to unix:///csi/csi.sock
I0911 00:04:17.063663       1 common.go:111] Probing CSI driver for readiness
F0911 00:04:17.066109       1 main.go:72] failed to check if plugin supports node resize: error getting node capabilities: rpc error: code = Unimplemented desc = unknown service csi.v1.Node

cormachogan commented 4 years ago

Yes - we placed a link to the CCM/CPI chart in this helm chart. The cloud controller manager and the CSI driver each have their own configuration - we are not able to share them between the controllers at this time, even though much of the info is similar. From the Pod listing, it looks like CCM/CPI has not started successfully, which I guess is because it cannot find the cloud-config configmap for the cloud-controller-manager. This could be due to deploying in a namespace other than kube-system. Again, I'll try to find out.

mylesagray commented 4 years ago

@TJM - we've added the docs to set the namespace for the CPI - try that and see if you have any luck with your PR #2

TJM commented 4 years ago

My first problem was vsphere-cpi.config.enabled ... what is the deal with that? If the configuration is required to start up, why would it be disabled by default? At least vsphere-csi's default value is true :) (shrug)

For what it's worth, it doesn't look like that version (0.1.3) of vsphere-cpi was published properly... not sure if that is by design?

On a positive note, it looks like the vsphere-cpi 0.1.3 helm chart uses {{ .Release.Namespace }}, which seems much more correct to me than setting it in Values. I would suggest adjusting the csi chart to do the same.

Now, it looks like I have a rolebinding issue, which is probably related to the namespace (since rolebindings tend to be tied to a namespace). On a positive note the error message in the go code indicates the problem, so I can probably fix that :)

TJM commented 4 years ago

NOTE: I checked on the role binding thing, and even in the most recent version (0.2.0) the problem was still there, so I filed https://github.com/helm/charts/issues/23765

TJM commented 4 years ago

So... I have tried restarting the vsphere-csi controller pod several times, and it is still erroring on the csi-resizer container:

[tmcneely@DVA-C02ZG12HLVDV vsphere-cpi] $ k logs vsphere-csi-controller-577f5c7468-zn2qb csi-resizer
I0914 16:41:02.584012       1 main.go:61] Version : v0.3.0-0-g150071d
I0914 16:41:02.585322       1 connection.go:151] Connecting to unix:///csi/csi.sock
I0914 16:41:02.586176       1 common.go:111] Probing CSI driver for readiness
F0914 16:41:02.588415       1 main.go:72] failed to check if plugin supports node resize: error getting node capabilities: rpc error: code = Unimplemented desc = unknown service csi.v1.Node

... same error as before, so the problem is apparently not that the vsphere-cpi pods were not ready.
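
For anyone comparing crash logs across restarts, here is a minimal Python sketch that pulls the gRPC status out of the fatal line above. It assumes the standard klog header layout (severity letter, mmdd date, time, thread id, file:line) and the "rpc error: code = ... desc = ..." wording seen in this log; the function and pattern names are illustrative and not part of csi-resizer.

```python
import re

# klog header: severity letter, mmdd, time, thread id, "file:line]", then the message.
KLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) (?P<time>[\d:.]+)\s+(?P<thread>\d+) "
    r"(?P<source>[^\]]+)\] (?P<message>.*)$"
)
# gRPC status text as it appears in the log message above.
GRPC_STATUS = re.compile(r"rpc error: code = (?P<code>\w+) desc = (?P<desc>.*)$")


def parse_fatal(line):
    """Return source location and gRPC status for a fatal (F) klog line, else None."""
    match = KLOG_LINE.match(line.strip())
    if match is None or match.group("severity") != "F":
        return None
    status = GRPC_STATUS.search(match.group("message"))
    return {
        "source": match.group("source"),
        "code": status.group("code") if status else None,
        "desc": status.group("desc") if status else None,
    }


fatal = (
    "F0914 16:41:02.588415       1 main.go:72] failed to check if plugin supports "
    "node resize: error getting node capabilities: rpc error: code = Unimplemented "
    "desc = unknown service csi.v1.Node"
)
result = parse_fatal(fatal)
print(result["source"], result["code"], result["desc"])
```

The Unimplemented code with "unknown service csi.v1.Node" says the endpoint behind /csi/csi.sock does not serve the CSI Node service at all, which is a different failure from a driver that is merely slow to become ready.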

[tmcneely@DVA-C02ZG12HLVDV vsphere-cpi] $ k get po
NAME                                      READY   STATUS             RESTARTS   AGE
vsphere-cpi-2s6wc                         1/1     Running            0          30m
vsphere-cpi-998d2                         1/1     Running            0          30m
vsphere-cpi-crd8v                         1/1     Running            0          30m
vsphere-csi-controller-577f5c7468-zn2qb   5/6     CrashLoopBackOff   5          3m53s
vsphere-csi-node-4rknj                    3/3     Running            0          62m
vsphere-csi-node-8fbnn                    3/3     Running            0          62m
vsphere-csi-node-8rt9g                    3/3     Running            0          62m
vsphere-csi-node-hkwzf                    3/3     Running            0          62m
vsphere-csi-node-kcjqx                    3/3     Running            0          62m
vsphere-csi-node-lxlbb                    3/3     Running            0          62m

cormachogan commented 4 years ago

Is this vSphere 6.7U3 or vSphere 7.0? The CSI resizer is only supported (beta) in 7.0. I ask, as it sounds like the issue reported here - https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/250

TJM commented 4 years ago

Good catch! We are definitely on 6.7 (probably U3) (16046713).

Should we try to do some sort of version parameter, and exclude that container if it's less than 7? Or is that a bug somewhere else?
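
To make the version-parameter idea concrete, here is a rough Python sketch of the check. It assumes 7.0 is the cut-off mentioned earlier in this thread and that the version arrives as free text such as "6.7U3" or "7.0"; the function names are hypothetical and do not correspond to anything in the chart or the CSI driver.

```python
import re

# Assumption taken from this thread: the resizer is only supported (beta) from vSphere 7.0.
MIN_RESIZER_VERSION = (7, 0)


def parse_vsphere_version(text):
    """Extract (major, minor) from strings such as '6.7U3', '7.0' or 'vSphere 6.7'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no major.minor version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def resizer_enabled(vsphere_version):
    """Decide whether the csi-resizer container should be deployed at all."""
    return parse_vsphere_version(vsphere_version) >= MIN_RESIZER_VERSION


for version in ("6.7U3", "7.0"):
    print(version, resizer_enabled(version))
```

The result could then drive a chart value that includes or omits the csi-resizer container; the value name would be up to the chart.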

TJM commented 4 years ago

See #8 for initial attempt at a fix... of course I couldn't just add a "disable" for this one container, I had to go and try to make all the images variables... and I am having issues.

cormachogan commented 4 years ago

> Good catch! We are definitely on 6.7 (probably U3) (16046713).
>
> Should we try to do some sort of version parameter, and exclude that container if it's less than 7? Or is that a bug somewhere else?

Probably something that should be fixed directly in the vSphere CSI driver in my opinion. Again, we don't really want the helm chart to do anything that forks it from the behaviour of the base vSphere CSI driver.

TJM commented 4 years ago

Do we expect them to log a message and cleanly exit? (as opposed to exiting as failure)

Perhaps they can do some sort of version query at startup?

If I were one of the csi-resizer developers, I would suggest simply not starting that container on vSphere versions below 7.0 (which is what I am planning to try to do).

According to the bug report linked above, the csi-resizer was "missing" from their manual installation steps, so perhaps leaving it out is still the best option from the helm chart perspective.

Tommy

cormachogan commented 4 years ago

Not sure Tommy - please go ahead with what you are planning for the helm chart. Thinking about it, this can only be a good addition to have, even if it is not aligned with manual CSI driver install behaviour.

TJM commented 4 years ago

[tmcneely@DVA-C02ZG12HLVDV vsphere-csi-helmchart] (⎈ |admin@den3test:vsphere)$ k get po
NAME                                     READY   STATUS    RESTARTS   AGE
vsphere-cpi-7572g                        1/1     Running   0          22m
vsphere-cpi-vvlw8                        1/1     Running   0          22m
vsphere-cpi-wkxfp                        1/1     Running   0          22m
vsphere-csi-controller-fbf8bbc5c-l6nzq   5/5     Running   0          55s
vsphere-csi-node-2kbn2                   3/3     Running   0          9m24s
vsphere-csi-node-4jgkx                   3/3     Running   0          9m15s
vsphere-csi-node-hkckd                   3/3     Running   0          9m1s
vsphere-csi-node-rwc87                   3/3     Running   0          9m46s
vsphere-csi-node-t57qw                   3/3     Running   0          9m38s
vsphere-csi-node-xx6l8                   3/3     Running   0          9m50s

TJM commented 4 years ago

Fixed by #8