cloudfoundry-incubator / kubecf

Cloud Foundry on Kubernetes
Apache License 2.0

Kubecf Helm install on Openshift 4.7 hangs on pod initialization for "cf-apps-dns" deployment #1700

Closed — keunlee closed this issue 3 years ago

keunlee commented 3 years ago

Describe the bug

Deploying the KubeCF helm chart onto Openshift 4.7.0 results in an incomplete installation: the pod for the "cf-apps-dns" deployment hangs on initialization.

To Reproduce

1) obtain the latest bundle (v2.7.12) and decompress:

wget https://github.com/cloudfoundry-incubator/kubecf/releases/download/v2.7.12/kubecf-bundle-v2.7.12.tgz

tar zxvf kubecf-bundle-v2.7.12.tgz

2) create "cf-operator" project/namespace

kubectl create namespace cf-operator

3) install "cf-operator/quarks" helm chart from release bundle:

helm install cf-operator --namespace cf-operator --set "global.singleNamespace.name=kubecf" cf-operator.tgz --wait

4) install "kubecf" helm chart from release bundle:

helm install kubecf --namespace kubecf --set "system_domain=kubecf.mydomain.local" kubecf_release.tgz

This results in the cf-apps-dns deployment having a pod stuck indefinitely in initialization.

kubectl -n kubecf get po

NAME                           READY   STATUS     RESTARTS   AGE
cf-apps-dns-59f9f659f5-jxksd   0/1     Init:0/1   0          3m9s

If we observe the events after the KubeCF chart installation:

kubectl -n kubecf get events --sort-by=.metadata.creationTimestamp

the first warning encountered in the event stream reads:

TYPE      REASON                    OBJECT                                    MESSAGE
Warning   FailedMount               pod/cf-apps-dns-59f9f659f5-t5hwp          MountVolume.SetUp failed for volume "client-tls" : secret "var-cf-app-sd-client-tls" not found

Expected behavior

The cf-apps-dns deployment should continue and be able to find all its secrets (i.e. var-cf-app-sd-client-tls).

The cf-apps-dns deployment should not have its corresponding pod hang on initialization.

Environment

Additional context

keunlee commented 3 years ago

An update on my latest findings from my own investigation:

1) first things first, make sure to observe the logs of the operator!

export OPERATOR_POD=$(kubectl get pods -l name=cf-operator --namespace cf-operator --output name)

kubectl -n cf-operator logs $OPERATOR_POD -f

2) regarding the issue I was facing, the logs gave one particular tell-tale clue:

2021-03-07T21:24:02.211Z        ERROR   controller-runtime.manager.controller.boshdeployment-controller controller/controller.go:252    Reconciler error  {"name": "kubecf", "namespace": "kubecf", "error": "failed to create quarks secrets for BOSH manifest 'kubecf/kubecf': creating or updating QuarksSecret 'kubecf/var-blobstore-admin-users-password': quarkssecrets.quarks.cloudfoundry.org \"var-blobstore-admin-users-password\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>"}

This means that the role which governs the CRD quarkssecrets.quarks.cloudfoundry.org is missing a permission for creating/updating finalizers.

How to get around this?

I took a brute-force "sledgehammer" approach and assigned the following rule to the following roles:

rule

rules:
  - verbs:
      - '*'
    apiGroups:
      - '*'
    resources:
      - '*'

roles

cf-operator-quarks-cluster
cf-operator-quarks-job
cf-operator-quarks-secret
cf-operator-quarks-statefulset

I wouldn't do this on a production cluster, but it gets the ball rolling.
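For reference, that sledgehammer patch can be scripted. The sketch below only prints the kubectl patch commands instead of running them; the role names and the cf-operator namespace are assumptions taken from this environment, so review and adjust before applying:

```shell
# Sketch only: print (not run) "kubectl patch" commands that append the
# wildcard rule to each quarks role. Role names and the cf-operator
# namespace are assumptions from this environment.
patch='[{"op":"add","path":"/rules/-","value":{"verbs":["*"],"apiGroups":["*"],"resources":["*"]}}]'
cmds=""
for role in cf-operator-quarks-cluster cf-operator-quarks-job \
            cf-operator-quarks-secret cf-operator-quarks-statefulset; do
  cmds="${cmds}kubectl -n cf-operator patch role ${role} --type=json -p='${patch}'
"
done
printf '%s' "$cmds"
```

Review the printed commands and then pipe them to sh to apply; again, wildcard rules like this are not appropriate for production.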

I will post back with further updates as things pan out.

keunlee commented 3 years ago

So far, the deployment has been stuck on the diego-cell-0 and router-0 pods.

Every 2.0s: oc get po                                                                                    

NAME                                     READY   STATUS                  RESTARTS   AGE
api-0                                    17/17   Running                 1          31m
auctioneer-0                             6/6     Running                 4          31m
cc-worker-0                              6/6     Running                 0          31m
cf-apps-dns-59f9f659f5-w7pmx             1/1     Running                 0          56m
coredns-quarks-546fdbd7cb-kd866          1/1     Running                 0          54m
coredns-quarks-546fdbd7cb-td6zj          1/1     Running                 0          54m
credhub-0                                8/8     Running                 0          31m
database-0                               2/2     Running                 0          54m
database-seeder-655e676ea24e1b8f-fzgv7   0/2     Completed               0          55m
diego-api-0                              9/9     Running                 3          31m
diego-cell-0                             0/12    Init:CrashLoopBackOff   8          31m
doppler-0                                6/6     Running                 0          31m
log-api-0                                9/9     Running                 0          31m
log-cache-0                              10/10   Running                 0          31m
nats-0                                   7/7     Running                 0          31m
router-0                                 0/7     Init:CrashLoopBackOff   8          31m
routing-api-0                            6/6     Running                 2          31m
scheduler-0                              12/12   Running                 1          31m
singleton-blobstore-0                    8/8     Running                 0          31m
tcp-router-0                             7/7     Running                 0          31m
uaa-0                                    9/9     Running                 0          31m
keunlee commented 3 years ago

The only log I was able to extract from the diego-cell-0 pod was from the bosh-pre-start-cflinuxfs3-rootfs-setup container.

+ '[' -x /var/vcap/jobs/cflinuxfs3-rootfs-setup/bin/pre-start ']'
+ /var/vcap/jobs/cflinuxfs3-rootfs-setup/bin/pre-start
+ CONF_DIR=/var/vcap/jobs/cflinuxfs3-rootfs-setup/config
+ ROOTFS_PACKAGE=/var/vcap/packages/cflinuxfs3
+ ROOTFS_DIR=/var/vcap/data/rep/cflinuxfs3/rootfs
+ ROOTFS_TAR=/var/vcap/data/rep/cflinuxfs3/rootfs.tar
+ TRUSTED_CERT_FILE=/var/vcap/jobs/cflinuxfs3-rootfs-setup/config/certs/trusted_ca.crt
+ CA_DIR=/var/vcap/data/rep/cflinuxfs3/rootfs/usr/local/share/ca-certificates/
+ '[' '!' -d /var/vcap/data/rep/cflinuxfs3/rootfs ']'
+ grep -q trusted_ca_certificates /var/vcap/jobs/cflinuxfs3-rootfs-setup/config/certs/trusted_ca.crt
+ JSON_CERT_FILE=/var/vcap/jobs/cflinuxfs3-rootfs-setup/config/certs/trusted_ca.json
+ cp /var/vcap/jobs/cflinuxfs3-rootfs-setup/config/certs/trusted_ca.crt /var/vcap/jobs/cflinuxfs3-rootfs-setup/config/certs/trusted_ca.json
+ TRUSTED_CERT_FILE=/var/vcap/jobs/cflinuxfs3-rootfs-setup/config/certs/trusted_ca.json
+ rm -f /var/vcap/data/rep/cflinuxfs3/rootfs/usr/local/share/ca-certificates/trusted-ca-1.crt /var/vcap/data/rep/cflinuxfs3/rootfs/usr/local/share/ca-certificates/trusted-ca-2.crt /var/vcap/data/rep/cflinuxfs3/rootfs/usr/local/share/ca-certificates/trusted-ca-3.crt
+ mkdir -p /var/vcap/data/rep/cflinuxfs3/rootfs/usr/local/share/ca-certificates/
+ /var/vcap/packages/rootfs-certsplitter-cflinuxfs3/bin/certsplitter /var/vcap/jobs/cflinuxfs3-rootfs-setup/config/certs/trusted_ca.json /var/vcap/data/rep/cflinuxfs3/rootfs/usr/local/share/ca-certificates/
trying to run update-ca-certificates...
trying to run update-ca-certificates...
trying to run update-ca-certificates...
failed to setup ca certificates
+ chmod 0644 /var/vcap/data/rep/cflinuxfs3/rootfs/usr/local/share/ca-certificates//trusted-ca-1.crt /var/vcap/data/rep/cflinuxfs3/rootfs/usr/local/share/ca-certificates//trusted-ca-2.crt /var/vcap/data/rep/cflinuxfs3/rootfs/usr/local/share/ca-certificates//trusted-ca-3.crt
+ updated_certs=1
+ retry_count=0
+ max_retry_count=3
+ set +e
+ '[' 1 -eq 0 ']'
+ '[' 0 -ge 3 ']'
+ echo 'trying to run update-ca-certificates...'
+ TMPDIR=/tmp
+ timeout --signal=KILL 60s chroot /var/vcap/data/rep/cflinuxfs3/rootfs /usr/sbin/update-ca-certificates -f
chroot: cannot change root directory to '/var/vcap/data/rep/cflinuxfs3/rootfs': Operation not permitted
+ updated_certs=125
+ retry_count=1
+ '[' 125 -eq 0 ']'
+ '[' 1 -ge 3 ']'
+ echo 'trying to run update-ca-certificates...'
+ TMPDIR=/tmp
+ timeout --signal=KILL 60s chroot /var/vcap/data/rep/cflinuxfs3/rootfs /usr/sbin/update-ca-certificates -f
chroot: cannot change root directory to '/var/vcap/data/rep/cflinuxfs3/rootfs': Operation not permitted
+ updated_certs=125
+ retry_count=2
+ '[' 125 -eq 0 ']'
+ '[' 2 -ge 3 ']'
+ echo 'trying to run update-ca-certificates...'
+ TMPDIR=/tmp
+ timeout --signal=KILL 60s chroot /var/vcap/data/rep/cflinuxfs3/rootfs /usr/sbin/update-ca-certificates -f
chroot: cannot change root directory to '/var/vcap/data/rep/cflinuxfs3/rootfs': Operation not permitted
+ updated_certs=125
+ retry_count=3
+ '[' 125 -eq 0 ']'
+ '[' 3 -ge 3 ']'
+ set -e
+ '[' 125 -ne 0 ']'
+ echo 'failed to setup ca certificates'
+ exit 1

real    0m0.117s
user    0m0.005s
sys 0m0.024s
jandubois commented 3 years ago

the role which governs the CRD quarkssecrets.quarks.cloudfoundry.org is missing a permission for creating/updating finalizers

@manno or @rohitsakala can you please take a look if this is a Quarks issue?

keunlee commented 3 years ago

For the router-0 pod, most of the containers reported logs; however, the only one I could identify that reported something remotely off was the bpm-pre-starter-gorouter container:

+ /var/vcap/jobs/gorouter/bin/bpm-pre-start
unable to set CAP_SETFCAP effective capability: Operation not permitted

real    0m0.008s
user    0m0.003s
sys 0m0.004s
jandubois commented 3 years ago

A Google search for "forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>" shows several entries from the Openshift knowledge base among the first results, e.g.

https://access.redhat.com/solutions/5372461
https://access.redhat.com/solutions/5085891

They have "verified solutions" that are unfortunately "SUBSCRIBER EXCLUSIVE CONTENT".

The fact that all the top Google hits point to Openshift seems to hint that this is a Red Hat issue...

keunlee commented 3 years ago

@jandubois I got a hint of what needed to be done from this source here:

https://github.com/operator-framework/operator-sdk/issues/1736#issuecomment-549433116

basically, add the "update" verb on "deployments/finalizers"
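Expressed as an RBAC rule in the same style as the snippets above, that narrower fix would look roughly like this (untested here; the apps apiGroup is an assumption, since deployments live in that group):

```yaml
# Hedged sketch: grant only "update" on deployments/finalizers
# instead of wildcard permissions.
rules:
  - verbs:
      - update
    apiGroups:
      - apps
    resources:
      - deployments/finalizers
```

Depending on which ownerReference triggers the error, an analogous rule for the quarks CRDs (apiGroup quarks.cloudfoundry.org, resource quarkssecrets/finalizers) might be what is needed instead.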

keunlee commented 3 years ago

Nonetheless, the original issue that I ran into is resolved; hence, closing this issue.

I will continue to try and work through getting things running on an Openshift 4.7.0 environment.

jandubois commented 3 years ago

basically, add the "update" verb on "deployments/finalizers"

Just to make sure I understand this correctly: the role that needs to add the update permission is part of the OpenShift config and not something that should/could be added in the Quarks helm chart, right?

keunlee commented 3 years ago

@jandubois I updated the following roles which were created when I installed the cf-operator/quarks-operator using the helm chart from the latest kubecf release bundle:

cf-operator-quarks-cluster 
cf-operator-quarks-job
cf-operator-quarks-secret
cf-operator-quarks-statefulset

To get the ball rolling, I added the following rule to each of the roles above:

rules:
  - verbs:
      - '*'
    apiGroups:
      - '*'
    resources:
      - '*'
  - ... more rules below this ...

I know it's overkill and a security risk to grant that much privilege in a production environment, but for testing and getting the ball rolling, it's what worked in my case.

I did NOT test by adding in the update verb to the resource deployments/finalizers for any of the roles above.

Additionally, I would need to verify, but I believe those role names are prepended with the namespace where you install the quarks helm chart from the kubecf release bundle.
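To illustrate that naming assumption (unverified, as noted above): if the pattern is "<install-namespace>-quarks-<component>", the expected role names for a given namespace can be derived like so:

```shell
# Sketch of the (unverified) naming pattern "<namespace>-quarks-<component>".
# "cf-operator" is just the namespace used earlier in this thread.
ns=cf-operator
for component in cluster job secret statefulset; do
  echo "${ns}-quarks-${component}"
done
```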

I hope that helps ;)

jandubois commented 3 years ago

I did NOT test by adding in the update verb to the resource deployments/finalizers for any of the roles above.

Hi @keunlee

Thanks for looking into this issue. The permissions you list above effectively make the role equivalent to cluster-admin, so we won't be changing quarks to do that.

If you could test that adding update for deployments/finalizers makes things work out of the box on OpenShift, then we would put that into the helm chart; otherwise we'll have to leave things as-is, as we don't have an OpenShift setup for testing...

rsletten commented 3 years ago

I'm having the exact same problem with diego-cell-0 and router-0 on k8s set up with kubeadm, but I'm not sure where/how to add the update verb on deployments/finalizers.

jandubois commented 3 years ago

I think @jbuns is the only person around here with any Openshift experience. If you do figure it out, please attach more detailed instructions for the next person stumbling over this! Thank you!

rsletten commented 3 years ago

Hey @jandubois, I'm using vanilla kubernetes (not openshift) and I'm having the same problem as the OP. Thanks!

jandubois commented 3 years ago

@rsletten Sorry, in that case I have no idea. I haven't seen or heard about the deployment/finalizers problem in any other context, so I think your k8s configuration must be at least inspired by the Openshift setup.