Closed: keunlee closed this issue 3 years ago
An update on my latest findings as far as my own investigation goes:
1) first things first, make sure to observe the logs of the operator!
export OPERATOR_POD=$(kubectl get pods -l name=cf-operator --namespace cf-operator --output name)
kubectl -n cf-operator logs $OPERATOR_POD -f
2) regarding the issue I was facing, the logs gave one particular tell-tale clue:
2021-03-07T21:24:02.211Z ERROR controller-runtime.manager.controller.boshdeployment-controller controller/controller.go:252 Reconciler error {"name": "kubecf", "namespace": "kubecf", "error": "failed to create quarks secrets for BOSH manifest 'kubecf/kubecf': creating or updating QuarksSecret 'kubecf/var-blobstore-admin-users-password': quarkssecrets.quarks.cloudfoundry.org \"var-blobstore-admin-users-password\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>"}
That means that the role which governs the CRD quarkssecrets.quarks.cloudfoundry.org is missing a permission for creating/updating finalizers.
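Based on that error, the missing piece is presumably a rule along the lines of the sketch below. This is only an illustration; the exact apiGroup and resource names are assumptions inferred from the error message, not taken from the quarks chart:

```yaml
# Hypothetical rule sketch: allow creating/updating finalizers on the
# QuarksSecret CRD. Names are inferred from the error message above.
rules:
- apiGroups:
  - quarks.cloudfoundry.org
  resources:
  - quarkssecrets/finalizers
  verbs:
  - create
  - update
```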
How to get around this?
I took a brute-force "sledgehammer" approach and assigned the following rule to the following roles:
rule
rules:
- verbs:
- '*'
apiGroups:
- '*'
resources:
- '*'
roles
cf-operator-quarks-cluster
cf-operator-quarks-job
cf-operator-quarks-secret
cf-operator-quarks-statefulset
I wouldn't do this on a production box, but it gets the ball rolling.
I will post further updates as things pan out.
so far, the deployment has been stuck on the diego-cell-0 and router-0 deployments.
Every 2.0s: oc get po
NAME READY STATUS RESTARTS AGE
api-0 17/17 Running 1 31m
auctioneer-0 6/6 Running 4 31m
cc-worker-0 6/6 Running 0 31m
cf-apps-dns-59f9f659f5-w7pmx 1/1 Running 0 56m
coredns-quarks-546fdbd7cb-kd866 1/1 Running 0 54m
coredns-quarks-546fdbd7cb-td6zj 1/1 Running 0 54m
credhub-0 8/8 Running 0 31m
database-0 2/2 Running 0 54m
database-seeder-655e676ea24e1b8f-fzgv7 0/2 Completed 0 55m
diego-api-0 9/9 Running 3 31m
diego-cell-0 0/12 Init:CrashLoopBackOff 8 31m
doppler-0 6/6 Running 0 31m
log-api-0 9/9 Running 0 31m
log-cache-0 10/10 Running 0 31m
nats-0 7/7 Running 0 31m
router-0 0/7 Init:CrashLoopBackOff 8 31m
routing-api-0 6/6 Running 2 31m
scheduler-0 12/12 Running 1 31m
singleton-blobstore-0 8/8 Running 0 31m
tcp-router-0 7/7 Running 0 31m
uaa-0 9/9 Running 0 31m
The only log I was able to extract from the diego-cell-0 pod was from the bosh-pre-start-cflinuxfs3-rootfs-setup container.
+ '[' -x /var/vcap/jobs/cflinuxfs3-rootfs-setup/bin/pre-start ']'
+ /var/vcap/jobs/cflinuxfs3-rootfs-setup/bin/pre-start
+ CONF_DIR=/var/vcap/jobs/cflinuxfs3-rootfs-setup/config
+ ROOTFS_PACKAGE=/var/vcap/packages/cflinuxfs3
+ ROOTFS_DIR=/var/vcap/data/rep/cflinuxfs3/rootfs
+ ROOTFS_TAR=/var/vcap/data/rep/cflinuxfs3/rootfs.tar
+ TRUSTED_CERT_FILE=/var/vcap/jobs/cflinuxfs3-rootfs-setup/config/certs/trusted_ca.crt
+ CA_DIR=/var/vcap/data/rep/cflinuxfs3/rootfs/usr/local/share/ca-certificates/
+ '[' '!' -d /var/vcap/data/rep/cflinuxfs3/rootfs ']'
+ grep -q trusted_ca_certificates /var/vcap/jobs/cflinuxfs3-rootfs-setup/config/certs/trusted_ca.crt
+ JSON_CERT_FILE=/var/vcap/jobs/cflinuxfs3-rootfs-setup/config/certs/trusted_ca.json
+ cp /var/vcap/jobs/cflinuxfs3-rootfs-setup/config/certs/trusted_ca.crt /var/vcap/jobs/cflinuxfs3-rootfs-setup/config/certs/trusted_ca.json
+ TRUSTED_CERT_FILE=/var/vcap/jobs/cflinuxfs3-rootfs-setup/config/certs/trusted_ca.json
+ rm -f /var/vcap/data/rep/cflinuxfs3/rootfs/usr/local/share/ca-certificates/trusted-ca-1.crt /var/vcap/data/rep/cflinuxfs3/rootfs/usr/local/share/ca-certificates/trusted-ca-2.crt /var/vcap/data/rep/cflinuxfs3/rootfs/usr/local/share/ca-certificates/trusted-ca-3.crt
+ mkdir -p /var/vcap/data/rep/cflinuxfs3/rootfs/usr/local/share/ca-certificates/
+ /var/vcap/packages/rootfs-certsplitter-cflinuxfs3/bin/certsplitter /var/vcap/jobs/cflinuxfs3-rootfs-setup/config/certs/trusted_ca.json /var/vcap/data/rep/cflinuxfs3/rootfs/usr/local/share/ca-certificates/
trying to run update-ca-certificates...
trying to run update-ca-certificates...
trying to run update-ca-certificates...
failed to setup ca certificates
+ chmod 0644 /var/vcap/data/rep/cflinuxfs3/rootfs/usr/local/share/ca-certificates//trusted-ca-1.crt /var/vcap/data/rep/cflinuxfs3/rootfs/usr/local/share/ca-certificates//trusted-ca-2.crt /var/vcap/data/rep/cflinuxfs3/rootfs/usr/local/share/ca-certificates//trusted-ca-3.crt
+ updated_certs=1
+ retry_count=0
+ max_retry_count=3
+ set +e
+ '[' 1 -eq 0 ']'
+ '[' 0 -ge 3 ']'
+ echo 'trying to run update-ca-certificates...'
+ TMPDIR=/tmp
+ timeout --signal=KILL 60s chroot /var/vcap/data/rep/cflinuxfs3/rootfs /usr/sbin/update-ca-certificates -f
chroot: cannot change root directory to '/var/vcap/data/rep/cflinuxfs3/rootfs': Operation not permitted
+ updated_certs=125
+ retry_count=1
+ '[' 125 -eq 0 ']'
+ '[' 1 -ge 3 ']'
+ echo 'trying to run update-ca-certificates...'
+ TMPDIR=/tmp
+ timeout --signal=KILL 60s chroot /var/vcap/data/rep/cflinuxfs3/rootfs /usr/sbin/update-ca-certificates -f
chroot: cannot change root directory to '/var/vcap/data/rep/cflinuxfs3/rootfs': Operation not permitted
+ updated_certs=125
+ retry_count=2
+ '[' 125 -eq 0 ']'
+ '[' 2 -ge 3 ']'
+ echo 'trying to run update-ca-certificates...'
+ TMPDIR=/tmp
+ timeout --signal=KILL 60s chroot /var/vcap/data/rep/cflinuxfs3/rootfs /usr/sbin/update-ca-certificates -f
chroot: cannot change root directory to '/var/vcap/data/rep/cflinuxfs3/rootfs': Operation not permitted
+ updated_certs=125
+ retry_count=3
+ '[' 125 -eq 0 ']'
+ '[' 3 -ge 3 ']'
+ set -e
+ '[' 125 -ne 0 ']'
+ echo 'failed to setup ca certificates'
+ exit 1
real 0m0.117s
user 0m0.005s
sys 0m0.024s
the role which governs the CRD quarkssecrets.quarks.cloudfoundry.org is missing a permission for creating/updating finalizers
@manno or @rohitsakala can you please take a look if this is a Quarks issue?
For the router-0 pod, most of the containers reported logs; however, the only one I could identify that reported something remotely off was the bpm-pre-starter-gorouter container:
+ /var/vcap/jobs/gorouter/bin/bpm-pre-start
unable to set CAP_SETFCAP effective capability: Operation not permitted
real 0m0.008s
user 0m0.003s
sys 0m0.004s
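Both the chroot "Operation not permitted" error on diego-cell-0 and the CAP_SETFCAP failure here are consistent with the pods running under OpenShift's restrictive default SCC. One possible direction (untested here, and not production-safe) is to let the relevant service accounts use a more permissive SCC via RBAC, e.g.:

```yaml
# Hypothetical, untested sketch: a ClusterRole granting "use" of the
# privileged SCC; it would need a binding to the kubecf service accounts.
# The role name is made up; whether this fully unblocks the pods is unverified.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubecf-use-privileged-scc
rules:
- apiGroups:
  - security.openshift.io
  resources:
  - securitycontextconstraints
  resourceNames:
  - privileged
  verbs:
  - use
```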
A Google search for forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>
shows several entries in the Openshift knowledge base as the first results, e.g.
https://access.redhat.com/solutions/5372461 https://access.redhat.com/solutions/5085891
They have "verified solutions" that are unfortunately "SUBSCRIBER EXCLUSIVE CONTENT".
The fact that all the top Google hits point to Openshift hints that this is a Red Hat issue...
@jandubois I got a hint of what needed to be done from this source:
https://github.com/operator-framework/operator-sdk/issues/1736#issuecomment-549433116
basically, add the "update" verb on "deployments/finalizers"
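Following the linked operator-sdk comment, the extra rule would look something like the sketch below; the apiGroup and the exact placement within the role are assumptions:

```yaml
# Hypothetical rule sketch: grant the update verb on deployments/finalizers,
# per the linked operator-sdk comment.
rules:
- apiGroups:
  - apps
  resources:
  - deployments/finalizers
  verbs:
  - update
```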
Nonetheless, the original issue I ran into is resolved, hence I am closing this issue.
I will continue to try and work through getting things running on an Openshift 4.7.0 environment.
basically, add the "update" verb on "deployments/finalizers"
Just to make sure I understand this correctly: the role that needs to add the update permission is part of the OpenShift config and not something that should/could be added in the Quarks helm chart, right?
@jandubois I updated the following roles which were created when I installed the cf-operator/quarks-operator using the helm chart from the latest kubecf release bundle:
cf-operator-quarks-cluster
cf-operator-quarks-job
cf-operator-quarks-secret
cf-operator-quarks-statefulset
To get the ball rolling, I added the following rule to each of the roles above:
rules:
- verbs:
- '*'
apiGroups:
- '*'
resources:
- '*'
- ... more rules below this ...
I know it's overkill and a security risk in a production environment to grant that much privilege, but for testing and getting the ball rolling, it's what worked in my case.
I did NOT test by adding the update verb on the resource deployments/finalizers for any of the roles above.
Additionally, I would need to verify, but I believe those role names are prepended with the namespace where you install the quarks helm chart from the kubecf release bundle.
I hope that helps ;)
I did NOT test by adding the update verb on the resource deployments/finalizers for any of the roles above.
Hi @keunlee
Thanks for looking into this issue. The permissions you list above effectively make the role equivalent to cluster-admin, so we won't be changing quarks to do that.
If you could test that adding update for deployment/finalizers makes things work out-of-the-box on OpenShift, then we would put that into the helm chart, but otherwise we'll have to leave things as-is, as we don't have an OpenShift setup for testing...
I'm having the exact same problem with diego-cell-0 and router-0 on k8s using kubeadm, but I'm not sure where/how to add the update verb on deployment/finalizers.
I think @jbuns is the only person around here with any Openshift experience. If you do figure it out, please attach more detailed instructions for the next person stumbling over this! Thank you!
Hey @jandubois I'm using vanilla kubernetes (not openshift) and I'm having the same problem as OP. Thanks!
@rsletten Sorry, in that case I have no idea. I haven't seen or heard about the deployment/finalizers problem in any other context, so I think your k8s configuration must be at least inspired by the Openshift setup.
Describe the bug
Deployment of the KubeCF helm chart onto Openshift 4.7.0 results in an incomplete installation, which hangs on initialization of the "cf-apps-dns" deployment pod.
To Reproduce
1) obtain the latest bundle (v2.7.12) and decompress:
2) create "cf-operator" project/namespace
3) install "cf-operator/quarks" helm chart from release bundle:
4) install "kubecf" helm chart from release bundle:
This results in the cf-apps-dns deployment having a pod stuck indefinitely on initialization. If we observe the course of events after the KubeCF chart installation:
the first warning encountered in the event stream reads:
Expected behavior
The cf-apps-dns deployment should continue, and be able to find all its secrets (i.e. var-cf-app-sd-client-tls).
The cf-apps-dns deployment should not have its corresponding pod hang on initialization.
Environment
ovirt-csi-sc
Additional context
system_domain, which is specified in values.yaml. As to which details are absolutely required/necessary to minimally get an installation running on Openshift, this is not immediately clear to me.
Events from the kubecf namespace - events.txt