IBM / cp4waiops-gitops

Manage Your IBM Cloud Pak for Watson AIOps With GitOps
https://ibm.github.io/cp4waiops-gitops/docs/
Apache License 2.0

Gitops : Error creating : pods xxxx is forbidden #229

Open Gilles-Plaquet opened 1 year ago

Gilles-Plaquet commented 1 year ago

While trying to deploy the Event Manager and the AI-Manager, I ran into a permissions issue that results in a "failed to create" error. I attached a screenshot of the error. I get the same error on multiple resources that are trying to create objects.

I already checked that my Argo CD has the required cluster role bindings. Just to make sure, I added a screenshot of the YAML file of this role binding as well.

I hope someone can help me resolve this issue! Thanks in advance.

Kind regards, Gilles

Screenshot 2022-11-29 at 15 33 44 Screenshot 2022-11-29 at 17 22 00 Screenshot 2022-11-29 at 17 03 25
gyliu513 commented 1 year ago

@Gilles-Plaquet can you provide more info: at which step did it fail, based on the document at https://ibm.github.io/cp4waiops-gitops/docs/how-to-deploy-cp4waiops-35 ? What is your OCP version?

Gilles-Plaquet commented 1 year ago

@gyliu513 the current OpenShift version is 4.8.39. The documentation states that it should be above 4.5, so I guess that should be fine.

I was able to create the ceph-cluster and the shared application without any issues (all app details seem to be healthy there). I think it was when it started installing the AI-Manager that I noticed part of the application becoming degraded. Then I started to get the issue described above.

Hope this information helps.

gyliu513 commented 1 year ago

@Gilles-Plaquet can you log in to your OCP cluster and run the command oc get po -n cp4waiops to check the pod status? If some pods are not running, can you run oc logs for one of them and append the log here?
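
For reference, the check requested above can be sketched as a small shell helper. This is only a sketch: not_running is a hypothetical name, and it assumes the default oc get po table layout, with STATUS in the third column.

```shell
# Hypothetical helper: read `oc get po` table output on stdin and print the
# names of pods whose STATUS is neither Running nor Completed.
not_running() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Usage against a live cluster (requires a prior `oc login`):
#   oc get po -n cp4waiops | not_running
#   oc logs -n cp4waiops <pod-name>
```

If the helper prints nothing, either every pod is healthy or no pods exist at all. In the second case, oc get events -n cp4waiops is usually the next place to look.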

Gilles-Plaquet commented 1 year ago

@gyliu513 there are none. The error message says that it can't create pods, so I think that's why we don't see anything.

Screenshot 2022-11-29 at 18 00 56

I went to check in the OpenShift interface, and there I see this:

Screenshot 2022-11-29 at 18 03 45
gyliu513 commented 1 year ago

Thanks @Gilles-Plaquet, it seems to be a permission issue, but that is weird since Argo CD already has the cluster admin permission. Let me dig more.

In the meantime, can you run oc get pods -n rook-ceph to make sure all rook ceph pods are running well?

Gilles-Plaquet commented 1 year ago

@gyliu513 exactly, that was my reasoning as well: a permission issue, but I have all the cluster permissions. Thanks already for the help!

I also ran the command, and everything in the rook-ceph cluster looks fine (to me).

Screenshot 2022-11-29 at 18 34 28
gyliu513 commented 1 year ago

@Gilles-Plaquet let me check more with @huang-cn and @morningspace. They are located in China, so I hope we can give you more info tomorrow. Thanks!

Gilles-Plaquet commented 1 year ago

Thanks a lot already!

morningspace commented 1 year ago

@Gilles-Plaquet I see you mentioned that you are deploying both the Event Manager and the AI-Manager. May I know which install option you are using: installing them one by one, or using the all-in-one template? Also, which release are you deploying? Can you share the output of oc get csv for the namespaces cp4waiops and ibm-common-services?

Gilles-Plaquet commented 1 year ago

@morningspace

I used the one-by-one installation, since the other one was in technical preview. I opted for release 3.5.

Screenshot 2022-11-30 at 08 58 08

Screenshot 2022-11-30 at 08 59 38

morningspace commented 1 year ago

@huang-cn did a test using the 3.5 release today and it worked without problems, so I guess there must be something different on your cluster. I will check with @huang-cn and keep you posted tomorrow.

Gilles-Plaquet commented 1 year ago

@morningspace Thanks a lot! If a Webex or Zoom call would make it easier to solve the issue, that is possible of course!

huang-cn commented 1 year ago

@Gilles-Plaquet I don't understand why the runAsUser: Invalid value: 1001 error appears here. AIOps should not use the runAsUser SCC option at all; it shouldn't specify any UID value and should let OCP allocate one. I'm wondering whether the catalog image in this environment is the same as in ours.
Could you run the commands below to check the catalogsource image and the operator SCC settings?

oc -n openshift-marketplace get catalogsource ibm-operator-catalog -oyaml|grep image:

oc -n cp4waiops get deploy iaf-core-operator-controller-manager -oyaml|grep -v 'f:securityContext'|grep securityContext  -A8

oc -n ibm-common-services get deploy ibm-common-service-operator -oyaml|grep -v 'f:securityContext'|grep securityContext  -A8
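
As background for the runAsUser error: on OpenShift, the restricted SCC normally accepts only UIDs from the range pre-allocated to the namespace, which is recorded in the openshift.io/sa.scc.uid-range annotation as <start>/<size>. The following is a minimal sketch for checking whether a UID such as 1001 falls inside that range. uid_in_range is a hypothetical helper, and the range in the example is an assumed sample value, not one taken from this cluster.

```shell
# Hypothetical helper: succeed if UID $1 lies inside a range written as
# "<start>/<size>", the format of the openshift.io/sa.scc.uid-range annotation.
uid_in_range() {
  uid=$1
  start=${2%/*}   # text before the slash
  size=${2#*/}    # text after the slash
  [ "$uid" -ge "$start" ] && [ "$uid" -lt $((start + size)) ]
}

# Example with an assumed sample range:
#   uid_in_range 1001 "1000650000/10000" || echo "UID 1001 is outside the range"
#
# Usage against a live cluster (requires a prior `oc login`):
#   range=$(oc get namespace cp4waiops \
#     -o jsonpath='{.metadata.annotations.openshift\.io/sa\.scc\.uid-range}')
#   uid_in_range 1001 "$range" || echo "UID 1001 is outside $range"
```

A pod spec that hard-codes a UID outside the namespace range would be rejected under such an SCC, which would be consistent with the Invalid value: 1001 message, although the thread does not confirm that this is the cause here.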
Gilles-Plaquet commented 1 year ago

The namespace currently does not exist, but it did yesterday (see the post above). This might be because OpenShift was unable to install the operator.

gyliu513 commented 1 year ago

@Gilles-Plaquet AIOps never uninstalls the ibm-common-services components unless you remove them manually, so this is weird. We can talk next Monday to dig more; I hope that is OK. Thanks!