Open Gilles-Plaquet opened 2 years ago
@Gilles-Plaquet can you provide more info on which step failed, based on the document here: https://ibm.github.io/cp4waiops-gitops/docs/how-to-deploy-cp4waiops-35 ? What is your OCP version?
@gyliu513 the current OpenShift version is 4.8.39. The documentation states that it should be above 4.5, so I guess that should be fine.
I was able to create the ceph-cluster and the shared application without any issues (all app details seem to be healthy there). I guess it was when it started installing the AI-Manager that I noticed part of the application getting degraded. Then I started to get the issue stated above.
Hope this information helps.
@Gilles-Plaquet can you log in to your OCP cluster and run the command oc get po -n cp4waiops to check the pod status? If some pods are not running, can you run oc logs for one of the non-running pods and append the log here?
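The check requested above can be narrowed to just the problem pods. This is a minimal sketch, not part of the official procedure: not_running_pods is a hypothetical helper name, and it assumes the default oc get po column layout (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `oc get po` output on stdin and prints the
# names of pods whose STATUS column is neither Running nor Completed.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
not_running_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Usage against a live cluster (needs a logged-in oc session):
#   oc get po -n cp4waiops | not_running_pods
#   oc logs -n cp4waiops <pod-name>
```

If the list is empty because no pods were created at all, the namespace events (oc get events -n cp4waiops) are usually the next place to look.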
@gyliu513 there are none. The error message says that it can't create pods, which is why we don't see anything, I think.
I went to check in the OpenShift interface and I see this:
Thanks @Gilles-Plaquet, it seems to be a permission issue, but that is weird since Argo CD already has the cluster admin permission. Let me dig more.
In the meantime, can you run oc get pods -n rook-ceph to make sure all rook ceph pods are running well?
@gyliu513 exactly, that was my reasoning as well: a permission issue, but I have all the cluster permissions. Thanks already for the help!
I also ran the command and everything in the rook-ceph cluster looks fine (to me).
@Gilles-Plaquet let me check more with @huang-cn and @morningspace. They are located in China, so I hope we can give you more info tomorrow, thanks!
Thanks a lot already!
@Gilles-Plaquet
I see you mentioned that you are deploying both Event and the AI-Manager. May I know which install option you are using, e.g. installing them one by one, or using the all-in-one template? Also, may I know which release you are deploying? Can you share the outputs of oc get csv under the namespaces cp4waiops and ibm-common-services?
@morningspace
I used the one-by-one installation, since the other one was in technical preview. I opted for release 3.5.
@huang-cn did a test using the 3.5 release today and it worked without problems, so I guess there must be something different on your cluster. I will check with @huang-cn and keep you posted tomorrow.
@morningspace Thanks a lot! In case a Webex, Zoom call, etc. makes it easier to solve the issue, that is possible of course!
@Gilles-Plaquet I don't understand why this runAsUser: Invalid value: 1001 error appears here. AIOps should not use the runAsUser scc option at all; it shouldn't specify any UID value and should let OCP allocate one. I'm wondering if the catalog image in this env is the same as in ours?
Could you run the commands below to check the catalogsource image and operator scc settings?
oc -n openshift-marketplace get catalogsource ibm-operator-catalog -oyaml|grep image:
oc -n cp4waiops get deploy iaf-core-operator-controller-manager -oyaml|grep -v 'f:securityContext'|grep securityContext -A8
oc -n ibm-common-services get deploy ibm-common-service-operator -oyaml|grep -v 'f:securityContext'|grep securityContext -A8
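For context on the runAsUser error discussed above: under an SCC that uses the MustRunAsRange strategy, OpenShift only accepts UIDs inside the range recorded in the namespace annotation openshift.io/sa.scc.uid-range, which has the form start/size. The sketch below checks a UID against such a range. The helper name uid_in_range and the range value are illustrative assumptions, not values taken from this cluster, and this is only one possible explanation for the error.

```shell
# uid_in_range UID RANGE
# RANGE uses the "<start>/<size>" form of the namespace annotation
# openshift.io/sa.scc.uid-range. Returns 0 if UID falls inside the range.
uid_in_range() {
  uid=$1
  start=${2%/*}   # text before the slash
  size=${2#*/}    # text after the slash
  [ "$uid" -ge "$start" ] && [ "$uid" -lt $((start + size)) ]
}

# Illustrative range only; on a live cluster it could be read with:
#   oc get namespace cp4waiops \
#     -o jsonpath='{.metadata.annotations.openshift\.io/sa\.scc\.uid-range}'
if uid_in_range 1001 "1000680000/10000"; then
  echo "1001 allowed"
else
  echo "1001 rejected"
fi
```

A fixed UID such as 1001 falls outside any range of this kind, which would be consistent with the suspicion that a different operator image is setting runAsUser explicitly.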
oc -n openshift-marketplace get catalogsource ibm-operator-catalog -oyaml|grep image:
oc -n cp4waiops get deploy iaf-core-operator-controller-manager -oyaml|grep -v 'f:securityContext'|grep securityContext -A8
oc -n ibm-common-services get deploy ibm-common-service-operator -o yaml|grep -v 'f:securityContext'|grep securityContext -A8
Currently the namespace does not exist; however, yesterday it did, see the post above. This might be because OpenShift was unable to install the operator.
@Gilles-Plaquet AIOps never uninstalls the ibm-common-services components unless you remove them manually, so this is weird. We can talk next Monday to dig more, hope that is OK. Thanks!
While I was trying to deploy the Event and the AI-Manager, I stumbled across a permissions issue that results in a "failed to create x" error. I added a screenshot of the error in the attachments. I get the same error on multiple resources that are trying to create objects.
I already checked that my Argo CD has the required cluster role bindings. Just to make sure, I added a screenshot of the YAML file of this role binding as well.
Hoping someone can help me resolve this issue! Thanks in advance.
Kind regards, Gilles