admiraltyio / admiralty

A system of Kubernetes controllers that intelligently schedules workloads across clusters.
https://admiralty.io
Apache License 2.0

need help to install Admiralty 0.14.1 in OpenShift 4.7 #128

Open hfwen0502 opened 2 years ago

hfwen0502 commented 2 years ago

I am trying to explore the capabilities that Admiralty can offer in an OCP cluster provisioned on IBM Cloud. Below is the info about the OCP cluster and the cert-manager version installed there:

[root@hf-ocp-login1 ]# oc version
Client Version: 4.9.7
Server Version: 4.7.36
Kubernetes Version: v1.20.0+bbbc079

[root@hf-ocp-login1 ]# helm ls -A
NAME            NAMESPACE       REVISION    UPDATED                                 STATUS      CHART               APP VERSION
cert-manager    cert-manager    1           2021-11-23 18:27:06.126167907 +0000 UTC deployed    cert-manager-v1.6.1 v1.6.1     

However, when trying to install Admiralty, I encountered the error shown below:

Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: [unable to recognize "": no matches for kind "Certificate" in version "cert-manager.io/v1alpha2", unable to recognize "": no matches for kind "Issuer" in version "cert-manager.io/v1alpha2"]

Any idea how to fix this?

adrienjt commented 2 years ago

cert-manager 1.6 stopped serving alpha and beta APIs: https://github.com/jetstack/cert-manager/releases/tag/v1.6.0

helm template ... | cmctl convert -f - | kubectl apply -f -

instead of

helm install ...

should work. Please feel free to submit a PR to implement the conversions in the chart (so that helm install works again). We haven't upgraded to cert-manager 1.6 yet on our side, so we haven't had an urgent need for the conversion.
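
For reference, a fuller version of that pipeline might look like the one below. The chart repo, release name, namespace, and version are assumptions based on the Admiralty install instructions, so adjust them to your environment:

helm repo add admiralty https://charts.admiralty.io
helm repo update
kubectl create namespace admiralty
helm template admiralty admiralty/multicluster-scheduler \
  --namespace admiralty --version 0.14.1 \
  | cmctl convert -f - \
  | kubectl apply -n admiralty -f -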

hfwen0502 commented 2 years ago

@adrienjt Thanks. I also just found out how to get around the helm install issue using the "helm template" route. Things seem to be working fine now.

hfwen0502 commented 2 years ago

Everything works fine out of the box on plain Kubernetes clusters. However, there are quite a few things that users need to change in order to get it working on OpenShift (e.g. clusterroles). Now I am facing an issue with the virtual node that represents the workload cluster:

oc describe node admiralty-default-ocp-eu2-1-6198a17ca3

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)

Any idea why the resources (cpu/memory) on the virtual node are all 0? I am using a service account for authentication between the target and workload clusters. It works fine on K8s but not on OpenShift.

hfwen0502 commented 2 years ago

I was able to figure out how to set up a kubeconfig secret for OpenShift clusters. Everything works beautifully. Love the tool!
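
For anyone who hits the same thing before the docs are updated, the general shape is to build a kubeconfig for the workload cluster that authenticates with a long-lived service account token (rather than the OAuth session that oc login produces), store it in a secret on the management cluster, and reference it from a Target. This is only a sketch based on the Admiralty quick start; the names are placeholders, and the secret key and Target field names are assumptions to verify against your Admiralty version.

kubectl create secret generic ocp-eu2-1 \
  --from-file=config=workload-cluster-kubeconfig -n default

apiVersion: multicluster.admiralty.io/v1alpha1
kind: Target
metadata:
  name: ocp-eu2-1
  namespace: default
spec:
  kubeconfigSecret:
    name: ocp-eu2-1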

adrienjt commented 2 years ago

Hi @hfwen0502, I'm glad you were able to figure this out. Would you care to contribute how to set up a kubeconfig secret for OpenShift clusters to the Admiralty documentation? (PR under docs/)

hfwen0502 commented 2 years ago

Of course, I would be happy to contribute the documentation. Can the platform be based on the IKS and ROKS services on IBM Cloud? I work in the hybrid cloud organization in IBM Research. By the way, RBAC also needs to be adjusted on OpenShift:

oc edit clusterrole admiralty-multicluster-scheduler-source

- apiGroups:
  - ""
  resources:
  - pods
  # add the line below
  - pods/finalizers
  verbs:
  - list
  # add the line below
  - '*'
- apiGroups:
  - multicluster.admiralty.io
  resources:
  - podchaperons
  # add the three lines below
  - podchaperons/finalizers
  - sources
  - sources/finalizers
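
After editing, you can confirm the clusterrole picked up the changes and spot-check effective permissions with impersonation; the subject below is a placeholder, since the exact service account depends on how the chart binds this clusterrole:

oc describe clusterrole admiralty-multicluster-scheduler-source
oc auth can-i update podchaperons.multicluster.admiralty.io/finalizers \
  --as=system:serviceaccount:<namespace>:<serviceaccount>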

adrienjt commented 2 years ago

Can the platform be based on the IKS and ROKS services on IBM Cloud?

Yes, no problem.

By the way, RBAC needs to be adjusted as well on OpenShift.

Could you contribute the RBAC changes to the Helm chart?

hfwen0502 commented 2 years ago

A PR has been submitted that includes both the RBAC and doc changes: https://github.com/admiraltyio/admiralty/pull/134

hfwen0502 commented 2 years ago

Things only work in the default namespace on OpenShift. There are issues related to SCCs when we set up Admiralty in the non-default namespace. The errors are shown below:


E0128 20:33:01.214968       1 controller.go:117] error syncing 'hfwen/test-job-hscvc-ms6r7': pods "test-job-hscvc-ms6r7" 
is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user 
or serviceaccount, provider restricted: .spec.securityContext.fsGroup: Invalid value: []int64{1000720000}: 1000720000 is
not an allowed group, provider restricted: .spec.securityContext.seLinuxOptions.level: Invalid value: "s0:c27,c9": must be s0:c26,c25,
spec.containers[0].securityContext.runAsUser: Invalid value: 1000720000: must be in the ranges: [1000700000, 1000709999], 
spec.containers[0].securityContext.seLinuxOptions.level: Invalid value: "s0:c27,c9": must be s0:c26,c25, provider 
"ibm-restricted-scc": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by 
user or serviceaccount, provider "ibm-anyuid-scc": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": 
Forbidden: not usable by user or serviceaccount, provider "ibm-anyuid-hostpath-scc": Forbidden: not usable by user or serviceaccount,
provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden:
not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider 
"ibm-anyuid-hostaccess-scc": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by 
user or serviceaccount, provider "ibm-privileged-scc": Forbidden: not usable by user or serviceaccount, provider "privileged": 
Forbidden: not usable by user or serviceaccount], requeuing

adrienjt commented 2 years ago

when we set up Admiralty in the non-default namespace

When Admiralty is installed in the non-default namespace and/or when Sources/Targets are set up (and pods created) in the non-default namespace?

Which SCC are you expecting to apply? restricted (the only one allowed, but not passing) or something else? If restricted, have you tried configuring your test job's security context to make it pass the policy? If something else, have you tried allowing the SCC for the pod's service account in that namespace?

hfwen0502 commented 2 years ago

@adrienjt Sorry, I should have made myself clear. Admiralty is always installed in the Admiralty namespace. The SCC issue occurs when Sources/Targets are set up in a non-default namespace. Let's assume Sources/Targets are in the hfwen namespace. In the annotations of the proxy pod at the source, we have the following:

Source Proxy Pod:
Annotations:    multicluster.admiralty.io/elect:
                multicluster.admiralty.io/sourcepod-manifest:
                  apiVersion: v1
                  kind: Pod
                  spec:
                    containers:
                      securityContext:
                        capabilities:
                          drop:
                          - KILL
                          - MKNOD
                          - SETGID
                          - SETUID
                        runAsUser: 1000690000
                    securityContext:
                      fsGroup: 1000680000
                      seLinuxOptions:
                        level: s0:c26,c15

On the target cluster, the PodChaperon object has this:

oc get podchaperons hf1-job-tvlrx-p7sp2 -o yaml
apiVersion: multicluster.admiralty.io/v1alpha1
kind: PodChaperon
spec:
  containers:
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000680000
  securityContext:
    fsGroup: 1000680000
    seLinuxOptions:
      level: s0:c26,c15

This is a problem because the target cluster actually expects the security context in the hfwen namespace to be along the lines of the following:

  securityContext:
    fsGroup: 1000780000 <= should be something within the range [1000700000, 1000709999]
    seLinuxOptions:
      level: s0:c26,c15 <= should be s0:c26,c25

Any idea how to resolve this? When Sources/Targets are in the default namespace, the securityContext stays empty, which is why we did not hit this problem there. I have also tried adjusting the SCC granted to the service account, which did not work.
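
One way to see exactly which ranges the target namespace enforces is to read the SCC-related annotations that OpenShift sets on every namespace; the values below are illustrative, reconstructed from the error message earlier in this thread:

oc get namespace hfwen -o yaml | grep sa.scc
    openshift.io/sa.scc.mcs: s0:c26,c25
    openshift.io/sa.scc.supplemental-groups: 1000700000/10000
    openshift.io/sa.scc.uid-range: 1000700000/10000

The restricted SCC defaults runAsUser, fsGroup, and the SELinux level from these annotations, which is why a security context stamped in the source cluster's namespace generally fails validation in a target namespace with different ranges.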

hfwen0502 commented 2 years ago

On OpenShift, every namespace comes with three service accounts by default.

NAME       SECRETS   AGE
builder    2         44m
default    2         44m
deployer   2         44m

Adding the privileged SCC to the default service account in my hfwen namespace (on both the source and target clusters) seems to fix the SCC issue.

oc adm policy add-scc-to-user privileged -z default -n hfwen

@adrienjt Is this what you had in mind? Is this good practice, or is it the only way to resolve it?

hfwen0502 commented 2 years ago

OK, I found a better solution. The OpenShift clusters on IBM Cloud come with other preconfigured SCCs. We can use a less-privileged one instead of privileged.
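
The grant itself looks the same as before, just with a narrower SCC. The candidates are the preconfigured IBM SCCs visible in the error message above (for example ibm-anyuid-scc), and oc describe scc shows which runAsUser/fsGroup/seLinuxContext strategies each one relaxes; the SCC name in the grant below is a placeholder:

oc describe scc ibm-anyuid-scc
oc adm policy add-scc-to-user <scc-name> -z default -n hfwen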