BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (This includes work for: 1. Platform Experience, 2. Developer Experience 3. Platform Operations/OCP 3)

Investigate disabling copied CSVs to support large clusters #3006

Closed StevenBarre closed 8 months ago

StevenBarre commented 2 years ago

Describe the issue https://docs.openshift.com/container-platform/4.10/release_notes/ocp-4-10-release-notes.html#ocp-4-10-copied-csvs

When an Operator is installed by Operator Lifecycle Manager (OLM), a simplified copy of its cluster service version (CSV) is created in every namespace that the Operator is configured to watch. These CSVs are known as copied CSVs; they identify controllers that are actively reconciling resource events in a given namespace.

On large clusters, with namespaces and installed Operators potentially in the hundreds or thousands, copied CSVs can consume an untenable amount of resources, such as OLM’s memory usage, cluster etcd limits, and networking bandwidth. To support these larger clusters, cluster administrators can now disable copied CSVs for Operators that are installed with the AllNamespaces mode.

For more details, see Configuring Operator Lifecycle Manager features.

Discuss this with Matt and investigate if we should implement this in our clusters.

What is the Value/Impact? Improved API and etcd performance in large clusters such as Silver.

What is the plan? How will this get completed? Read the docs, discuss implementation with Matt, test in LAB, and document how to implement this on PROD clusters during the upgrade.

Identify any dependencies: OCP 4.10 upgrade in CLAB.

Definition of done: copied CSVs disabled in LAB and the steps documented, or a comment here explaining why we won't use this feature.
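
To size the potential impact before testing, it may help to count how many copied CSVs a cluster currently holds. This is a minimal sketch, assuming OLM marks each copy with the `olm.copiedFrom` label (worth verifying on the target OCP version); the function name `count_copied_csvs` is hypothetical.

```shell
# Count copied CSVs across all namespaces.
# Assumption: OLM labels every copied CSV with "olm.copiedFrom".
count_copied_csvs() {
  # --no-headers gives one line per CSV; "No resources found" goes to stderr.
  oc get csv --all-namespaces -l olm.copiedFrom --no-headers 2>/dev/null \
    | wc -l | tr -d ' '
}
```

Running `count_copied_csvs` before and after the change gives a rough number to set beside the etcd dashboards.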

tmorik commented 1 year ago

Asked Matt about this in RC: https://chat.developer.gov.bc.ca/channel/devops-operations-lab?msg=CqfeWinoAiPEhH7P8

tmorik commented 1 year ago

Summary of my investigation:

These need to be confirmed in the Lab clusters.

tmorik commented 1 year ago

Disabling copied CSVs in AMS ROSA for testing.

ADVSOL-AMS/redhat-ods-operator ~ $ oc get OLMConfig cluster
NAME      AGE
cluster   260d
ADVSOL-AMS/redhat-ods-operator ~ $ oc apply -f - <<EOF
> apiVersion: operators.coreos.com/v1
> kind: OLMConfig
> metadata:
>   name: cluster
> spec:
>   features:
>     disableCopiedCSVs: true
> EOF
Warning: resource olmconfigs/cluster is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by oc apply. oc apply should only be used on resources created declaratively by either oc create --save-config or oc apply. The missing annotation will be patched automatically.
olmconfig.operators.coreos.com/cluster configured

// Status

$ oc get OLMConfig -o yaml
apiVersion: v1
items:
- apiVersion: operators.coreos.com/v1
  kind: OLMConfig
  metadata:
<...>
  spec:
    features:
      disableCopiedCSVs: true
  status:
    conditions:
    - lastTransitionTime: "2023-04-14T23:47:23Z"
      message: Copied CSVs are disabled and no unexpected copied CSVs were found for
        operators installed in AllNamespace mode
      reason: CopiedCSVsDisabled
      status: "True"
      type: DisabledCopiedCSVs
<...>

// Events in openshift-operators also show the change:

ADVSOL-AMS/default ~ $ oc get events -n openshift-operators | grep DisabledCopiedCSVs
2m5s        Warning   DisabledCopiedCSVs   clusterserviceversion/devworkspace-operator.v0.19.1-0.1679521112.p   CSV copying disabled for openshift-operators/devworkspace-operator.v0.19.1-0.1679521112.p
2m5s        Warning   DisabledCopiedCSVs   clusterserviceversion/jaeger-operator.v1.43.0                        CSV copying disabled for openshift-operators/jaeger-operator.v1.43.0
2m5s        Warning   DisabledCopiedCSVs   clusterserviceversion/kiali-operator.v1.57.6                         CSV copying disabled for openshift-operators/kiali-operator.v1.57.6
2m5s        Warning   DisabledCopiedCSVs   clusterserviceversion/servicemeshoperator.v2.3.2                     CSV copying disabled for openshift-operators/servicemeshoperator.v2.3.2
2m5s        Warning   DisabledCopiedCSVs   clusterserviceversion/web-terminal.v1.7.0-0.1681197295.p             CSV copying disabled for openshift-operators/web-terminal.v1.7.0-0.1681197295.p

I will keep this running over the weekend and apply it to KLAB/KLAB2 next week.
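
The status check above can also be scripted rather than read from the full YAML. This is a minimal sketch, assuming the `DisabledCopiedCSVs` condition type shown in the output above; the function name `copied_csvs_disabled` is hypothetical.

```shell
# Succeed (exit 0) only when the OLMConfig reports that copied CSVs are
# disabled and none remain, i.e. the DisabledCopiedCSVs condition is "True".
copied_csvs_disabled() {
  status="$(oc get olmconfig cluster \
    -o jsonpath='{.status.conditions[?(@.type=="DisabledCopiedCSVs")].status}')"
  [ "$status" = "True" ]
}
```

For example, `copied_csvs_disabled && echo "copied CSVs are disabled"` could be used in a post-upgrade check.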

tmorik commented 1 year ago

Some more testing on ROSA cluster - disable/enable Copied CSV

// When `disableCopiedCSVs: true` - No CSVs can be seen in the user namespace
ADVSOL-AMS/tats ~ $ oc get csv
No resources found in tats namespace.

// Re-enable copied CSVs: `disableCopiedCSVs: false`.
apiVersion: v1
items:
- apiVersion: operators.coreos.com/v1
  kind: OLMConfig
  metadata:
<...>
  spec:
    features:
      disableCopiedCSVs: false

// Copied CSVs come back in the user namespace
ADVSOL-AMS/tats ~ $ oc get csv
NAME                                           DISPLAY                            VERSION                 REPLACES                                  PHASE
devworkspace-operator.v0.19.1-0.1679521112.p   DevWorkspace Operator              0.19.1+0.1679521112.p   devworkspace-operator.v0.19.1             Succeeded
elasticsearch-operator.v5.6.4                  OpenShift Elasticsearch Operator   5.6.4                   elasticsearch-operator.v5.6.3             Succeeded
jaeger-operator.v1.44.0                        Community Jaeger Operator          1.44.0                  jaeger-operator.v1.43.0                   Succeeded
kiali-operator.v1.57.6                         Kiali Operator                     1.57.6                  kiali-operator.v1.57.5                    Succeeded
observability-operator.v0.0.20                 Observability Operator             0.0.20                  observability-operator.v0.0.19            Succeeded
rhods-operator.1.24.0                          Red Hat OpenShift Data Science     1.24.0                  rhods-operator.1.23.0                     Succeeded
route-monitor-operator.v0.1.493-a866e7c        Route Monitor Operator             0.1.493-a866e7c         route-monitor-operator.v0.1.489-7d9fe90   Succeeded
servicemeshoperator.v2.3.2                     Red Hat OpenShift Service Mesh     2.3.2-0                 servicemeshoperator.v2.3.1                Succeeded
web-terminal.v1.7.0-0.1681197295.p             Web Terminal                       1.7.0+0.1681197295.p    web-terminal.v1.7.0                       Succeeded
ADVSOL-AMS/tats ~ $

Web Console

With copied CSVs enabled (disableCopiedCSVs: false):

image.png

With copied CSVs disabled (disableCopiedCSVs: true):

image.png

You can still install an Operator from OperatorHub:

image.png

Installed:

image.png

CLI
ADVSOL-AMS/tats ~ $ oc get csv
NAME                              DISPLAY                       VERSION   REPLACES   PHASE
cloud-native-postgresql.v1.19.1   EDB Postgres for Kubernetes   1.19.1               Succeeded

tmorik commented 1 year ago

I will run the same tests on K/CLAB next

tmorik commented 1 year ago

Tested on KLAB and KLAB2. Both gave the same results as the ROSA test above.

KLAB2 test

// create a project
NSX KLAB2/default ~ $ oc new-project tats-test
Now using project "tats-test" on server "https://api.klab2.devops.gov.bc.ca:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=k8s.gcr.io/e2e-test-images/agnhost:2.33 -- /agnhost serve-hostname

// check copied CSV
NSX KLAB2/tats-test ~ $ oc get csv
NAME                                             DISPLAY                                           VERSION                 REPLACES                                       PHASE
amqstreams.v2.3.0-3                              AMQ Streams                                       2.3.0-3                 amqstreams.v2.3.0-2                            Succeeded
devworkspace-operator.v0.19.1-0.1679521112.p     DevWorkspace Operator                             0.19.1+0.1679521112.p   devworkspace-operator.v0.18.1-0.1675929565.p   Succeeded
eap-operator.v2.3.10                             JBoss EAP                                         2.3.10                  eap-operator.v2.3.9                            Succeeded
elasticsearch-operator.5.5.7                     OpenShift Elasticsearch Operator                  5.5.7                   elasticsearch-operator.5.5.6                   Succeeded
must-gather-operator.v1.1.2                      Must Gather Operator                              1.1.2                   must-gather-operator.v1.1.1                    Succeeded
openshift-gitops-operator.v1.6.6                 Red Hat OpenShift GitOps                          1.6.6                   openshift-gitops-operator.v1.6.5               Succeeded
openshift-pipelines-operator-rh.v1.8.2           Red Hat OpenShift Pipelines                       1.8.2                                                                  Succeeded
red-hat-camel-k-operator.v1.8.2-0.1675913507.p   Red Hat Integration - Camel K                     1.8.2+0.1675913507.p    red-hat-camel-k-operator.v1.6.10               Succeeded
rhacs-operator.v3.74.2                           Advanced Cluster Security for Kubernetes          3.74.2                  rhacs-operator.v3.74.1                         Succeeded
service-registry-operator.v2.1.4                 Red Hat Integration - Service Registry Operator   2.1.4                   service-registry-operator.v2.1.3               Succeeded
NSX KLAB2/tats-test ~ $

// Disable copied CSV

oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1
kind: OLMConfig
metadata:
  name: cluster
spec:
  features:
    disableCopiedCSVs: true 
EOF

// OLMConfig - `disableCopiedCSVs: true`
NSX KLAB2/tats-test ~ $ oc get OLMConfig cluster -o yaml
apiVersion: operators.coreos.com/v1
kind: OLMConfig
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"operators.coreos.com/v1","kind":"OLMConfig","metadata":{"annotations":{},"name":"cluster"},"spec":{"features":{"disableCopiedCSVs":true}}}
    release.openshift.io/create-only: "true"
  creationTimestamp: "2023-02-01T20:59:41Z"
  generation: 2
  name: cluster
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 4a402525-1447-400f-858e-d05940feef96
  resourceVersion: "442952687"
  uid: d8740def-bf6a-4a2c-bec5-9021853d44eb
spec:
  features:
    disableCopiedCSVs: true
status:
  conditions:
  - lastTransitionTime: "2023-02-01T21:00:25Z"
    message: Copied CSVs are disabled and at least one copied CSV was found for an
      operator installed in AllNamespace mode
    reason: CopiedCSVsFound
    status: "False"
    type: DisabledCopiedCSVs

// After a few minutes, all copied CSVs are gone.
NSX KLAB2/tats-test ~ $ oc get csv
No resources found in tats-test namespace.

KLAB test

// Create a new project
KLAB/tats-test ~ $ oc new-project tats-dev
Now using project "tats-dev" on server "https://api.klab.devops.gov.bc.ca:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=k8s.gcr.io/e2e-test-images/agnhost:2.33 -- /agnhost serve-hostname

// Copied CSVs are created in the new namespace.
KLAB/tats-dev ~ $ oc get  csv
NAME                                           DISPLAY                                           VERSION                 REPLACES                                       PHASE
custom-metrics-autoscaler.v2.8.2-174           Custom Metrics Autoscaler                         2.8.2-174               custom-metrics-autoscaler.v2.7.1               Succeeded
devworkspace-operator.v0.19.1-0.1679521112.p   DevWorkspace Operator                             0.19.1+0.1679521112.p   devworkspace-operator.v0.18.1-0.1675929565.p   Succeeded
eap-operator.v2.3.10                           JBoss EAP                                         2.3.10                  eap-operator.v2.3.9                            Succeeded
elasticsearch-operator.5.5.5                   OpenShift Elasticsearch Operator                  5.5.5                   elasticsearch-operator.5.5.4                   Succeeded
must-gather-operator.v1.1.2                    Must Gather Operator                              1.1.2                   must-gather-operator.v1.1.1                    Succeeded
rhacs-operator.v3.74.2                         Advanced Cluster Security for Kubernetes          3.74.2                  rhacs-operator.v3.74.1                         Succeeded
service-registry-operator.v2.1.4               Red Hat Integration - Service Registry Operator   2.1.4                   service-registry-operator.v2.1.3               Succeeded
KLAB/tats-dev ~ $

// Current OLMConfig -- "CopiedCSVsEnabled"

KLAB/tats-dev ~ $ oc get OLMConfig cluster -o yaml
apiVersion: operators.coreos.com/v1
kind: OLMConfig
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    release.openshift.io/create-only: "true"
  creationTimestamp: "2022-08-31T20:23:05Z"
  generation: 1
  name: cluster
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: d576083b-6ab6-4947-b035-26db5a5abc31
  resourceVersion: "1706897937"
  uid: 91272420-b04d-4d66-a6be-2a2c6f676e15
status:
  conditions:
  - lastTransitionTime: "2022-08-31T20:25:41Z"
    message: Copied CSVs are enabled and present across the cluster
    reason: CopiedCSVsEnabled
    status: "False"
    type: DisabledCopiedCSVs

// Disable copied CSV

KLAB/tats-dev ~ $ oc apply -f - <<EOF
> apiVersion: operators.coreos.com/v1
> kind: OLMConfig
> metadata:
>   name: cluster
> spec:
>   features:
>     disableCopiedCSVs: true
> EOF
Warning: resource olmconfigs/cluster is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by oc apply. oc apply should only be used on resources created declaratively by either oc create --save-config or oc apply. The missing annotation will be patched automatically.
olmconfig.operators.coreos.com/cluster configured

// New OLMConfig -- "DisabledCopiedCSVs"

KLAB/tats-dev ~ $ oc get OLMConfig cluster -o yaml
apiVersion: operators.coreos.com/v1
kind: OLMConfig
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"operators.coreos.com/v1","kind":"OLMConfig","metadata":{"annotations":{},"name":"cluster"},"spec":{"features":{"disableCopiedCSVs":true}}}
    release.openshift.io/create-only: "true"
  creationTimestamp: "2022-08-31T20:23:05Z"
  generation: 2
  name: cluster
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: d576083b-6ab6-4947-b035-26db5a5abc31
  resourceVersion: "1708553827"
  uid: 91272420-b04d-4d66-a6be-2a2c6f676e15
spec:
  features:
    disableCopiedCSVs: true
status:
  conditions:
  - lastTransitionTime: "2022-08-31T20:25:41Z"
    message: Copied CSVs are disabled and at least one copied CSV was found for an
      operator installed in AllNamespace mode
    reason: CopiedCSVsFound
    status: "False"
    type: DisabledCopiedCSVs

// After a few minutes, all copied CSVs are gone.
KLAB/tats-dev ~ $ oc get csv
No resources found in tats-dev namespace.
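
Instead of rechecking by hand after a few minutes, the cleanup can be awaited. This is a minimal sketch, assuming the `DisabledCopiedCSVs` condition turns `True` once OLM has removed the copies, as it did in the ROSA output earlier in this thread. The function name and the 10-minute default are hypothetical.

```shell
# Block until the OLMConfig reports DisabledCopiedCSVs=True or the timeout expires.
# $1: optional timeout (default 10m, an arbitrary choice for illustration).
wait_for_copied_csvs_disabled() {
  oc wait olmconfig/cluster \
    --for=condition=DisabledCopiedCSVs \
    --timeout="${1:-10m}"
}
```

On a large cluster the cleanup may take longer than in the lab, so a larger timeout such as `wait_for_copied_csvs_disabled 30m` may be needed.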

tmorik commented 1 year ago

Existing apps are not affected by this change.

For instance, the grafana-operator below was installed directly in the openshift-bcgov-grafana-test namespace before copied CSVs were disabled. It is still visible in the Installed Operators list after the copied CSVs were removed, because it was installed in this namespace directly.

image.png

...and apps are up and running as before.

image.png

https://grafana-route-openshift-bcgov-grafana-test.apps.klab.devops.gov.bc.ca/d/SPNWEKL7k/haproxy-router-2?orgId=1&from=now-5m&to=now&refresh=5m

I will keep copied CSVs disabled on these clusters for a while and see if anyone screams.

tmorik commented 1 year ago

Checked the etcd dashboards on KLAB and KLAB2. DB size and memory usage have decreased on both clusters since copied CSVs were disabled.

KLAB

image.png

KLAB2

image.png

tmorik commented 1 year ago

Since these are relatively small clusters, the difference appears small. On larger clusters such as Silver, we may see much larger differences, in other metrics as well as in DB size and memory usage.

tmorik commented 1 year ago

FYI - These are the CLAB and SILVER clusters' graphs for the same time period, for comparison. They still have copied CSVs in all namespaces.

CLAB

image.png

SILVER

image.png

tmorik commented 1 year ago

Copied CSVs are now also disabled in CLAB, to test this on OCP 4.12.

tmorik commented 1 year ago

Doc PR: https://github.com/bcgov-c/advsol-docs/pull/268

tmorik commented 1 year ago

One customer in CLAB asked about copied CSVs because they no longer have visibility into their copied CSVs. We'll wait to see what that customer says about copied CSVs being disabled in CLAB. If they are OK with it, we'll add disabling copied CSVs to the upgrade docs and roll it into that change.

tmorik commented 1 year ago

Waiting for customer evaluation.

StevenBarre commented 1 year ago

Disabling copied CSVs causes the operators to not show in the admin web ui for customers. This can make it more difficult to manage custom resources for these operators. Let's roll back the change in LAB and then close this ticket.

tmorik commented 1 year ago

OK. Re-enabled (rolled back) copied CSVs in all Lab clusters.

oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1
kind: OLMConfig
metadata:
  name: cluster
spec:
  features:
    disableCopiedCSVs: false
EOF

Copied CSVs are now back in each namespace:

CLAB/openshift-config ~ $ oc get csv -n te1690b-test
NAME                                              DISPLAY                                           VERSION                 REPLACES                                         PHASE
amqstreams.v2.3.0-3                               AMQ Streams                                       2.3.0-3                 amqstreams.v2.3.0-2                              Succeeded
custom-metrics-autoscaler.v2.8.2-174              Custom Metrics Autoscaler                         2.8.2-174               custom-metrics-autoscaler.v2.7.1                 Succeeded
devworkspace-operator.v0.20.0                     DevWorkspace Operator                             0.20.0                  devworkspace-operator.v0.19.1-0.1682321189.p     Succeeded
eap-operator.v2.3.10                              JBoss EAP                                         2.3.10                  eap-operator.v2.3.9                              Succeeded
elasticsearch-operator.v5.6.5                     OpenShift Elasticsearch Operator                  5.6.5                   elasticsearch-operator.v5.6.4                    Succeeded
must-gather-operator.v1.1.2                       Must Gather Operator                              1.1.2                   must-gather-operator.v1.1.1                      Succeeded
openshift-gitops-operator.v1.7.4                  Red Hat OpenShift GitOps                          1.7.4                   openshift-gitops-operator.v1.7.3                 Succeeded
red-hat-camel-k-operator.v1.10.0-0.1682325781.p   Red Hat Integration - Camel K                     1.10.0+0.1682325781.p   red-hat-camel-k-operator.v1.8.2-0.1675913507.p   Succeeded
rhacs-operator.v3.74.3                            Advanced Cluster Security for Kubernetes          3.74.3                  rhacs-operator.v3.74.2                           Succeeded
service-registry-operator.v2.1.5                  Red Hat Integration - Service Registry Operator   2.1.5                   service-registry-operator.v2.1.4                 Succeeded
web-terminal.v1.7.0-0.1682321121.p                Web Terminal                                      1.7.0+0.1682321121.p    web-terminal.v1.6.0                              Succeeded
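
The enable and disable toggle used throughout this thread can also be done with a merge patch, which should avoid the `last-applied-configuration` warning that `oc apply` printed earlier. This is a minimal sketch rather than the documented procedure; the function name `set_copied_csvs_disabled` is hypothetical.

```shell
# Set spec.features.disableCopiedCSVs on the cluster-scoped OLMConfig.
# $1 must be the literal string "true" or "false".
set_copied_csvs_disabled() {
  case "$1" in
    true|false) ;;
    *) echo "usage: set_copied_csvs_disabled true|false" >&2; return 2 ;;
  esac
  # A JSON merge patch only touches the one field and leaves the rest as-is.
  oc patch olmconfig cluster --type=merge \
    -p "{\"spec\":{\"features\":{\"disableCopiedCSVs\":$1}}}"
}
```

With this, the rollback above corresponds to `set_copied_csvs_disabled false`.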

tmorik commented 1 year ago

I will close this ticket.

tmorik commented 1 year ago

FYI - It looks like on OCP 4.13, copied CSVs will still be shown in the user's web console even when they are disabled.

https://docs.openshift.com/container-platform/4.13/release_notes/ocp-4-13-release-notes.html#ocp-4-13-olm-disabled-csvs

When copied CSVs are disabled by a cluster administrator, the web console is modified to show copied CSVs from the openshift namespace in every namespace for regular users, even though the CSVs are not actually copied to every namespace. This allows regular users to still be able to view the details of these Operators in their namespaces and create custom resources (CRs) brought in by globally installed Operators.

tmorik commented 1 year ago

Reopened this issue. We will try this again after the OCP 4.13 upgrade.

StevenBarre commented 10 months ago

Copied CSVs have been disabled in Silver due to performance issues. We will re-attempt this in all clusters with the 4.13 upgrade in January.

StevenBarre commented 8 months ago

https://github.com/bcgov-c/platform-gitops-gen/pull/797