BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (this includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3)
Apache License 2.0

Red Hat Integration Upgrade Issues - Checking into Downgrade options for AG #4589

Closed wmhutchison closed 5 months ago

wmhutchison commented 8 months ago

Describe the issue Through a combination of factors, AG has not had the chance to properly test a non-production variant of their SILVER application against the Red Hat Integration (AMQ Streams) upgrade from v2.4.0 to v2.6.0. Thus an inquiry was received to see if we can downgrade the version of AMQ Streams on EMERALD back to v2.4.0.

Reason for Block AG has agreed to work directly with Red Hat to address their upgrade issues in EMERALD. This ticket will remain open until this is complete as a tracker for any resource requirements from us during this phase.

Additional context Vendor case opened with Red Hat. Noted that AG was using EMERALD as the LAB variant for SILVER, which is not the correct cluster for this scenario; that should be KLAB/CLAB instead. Will push to have the LAB environment moved from EMERALD to KLAB/CLAB after this issue is resolved.

How does this benefit the users of our platform?

Definition of done

wmhutchison commented 7 months ago

Red Hat case was opened late last week regarding this issue. Found an old knowledge base article suggesting that it was once possible to downgrade this specific operator using OLM, but that is no longer the case.

For now, the only non-destructive way to get a cluster with a downgraded version of AMQ Streams is to downgrade KLAB2, which matches EMERALD in underlying technologies and has no one using it for Integration software yet. AG can then request a license plate for that cluster, copy over their non-upgraded Kafka cluster, and work with us through the upgrade path to bring KLAB2 back to a version similar to EMERALD's. Once all of this is done and SILVER is eventually upgraded to match, AG can use KLAB/CLAB going forward as their official testing platform for SILVER. If they still want to maintain EMERALD as a future PROD environment, then keeping KLAB2 will make sense.

If this is unacceptable, then the sole alternative is to perform an uninstall/reinstall using OLM, which will nuke all AG Kafka workloads. If they are willing and able to accept this as an alternative, then this would be Plan B. Plan A is getting KLAB2 prepped with the desired older version.

wmhutchison commented 7 months ago

The vendor case mentioned that our current version of AMQ Streams wasn't the very latest available, and that we should apply that update first. Applying it today during the EMERALD RFC for upgrading OpenShift; will then update the case stating this is done and ask for next steps on David's upgrade issue.

wmhutchison commented 7 months ago

Some back and forth between AG and Red Hat Support has revealed the root cause of the breakage on EMERALD: AG is using the installed Operator's CRDs, but then deploying resources against those CRDs through their own Helm charts rather than through the Operator.

Vendor instructions are to configure the operator so that it only watches specific namespaces when processing CRs. Their instructions keep defaulting to manual manifest files, while we use OLM to install and manage supported operators like this one.

To that end, the following doc is likely the path we would take to get the same effect:

https://docs.openshift.com/container-platform/4.12/operators/understanding/olm/olm-understanding-operatorgroups.html#olm-operatorgroups-target-namespace_olm-understanding-operatorgroups
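
For illustration, what the doc describes amounts to scoping the OperatorGroup via spec.targetNamespaces so that member operators only reconcile CRs in the listed namespaces. The object names below are illustrative sketches, not the actual Emerald resources:

```yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: integration-operators        # illustrative name
  namespace: openshift-bcgov-integration
spec:
  # Member operators watch and reconcile CRs only in these namespaces.
  targetNamespaces:
    - openshift-bcgov-integration
    - e5ced5-test                    # example tenant namespace
```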

wmhutchison commented 7 months ago

The above link also states that the ability to select multiple namespaces in whatever fashion desired is going away:

Listing multiple namespaces via spec.targetNamespaces or use of a label selector via spec.selector is not recommended, as the support for more than one target namespace in an Operator group will likely be removed in a future release.
StevenBarre commented 6 months ago

Attempted to set the operator to only watch select namespaces by editing the OperatorGroup, but that broke the CamelK operator as it doesn't support multi-namespace mode.

wmhutchison commented 6 months ago

The issue remains ongoing, with AG continuing to work on removing Kafka-related artifacts not involved in the problem. The Platform Operations (DXC) team continues to feed the vendor case with new operator pod logs to help move things along.

wmhutchison commented 6 months ago

Pending AG's scheduling, a Teams invite will likely go out next week to include AG, Platform Operations, and Red Hat Support, to hopefully move the troubleshooting process forward at a faster pace.

StevenBarre commented 6 months ago

Found that the netpols created by the AMQ operator are being rejected by NCP. Opened a case with VMware.

StevenBarre commented 5 months ago

VMware says we should set enable_mixed_expression_groups = true in the NCP config to get around the error, but the docs also warn this could create performance issues: the cost comes from calculating which pods match a netpol when the expressions are too complex.
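
For reference, the flag lives in the ncp.ini carried by NCP's ConfigMap. The ConfigMap name, namespace, and ini section below are assumptions (they vary by NCP install); this is only a sketch of where the setting goes:

```yaml
# Sketch only: metadata names and the ini section are assumptions;
# consult the cluster's actual NCP deployment before applying.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nsx-ncp-config        # assumed name
  namespace: nsx-system       # assumed namespace
data:
  ncp.ini: |
    [nsx_v3]
    # Allow NSX groups mixing pod and namespace selector expressions;
    # VMware warns this can slow netpol-to-pod matching on complex rules.
    enable_mixed_expression_groups = True
```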

Testing this option in the KLAB2 cluster with a Kafka object in e5ced5-test. Enabling that setting does allow the operator created network policy to be created.

StevenBarre commented 5 months ago

To get the Kafka cluster created, it also needed to be adjusted to set the DataClass labels. Just showing the extra bits added below:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: events-test-e5ced5
  namespace: e5ced5-test
spec:
  entityOperator:
    template:
      pod:
        metadata:
          labels:
            DataClass: Medium
  kafka:
    template:
      pod:
        metadata:
          labels:
            DataClass: Medium
  zookeeper:
    template:
      pod:
        metadata:
          labels:
            DataClass: Medium
StevenBarre commented 5 months ago

The AMQ operator only creates Ingress netpols, but NSX requires that both Ingress and Egress netpols exist. Added these netpols to clear up all the DROPs observed in the NSX firewall logs:

---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: advsol-events-test-e5ced5-network-policy-zookeeper
  namespace: e5ced5-test
spec:
  podSelector:
    matchLabels:
      strimzi.io/cluster: events-test-e5ced5
      strimzi.io/kind: Kafka
      strimzi.io/name: events-test-e5ced5-zookeeper
  egress:
    - ports:
        - protocol: TCP
          port: 2888
      to:
        - podSelector:
            matchLabels:
              strimzi.io/cluster: events-test-e5ced5
              strimzi.io/kind: Kafka
              strimzi.io/name: events-test-e5ced5-zookeeper
    - ports:
        - protocol: TCP
          port: 3888
      to:
        - podSelector:
            matchLabels:
              strimzi.io/cluster: events-test-e5ced5
              strimzi.io/kind: Kafka
              strimzi.io/name: events-test-e5ced5-zookeeper
    - ports:
        - protocol: TCP
          port: 2181
      to:
        - podSelector:
            matchLabels:
              strimzi.io/cluster: events-test-e5ced5
              strimzi.io/kind: Kafka
              strimzi.io/name: events-test-e5ced5-zookeeper
  policyTypes:
    - Egress
---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: advsol-events-test-e5ced5-network-policy-kafka
  namespace: e5ced5-test
spec:
  podSelector:
    matchLabels:
      strimzi.io/cluster: events-test-e5ced5
      strimzi.io/kind: Kafka
      strimzi.io/name: events-test-e5ced5-kafka
  egress:
    - ports:
        - protocol: TCP
          port: 2181
      to:
        - podSelector:
            matchLabels:
              strimzi.io/cluster: events-test-e5ced5
              strimzi.io/kind: Kafka
              strimzi.io/name: events-test-e5ced5-zookeeper
    - ports:
        - protocol: TCP
          port: 9090
      to:
        - podSelector:
            matchLabels:
              strimzi.io/cluster: events-test-e5ced5
              strimzi.io/kind: Kafka
              strimzi.io/name: events-test-e5ced5-kafka
    - ports:
        - protocol: TCP
          port: 9091
      to:
        - podSelector:
            matchLabels:
              strimzi.io/cluster: events-test-e5ced5
              strimzi.io/kind: Kafka
              strimzi.io/name: events-test-e5ced5-kafka
  policyTypes:
    - Egress
---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: advsol-events-test-e5ced5-entity-operator
  namespace: e5ced5-test
spec:
  podSelector:
    matchLabels:
      strimzi.io/cluster: events-test-e5ced5
      strimzi.io/kind: Kafka
      strimzi.io/name: events-test-e5ced5-entity-operator
  egress:
    - ports:
        - protocol: TCP
          port: 9091
      to:
        - podSelector:
            matchLabels:
              strimzi.io/cluster: events-test-e5ced5
              strimzi.io/kind: Kafka
              strimzi.io/name: events-test-e5ced5-kafka
    - ports:
        - protocol: TCP
          port: 2181
      to:
        - podSelector:
            matchLabels:
              strimzi.io/cluster: events-test-e5ced5
              strimzi.io/kind: Kafka
              strimzi.io/name: events-test-e5ced5-zookeeper
  policyTypes:
    - Egress
wmhutchison commented 5 months ago

A test Kafka instance in KLAB2 is now realized with the suggested work-around in place. Unfortunately, VMware has no viable way to manage the risk involved with this change, meaning it's just not viable to put this work-around in place and leave it there, since there's no way to determine from logging whether a future network issue is due to the work-around or caused by something else.

A formal fix is in progress, but not at a stage where an ETA can be announced.

At present, this comes down to weighing an unknown level of risk of network issues from the work-around against business needs. More internal discussion is required before we follow up again with AG.

StevenBarre commented 5 months ago

Disabled enable_mixed_expression_groups in KLAB2 and tried fixing this via the AMQ/Kafka operator instead. Added to the Subscription:

spec:
  config:
    env:
    - name: STRIMZI_OPERATOR_NAMESPACE_LABELS
      value: kubernetes.io/metadata.name=openshift-bcgov-integration

This caused the operator to adjust the generated netpols to not have namespaceSelector: {} in them, reducing the size of the expression groups.
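
The effect on the generated policies can be sketched roughly as below. This is illustrative only: the real operator-generated netpols contain more rules, and the pod selector label shown is an assumption:

```yaml
# Before (operator default): traffic admitted from any namespace,
# which NCP expands into a very large expression group.
# ingress:
#   - from:
#       - namespaceSelector: {}
#         podSelector:
#           matchLabels:
#             strimzi.io/kind: cluster-operator   # assumed label
#
# After STRIMZI_OPERATOR_NAMESPACE_LABELS is set, the generated rule
# is narrowed to the operator's namespace only:
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: openshift-bcgov-integration
        podSelector:
          matchLabels:
            strimzi.io/kind: cluster-operator     # assumed label
```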

StevenBarre commented 5 months ago

Fix applied manually to Emerald, and added to CCM https://github.com/bcgov-c/platform-gitops-gen/pull/845

Documented Kafka setup in Emerald https://stackoverflow.developer.gov.bc.ca/a/1219/42

Posted in RC asking teams to test and verify