Closed by wmhutchison 5 months ago
A Red Hat case was opened late last week regarding this issue. Found an old knowledge base article suggesting that it was once possible to downgrade this specific operator using OLM, but that is no longer the case.
For now the only non-destructive method of getting a cluster with a downgraded version of AMQ Streams is to downgrade KLAB2, which matches EMERALD in underlying technologies and has no one using it for Integration software yet. AG can then request a license plate for that cluster, copy over their non-upgraded Kafka cluster, and work with us through the upgrade path to bring KLAB2 back to a version similar to EMERALD's. Once all of this is done and SILVER is eventually upgraded to match, AG can use KLAB/CLAB going forward as their official testing platform for SILVER. If they still want to maintain EMERALD as a future PROD environment, then keeping KLAB2 will make sense.
If this is unacceptable, then the sole alternative is to perform an un-install/re-install using OLM, which will nuke all AG Kafka workloads. If they are willing/able to accept this as an alternative, then this would be Plan B. Plan A is getting KLAB2 prepped with the desired older version.
The vendor case noted that our installed version of AMQ Streams isn't the latest available, and that we should apply the latest version first. Applying it today during the EMERALD RFC for upgrading OpenShift; we will then update the case to state this is done and ask for next steps on David's upgrade issue.
Some back and forth between AG and Red Hat Support has revealed the root cause of the breakage on EMERALD: AG is leveraging the installed Operator's CRDs, but deploying resources against those CRDs through Helm on their own rather than through the Operator.
Vendor instructions are to modify the operator so that it only looks at specific namespaces when processing CRs. Their instructions keep defaulting to manual manifest files, whereas we use OLM to install and manage supported operators like this one.
To that end, the following docs will likely be the path taken if we pursue this to get the same effect.
The above link also states that the ability to select multiple namespaces in whatever fashion desired is going away:
Listing multiple namespaces via spec.targetNamespaces or use of a label selector via spec.selector is not recommended, as the support for more than one target namespace in an Operator group will likely be removed in a future release.
Attempted to restrict the operator to selected namespaces by editing the OperatorGroup, but that broke the CamelK operator, as it doesn't support multi-namespace mode.
Issue remains ongoing, with AG continuing to work on removing Kafka artifacts that are not involved. The Platform Operations (DXC) team continues to feed the vendor case with new operator pod logs to help move things along.
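For reference, an OperatorGroup edited this way would look roughly like the sketch below. The resource name and the second target namespace are hypothetical placeholders, and the operator namespace is an assumption taken from the namespace label used later in this ticket. Listing more than one entry under spec.targetNamespaces puts the group into multi-namespace mode, which every operator sharing the OperatorGroup has to support.

```yaml
# Hypothetical sketch, not the manifest that was applied.
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: example-integration-operatorgroup   # placeholder name
  namespace: openshift-bcgov-integration    # assumed operator namespace
spec:
  targetNamespaces:
    - e5ced5-test          # test namespace used in this ticket
    - example-other-ns     # placeholder for a second namespace
```

Because an OperatorGroup applies to all operators installed in its namespace, an operator that lacks support for this mode can fail once the group is changed.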
Pending AG's creation of such, a Teams invite will likely go out next week to include AG, Platform Operations and Red Hat Support to hopefully move the troubleshooting process forward at a faster pace.
Found that the netpols created by the AMQ operator are being rejected by NCP. Opened a case with VMware.
VMware says we should set enable_mixed_expression_groups = true in the NCP config to get around the error, but the docs also warn this could create performance issues. The concern is the cost of calculating which pods match the netpol if the expressions are too complex.
Testing this option in the KLAB2 cluster with a Kafka object in e5ced5-test. Enabling that setting does allow the operator-created network policy to be created.
To finish creating the Kafka cluster, the Kafka object also had to be adjusted to set the DataClass labels. Only the added parts are shown below:
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: events-test-e5ced5
  namespace: e5ced5-test
spec:
  entityOperator:
    template:
      pod:
        metadata:
          labels:
            DataClass: Medium
  kafka:
    template:
      pod:
        metadata:
          labels:
            DataClass: Medium
  zookeeper:
    template:
      pod:
        metadata:
          labels:
            DataClass: Medium
The AMQ operator only creates Ingress netpols, but NSX requires that both Ingress and Egress netpols exist. Added these netpols to clear up all the DROPs observed in the NSX firewall logs
---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: advsol-events-test-e5ced5-network-policy-zookeeper
  namespace: e5ced5-test
spec:
  podSelector:
    matchLabels:
      strimzi.io/cluster: events-test-e5ced5
      strimzi.io/kind: Kafka
      strimzi.io/name: events-test-e5ced5-zookeeper
  egress:
    - ports:
        - protocol: TCP
          port: 2888
      to:
        - podSelector:
            matchLabels:
              strimzi.io/cluster: events-test-e5ced5
              strimzi.io/kind: Kafka
              strimzi.io/name: events-test-e5ced5-zookeeper
    - ports:
        - protocol: TCP
          port: 3888
      to:
        - podSelector:
            matchLabels:
              strimzi.io/cluster: events-test-e5ced5
              strimzi.io/kind: Kafka
              strimzi.io/name: events-test-e5ced5-zookeeper
    - ports:
        - protocol: TCP
          port: 2181
      to:
        - podSelector:
            matchLabels:
              strimzi.io/cluster: events-test-e5ced5
              strimzi.io/kind: Kafka
              strimzi.io/name: events-test-e5ced5-zookeeper
  policyTypes:
    - Egress
---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: advsol-events-test-e5ced5-network-policy-kafka
  namespace: e5ced5-test
spec:
  podSelector:
    matchLabels:
      strimzi.io/cluster: events-test-e5ced5
      strimzi.io/kind: Kafka
      strimzi.io/name: events-test-e5ced5-kafka
  egress:
    - ports:
        - protocol: TCP
          port: 2181
      to:
        - podSelector:
            matchLabels:
              strimzi.io/cluster: events-test-e5ced5
              strimzi.io/kind: Kafka
              strimzi.io/name: events-test-e5ced5-zookeeper
    - ports:
        - protocol: TCP
          port: 9090
      to:
        - podSelector:
            matchLabels:
              strimzi.io/cluster: events-test-e5ced5
              strimzi.io/kind: Kafka
              strimzi.io/name: events-test-e5ced5-kafka
    - ports:
        - protocol: TCP
          port: 9091
      to:
        - podSelector:
            matchLabels:
              strimzi.io/cluster: events-test-e5ced5
              strimzi.io/kind: Kafka
              strimzi.io/name: events-test-e5ced5-kafka
  policyTypes:
    - Egress
---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: advsol-events-test-e5ced5-entity-operator
  namespace: e5ced5-test
spec:
  podSelector:
    matchLabels:
      strimzi.io/cluster: events-test-e5ced5
      strimzi.io/kind: Kafka
      strimzi.io/name: events-test-e5ced5-entity-operator
  egress:
    - ports:
        - protocol: TCP
          port: 9091
      to:
        - podSelector:
            matchLabels:
              strimzi.io/cluster: events-test-e5ced5
              strimzi.io/kind: Kafka
              strimzi.io/name: events-test-e5ced5-kafka
    - ports:
        - protocol: TCP
          port: 2181
      to:
        - podSelector:
            matchLabels:
              strimzi.io/cluster: events-test-e5ced5
              strimzi.io/kind: Kafka
              strimzi.io/name: events-test-e5ced5-zookeeper
  policyTypes:
    - Egress
A test Kafka instance in KLAB2 is now realized with the suggested work-around in place. Unfortunately, VMware has no viable way to manage the risk involved with this change, so the work-around cannot simply be put in place and left there: there is no way to determine from logging whether a future network issue is due to this work-around or caused by something else.
A formal fix is in progress, but not at a stage where an ETA can be announced.
At present, this comes down to weighing an unknown level of risk from possible network issues caused by the work-around against business needs. More internal discussion is required before we follow up again with AG.
Disabled enable_mixed_expression_groups in KLAB2 and tried fixing this via the AMQ/Kafka operator. Added to the Subscription:
spec:
  config:
    env:
      - name: STRIMZI_OPERATOR_NAMESPACE_LABELS
        value: kubernetes.io/metadata.name=openshift-bcgov-integration
This caused the operator to adjust the generated netpols to not have namespaceSelector: {} in them, reducing the size of the expression groups.
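To illustrate the effect, the fragment below sketches the kind of ingress peer involved. It is a hand-written approximation, not actual operator output: the pod label and the exact shape of the rule are assumptions, and only the namespace label comes from the Subscription value above.

```yaml
# Illustrative fragment only, not copied from a generated policy.
# Without STRIMZI_OPERATOR_NAMESPACE_LABELS, the operator peer matches every namespace:
#   - namespaceSelector: {}
#     podSelector: ...
# With the variable set, the peer is narrowed to namespaces carrying the label:
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: openshift-bcgov-integration
        podSelector:
          matchLabels:
            strimzi.io/kind: cluster-operator   # assumed operator pod label
```

A selector with explicit matchLabels gives NCP a narrower expression to evaluate than the match-all form, which is consistent with the smaller expression groups observed.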
Fix applied manually to Emerald, and added to CCM https://github.com/bcgov-c/platform-gitops-gen/pull/845
Documented Kafka setup in Emerald https://stackoverflow.developer.gov.bc.ca/a/1219/42
Posted in RC asking teams to test and verify
Describe the issue
Through a combination of factors, AG has not had the chance to properly test the upgrade of Red Hat Integration (AMQ Streams) from v2.4.0 to v2.6.0 against a non-production variant of their SILVER application. Thus an inquiry was received to see if we can downgrade the version of AMQ Streams on EMERALD back down to v2.4.0.
Reason for Block
AG has agreed to work directly with Red Hat to address their upgrade issues in EMERALD. This ticket will remain open until this is complete, as a tracker for any resource requirements from us during this phase.
Additional context
Vendor case opened with Red Hat. Noted that AG was using EMERALD as the LAB variant for SILVER, which is not the correct cluster for this scenario; that should be KLAB/CLAB instead. Will push to have the LAB environment moved from EMERALD to KLAB/CLAB after this issue is resolved.
How does this benefit the users of our platform?
Definition of done