BCDevOps / OpenShift4-RollOut

This is the primary board for all activities related to the roll out of OpenShift 4
Apache License 2.0

KLAB2 OCP4 - Complete configuration of KLAB2 cluster #495

Closed wmhutchison closed 3 years ago

wmhutchison commented 3 years ago

Describe the issue Now that the KLAB2 cluster build has matured to the point that the involved stakeholders no longer feel the need to rebuild it, work will proceed to complete the overall configuration so that the cluster is viable for the users involved in testing it with respect to NSX-T.

Which Sprint Goal is this issue related to?

Reason(s) for Being Blocked

Definition of done Checklist (where applicable)

wmhutchison commented 3 years ago

Thus far, the playbook changes will require the following tweaks.

  1. Do not run any portion of the playbooks that installs purchased certificates for the API/APP VIP URLs. These are not required for the OpenShift services to function, but end users will need to be aware of the missing certificates and plan accordingly. This is not hard to undo if a decision is made later to revisit it.

  2. Do not run any portion of the playbooks that installs Trident/NetApp support, and adjust all PVC locations so they do not use storageclass definitions that no longer apply. The current cluster-install method supports ESXi storage natively as a PVC storageclass; Trident will be revisited at a later date when we re-attempt building an NSX-T-aware cluster with multiple NICs.
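As a hedged sketch of verifying item 2 (the exact storageclass names depend on the cluster; `thin` is the usual name for the vSphere in-tree provisioner's class and is an assumption here):

```shell
# List available storage classes; on a vSphere cluster the in-tree
# provisioner typically exposes a class named "thin" (assumption).
oc get storageclass

# Mark the vSphere class as the default so PVCs that omit a
# storageClassName no longer try to bind against the removed Trident classes.
oc patch storageclass thin \
  -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

# Confirm no PVCs still reference a Trident/NetApp storage class.
oc get pvc --all-namespaces -o wide
```

These are read-mostly checks plus one annotation patch, so they are safe to re-run while iterating on the playbooks.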

wmhutchison commented 3 years ago

Not quite finished - will spill into Sprint 26 by a few business days.

wmhutchison commented 3 years ago

Blocking for now. While the node IP addresses have access to vSphere and the outside Internet, the SNAT IP ranges assigned to each namespace do not yet. Dan Deane is aware of this and is submitting the appropriate requests to address the problem. Once resolved, completion of the cluster config can proceed.

wmhutchison commented 3 years ago

Additional issues were discovered and resolved. One involved the Ansible playbooks that generate the ignition file for bootstrap, where the following documented step of manually removing specific generated manifest files was not being performed by the playbooks.

https://docs.openshift.com/container-platform/4.5/installing/installing_vsphere/installing-vsphere.html#installation-user-infra-generate-k8s-manifest-ignition_installing-vsphere

$ rm -f openshift/99_openshift-cluster-api_master-machines-*.yaml openshift/99_openshift-cluster-api_worker-machineset-*.yaml

Not doing this did not show up front as anything bad in top-level cluster inspection or in AlertManager once configured. The symptom appeared during execution of the post-install config playbooks when the master MachineConfigPool was updated: the update took far longer than it should have. That prompted a deeper dive into the output of a must-gather data dump, where it was discovered that the cluster kept attempting to build new master nodes when it shouldn't. We found three master Machine objects and a single worker MachineSet object, deleted them manually, and then all was fine.
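The missing cleanup step is easy to fold into the playbook flow. As a minimal local sketch of exactly which files the documented globs delete (using dummy files in a scratch directory, since the real manifests only exist at install time):

```shell
# Sketch: reproduce the manifest-removal step from the install docs
# against dummy files, to show which globs are deleted.
mkdir -p openshift
touch openshift/99_openshift-cluster-api_master-machines-0.yaml \
      openshift/99_openshift-cluster-api_master-machines-1.yaml \
      openshift/99_openshift-cluster-api_master-machines-2.yaml \
      openshift/99_openshift-cluster-api_worker-machineset-0.yaml \
      openshift/99_openshift-other-manifest.yaml

# The step the playbooks were missing (same command as the docs):
rm -f openshift/99_openshift-cluster-api_master-machines-*.yaml \
      openshift/99_openshift-cluster-api_worker-machineset-*.yaml

# Only the unrelated manifest should remain.
ls openshift/
```

Running this against the dummy set leaves only `99_openshift-other-manifest.yaml`, matching the three master Machine objects and one worker MachineSet object the cluster later had to have deleted by hand.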

wmhutchison commented 3 years ago

There currently exist docs meant for CCM to configure OAuth on a new cluster.

https://github.com/bcgov-c/advsol-docs/blob/master/OCP4/CCM/OAuth.md

Since CCM at the end of the day is just syncing/maintaining Kubernetes objects, we should be able to set up OAuth by creating suitable Kubernetes objects plus Ansible tasks to install them. Will review CCM to determine how that can best be done, since we'll be using ansible-vault instead of Sealed Secrets to handle the sensitive data.
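As a minimal sketch of what "OAuth as plain Kubernetes objects" could look like, assuming an HTPasswd identity provider with the htpasswd file rendered from ansible-vault at deploy time (secret and provider names here are illustrative, not from the CCM docs):

```shell
# Sketch (names illustrative): OAuth as plain Kubernetes objects.
# users.htpasswd would be rendered from ansible-vault, not committed.

# 1. Secret holding the htpasswd data, in openshift-config, which is
#    where the OAuth operator reads identity-provider secrets from.
oc create secret generic htpasswd-secret \
  --from-file=htpasswd=users.htpasswd \
  -n openshift-config

# 2. Cluster-scoped OAuth resource referencing that secret.
oc apply -f - <<'EOF'
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: local-htpasswd
    mappingMethod: claim
    type: HTPasswd
    htpasswd:
      fileData:
        name: htpasswd-secret
EOF
```

Both objects are declarative, so an Ansible task using `k8s`/`oc apply` with vault-decrypted content would maintain them the same way CCM would.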

wmhutchison commented 3 years ago

Dan has filed a priority firewall request to resolve the last network access requirement, which involves the SNAT IPs assigned to namespaces: traffic with a SNAT IP as the source cannot reach out to the Internet, while node IPs can. Once this is resolved, the remaining blocked work can resume.

wmhutchison commented 3 years ago

Firewall request was filed and fulfilled, but network access is still not working. Opened incident ticket INC0052062 for Network to investigate and resolve. Once resolved, that should unlock all remaining blockers for finishing KLAB2.
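Because pod traffic egresses via the namespace's SNAT IP while host traffic uses the node IP, a quick way to re-test after each network change is an HTTP probe from a throwaway pod. A hedged sketch (project name, image, and target URL are all illustrative):

```shell
# Sketch: verify namespace (SNAT) egress, as opposed to node egress.
# Traffic from a pod leaves via the namespace's SNAT IP, so a one-shot
# pod exercises exactly the path that the firewall change must open.
oc new-project snat-egress-test

# --rm with --restart=Never runs a single pod and cleans it up afterwards.
oc run egress-check --rm -it --restart=Never \
  --image=registry.access.redhat.com/ubi8/ubi-minimal -- \
  curl -sS --max-time 10 -o /dev/null -w '%{http_code}\n' https://mirror.openshift.com

oc delete project snat-egress-test
```

A `200` (or any HTTP status at all) means the SNAT path reaches the Internet; a curl timeout reproduces the blocked state described above.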

wmhutchison commented 3 years ago

Net result of Network's due diligence is that the outside router is not advertising the CIDR in question. An emergency change request was entered to rectify this and should be applied in less than two business days, at which time we will revisit and determine whether we can remove this item from Blocked.

wmhutchison commented 3 years ago

No longer blocked, can proceed with completion. Will continue into Sprint 27.

wmhutchison commented 3 years ago

Closing off this ticket. Some issues still remain, but stakeholders made the decision to release this cluster into production use while those issues continue to be worked on by other stakeholders.

The major pressing issue is that Kibana does not work at all: the NSX-T Load Balancer consistently returns a 502 Bad Gateway error.
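A 502 from a load balancer generally means the backend pool is unreachable or unhealthy rather than a Kibana application error, so the first checks are whether the pods, route, and service endpoints line up. A hedged sketch for whoever picks this up (resource names assume the default cluster-logging install in `openshift-logging`):

```shell
# Sketch: first checks for the Kibana 502 (names assume the default
# cluster-logging stack in the openshift-logging namespace).
oc get pods -n openshift-logging -l component=kibana   # pods Running and Ready?
oc get route kibana -n openshift-logging               # route host and TLS settings
oc get endpoints kibana -n openshift-logging           # does the service have endpoints?

# Probe the route from outside; -k tolerates the absent purchased certs
# noted earlier in this ticket.
curl -sk -o /dev/null -w '%{http_code}\n' \
  "https://$(oc get route kibana -n openshift-logging -o jsonpath='{.spec.host}')"
```

An empty endpoints list would point at the service/selector side; healthy endpoints with a persistent 502 would point back at the NSX-T Load Balancer's pool or health-check configuration.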