BCDevOps / OpenShift4-RollOut

This is the primary board for all activities related to the roll out of OpenShift 4
Apache License 2.0

KLAB2 OCP4 - Complete configuration of KLAB2 cluster #495

Closed wmhutchison closed 3 years ago

wmhutchison commented 3 years ago

Describe the issue Now that the KLAB2 cluster build has matured to the point that the involved stakeholders no longer feel the need to rebuild it, work will proceed to complete the overall configuration so that the cluster is viable for the users involved in testing it with respect to NSX-T.

Which Sprint Goal is this issue related to?

Reason(s) for Being Blocked

Definition of done Checklist (where applicable)

wmhutchison commented 3 years ago

Thus far, the playbook changes will require the following tweaks.

  1. Do not run any portion of the playbooks that installs purchased certificates for the API/APP VIP URLs. These are not required for the OpenShift services to function, but end users will need to be aware of the missing certificates and plan accordingly. This is not hard to undo if a decision is made later to revisit it.

  2. Do not run any portion of the playbooks that installs Trident/NetApp support, and adjust all PVC locations so they do not use storageclass definitions that no longer apply. The current cluster-install method supports ESXi storage natively as a PVC storageclass; Trident will be revisited at a later date when we re-attempt building an NSX-T-aware cluster with multiple NICs.
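As a hedged sketch of verifying item 2 (the exact storageclass names depend on the cluster; `thin` is the usual name for the vSphere in-tree provisioner's class and is an assumption here):

```shell
# List available storage classes; on a vSphere cluster the in-tree
# provisioner typically exposes a class named "thin" (assumption).
oc get storageclass

# Mark the vSphere class as the default so PVCs that omit a
# storageClassName no longer try to bind against the removed Trident classes.
oc patch storageclass thin \
  -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

# Confirm no PVCs still reference a Trident/NetApp storage class.
oc get pvc --all-namespaces -o wide
```

These are read-mostly checks plus one annotation patch, so they are safe to re-run while iterating on the playbooks.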

wmhutchison commented 3 years ago

Not quite finished - will spill into Sprint 26 by a few business days.

wmhutchison commented 3 years ago

Blocking for now. While the node IP addresses have access to vSphere and the outside Internet, the SNAT IP ranges assigned to each namespace do not yet. Dan Deane is aware of this and is submitting the appropriate requests to address the problem. Once resolved, completion of the cluster config can proceed.

wmhutchison commented 3 years ago

Additional issues were discovered and resolved. One involved the Ansible playbooks that generate the ignition file for bootstrap, where the following documented step of manually removing specific generated manifest files was not being performed by the playbooks.

https://docs.openshift.com/container-platform/4.5/installing/installing_vsphere/installing-vsphere.html#installation-user-infra-generate-k8s-manifest-ignition_installing-vsphere

$ rm -f openshift/99_openshift-cluster-api_master-machines-*.yaml openshift/99_openshift-cluster-api_worker-machineset-*.yaml

Not doing this did not show up front as anything bad in top-level cluster inspection or in AlertManager once configured. The symptom appeared during execution of the post-install config playbooks when the master MachineConfigPool was updated: the update took far longer than it should have. That prompted a deeper dive into the output of a must-gather data dump, where it was discovered that the cluster kept attempting to build new master nodes when it shouldn't. We found three master Machine objects and a single worker MachineSet object, deleted them manually, and then all was fine.
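The missing cleanup step is easy to fold into the playbook flow. As a minimal local sketch of exactly which files the documented globs delete (using dummy files in a scratch directory, since the real manifests only exist at install time):

```shell
# Sketch: reproduce the manifest-removal step from the install docs
# against dummy files, to show which globs are deleted.
mkdir -p openshift
touch openshift/99_openshift-cluster-api_master-machines-0.yaml \
      openshift/99_openshift-cluster-api_master-machines-1.yaml \
      openshift/99_openshift-cluster-api_master-machines-2.yaml \
      openshift/99_openshift-cluster-api_worker-machineset-0.yaml \
      openshift/99_openshift-other-manifest.yaml

# The step the playbooks were missing (same command as the docs):
rm -f openshift/99_openshift-cluster-api_master-machines-*.yaml \
      openshift/99_openshift-cluster-api_worker-machineset-*.yaml

# Only the unrelated manifest should remain.
ls openshift/
```

Running this against the dummy set leaves only `99_openshift-other-manifest.yaml`, matching the three master Machine objects and one worker MachineSet object the cluster later had to have deleted by hand.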

wmhutchison commented 3 years ago

There currently exist docs meant for CCM to configure OAuth on a new cluster.

https://github.com/bcgov-c/advsol-docs/blob/master/OCP4/CCM/OAuth.md

Since CCM at the end of the day is just syncing/maintaining Kubernetes objects, we should be able to set up OAuth by creating suitable Kubernetes objects plus Ansible tasks to install them. Will review CCM to determine how that can best be done, since we'll be using ansible-vault instead of Sealed Secrets to handle the sensitive data.
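As a minimal sketch of what "OAuth as plain Kubernetes objects" could look like, assuming an HTPasswd identity provider with the htpasswd file rendered from ansible-vault at deploy time (secret and provider names here are illustrative, not from the CCM docs):

```shell
# Sketch (names illustrative): OAuth as plain Kubernetes objects.
# users.htpasswd would be rendered from ansible-vault, not committed.

# 1. Secret holding the htpasswd data, in openshift-config, which is
#    where the OAuth operator reads identity-provider secrets from.
oc create secret generic htpasswd-secret \
  --from-file=htpasswd=users.htpasswd \
  -n openshift-config

# 2. Cluster-scoped OAuth resource referencing that secret.
oc apply -f - <<'EOF'
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: local-htpasswd
    mappingMethod: claim
    type: HTPasswd
    htpasswd:
      fileData:
        name: htpasswd-secret
EOF
```

Both objects are declarative, so an Ansible task using `k8s`/`oc apply` with vault-decrypted content would maintain them the same way CCM would.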

wmhutchison commented 3 years ago

Dan has filed a priority firewall request to resolve the last network access requirement, which involves the SNAT IPs assigned to namespaces: traffic with a SNAT IP as the source cannot reach out to the Internet, while node IPs can. Once this is resolved, the remaining blocked work can resume.

wmhutchison commented 3 years ago

Firewall request was filed and fulfilled, but network access is still not working. Opened incident ticket INC0052062 for Network to investigate and resolve. Once resolved, that should unlock all remaining blockers for finishing KLAB2.
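Because pod traffic egresses via the namespace's SNAT IP while host traffic uses the node IP, a quick way to re-test after each network change is an HTTP probe from a throwaway pod. A hedged sketch (project name, image, and target URL are all illustrative):

```shell
# Sketch: verify namespace (SNAT) egress, as opposed to node egress.
# Traffic from a pod leaves via the namespace's SNAT IP, so a one-shot
# pod exercises exactly the path that the firewall change must open.
oc new-project snat-egress-test

# --rm with --restart=Never runs a single pod and cleans it up afterwards.
oc run egress-check --rm -it --restart=Never \
  --image=registry.access.redhat.com/ubi8/ubi-minimal -- \
  curl -sS --max-time 10 -o /dev/null -w '%{http_code}\n' https://mirror.openshift.com

oc delete project snat-egress-test
```

A `200` (or any HTTP status at all) means the SNAT path reaches the Internet; a curl timeout reproduces the blocked state described above.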

wmhutchison commented 3 years ago

Net result of Network's due diligence is that the outside router is not advertising the CIDR in question. An emergency change request was entered to rectify this and should be applied in less than two business days, at which time we will revisit and determine whether we can remove this item from Blocked.

wmhutchison commented 3 years ago

No longer blocked, can proceed with completion. Will continue into Sprint 27.

wmhutchison commented 3 years ago

Closing off this ticket. Some issues still remain, but stakeholders made the decision to release this cluster into production use while those issues continue to be worked on by other stakeholders.

The major pressing issue is that Kibana does not work at all: the NSX-T Load Balancer consistently returns a 502 Bad Gateway error.
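A 502 from a load balancer generally means the backend pool is unreachable or unhealthy rather than a Kibana application error, so the first checks are whether the pods, route, and service endpoints line up. A hedged sketch for whoever picks this up (resource names assume the default cluster-logging install in `openshift-logging`):

```shell
# Sketch: first checks for the Kibana 502 (names assume the default
# cluster-logging stack in the openshift-logging namespace).
oc get pods -n openshift-logging -l component=kibana   # pods Running and Ready?
oc get route kibana -n openshift-logging               # route host and TLS settings
oc get endpoints kibana -n openshift-logging           # does the service have endpoints?

# Probe the route from outside; -k tolerates the absent purchased certs
# noted earlier in this ticket.
curl -sk -o /dev/null -w '%{http_code}\n' \
  "https://$(oc get route kibana -n openshift-logging -o jsonpath='{.spec.host}')"
```

An empty endpoints list would point at the service/selector side; healthy endpoints with a persistent 502 would point back at the NSX-T Load Balancer's pool or health-check configuration.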