equinix / terraform-equinix-metal-openshift-on-baremetal

OpenShift 4.9 Installer for Equinix Metal
https://registry.terraform.io/modules/equinix/openshift-on-baremetal/metal/latest
Apache License 2.0

Support OpenShift 4.9 #8

Closed liveaverage closed 2 years ago

liveaverage commented 2 years ago

@displague just let me know if I need to provide any extra detail around changes... figured I'd update while native IPI support progresses :)

displague commented 2 years ago

I've made a change to keep Cloudflare as an optional DNS provider. The latest Cloudflare provider added api_key validation, which was blocking other providers from being used.

displague commented 2 years ago

@liveaverage I ran into the following:

│ Error: local-exec provisioner error
│
│   with null_resource.get_kubeconfig,
│   on main.tf line 162, in resource "null_resource" "get_kubeconfig":
│  162:   provisioner "local-exec" {
│
│ Error running command 'mkdir -p ./auth; scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /Users/marques/.ssh/id_rsa_mos-uqjlv root@145.40.102.129:/tmp/artifacts/install/auth/* ./auth/':
│ exit status 1. Output: Warning: Permanently added '145.40.102.129' (ECDSA) to the list of known hosts.
│ scp: /tmp/artifacts/install/auth/*: No such file or directory
│
╵
╷
│ Error: remote-exec provisioner error
│
│   with module.openshift_install.null_resource.check_port,
│   on modules/install/main.tf line 95, in resource "null_resource" "check_port":
│   95:   provisioner "remote-exec" {
│
│ error executing "/tmp/terraform_523679698.sh": Process exited with status 1

Are you getting a different result?

displague commented 2 years ago

Looks like my problem may be with the pull secret which is dated.

less /tmp/artifacts/install/.openshift_install.log

time="2021-11-10T00:36:09-05:00" level=fatal msg="failed to fetch Master Machines: failed to load asset \"Install Config\": invalid \"install-config.yaml\" file: pullSecret: Invalid value: \"{ \\\"kind\\\": \\\"Error\\\", \\\"id\\\": \\\"401\\\", \\\"href\\\": \\\"/api/accounts_mgmt/v1/errors/401\\\", \\\"code\\\": \\\"ACCOUNTS-MGMT-401\\\", \\\"reason\\\": \\\"Bearer token is malformed\\\" }\": auths required"

liveaverage commented 2 years ago

Good catch on api_key validation. With regard to the pull secret: yes, the bearer token may need to be refreshed, though it's not a frequent requirement.
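
For anyone hitting the same "auths required" failure: the log above suggests that an API error body ended up where the pull secret JSON should be. A minimal Python sketch of a pre-flight check is below. The function name pull_secret_problem is hypothetical and not part of this module; the sample error body is copied from the log, and the only assumption about a valid pull secret is that it is a JSON object with a non-empty "auths" key.

```python
import json


def pull_secret_problem(raw):
    """Return a short description of what is wrong with a pull secret,
    or None if it looks usable. Hypothetical helper, not part of the module."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"not valid JSON: {exc}"
    if not isinstance(data, dict):
        return "top-level value is not a JSON object"
    auths = data.get("auths")
    if not isinstance(auths, dict) or not auths:
        # An API error body (like the 401 in the log) carries a "reason" field.
        if "reason" in data:
            return f"API error instead of a pull secret: {data['reason']}"
        return "missing or empty 'auths' object"
    return None


# Error body taken from the installer log in this thread.
error_body = (
    '{ "kind": "Error", "id": "401", '
    '"href": "/api/accounts_mgmt/v1/errors/401", '
    '"code": "ACCOUNTS-MGMT-401", "reason": "Bearer token is malformed" }'
)
print(pull_secret_problem(error_body))
```

Running a check like this before terraform apply would surface a stale token without waiting for the installer to fail on the bastion.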

displague commented 2 years ago

I had some trouble accessing these nodes after the CoreOS reboots.

liveaverage commented 2 years ago

Are you referring to accessing nodes via SSH or OOB console (SOS)? The latter is a problem that existed in the previous automation, too. Kernel args don't persist, so dropping into SOS requires intercepting boot and adding the appropriate console param back in.

displague commented 2 years ago

@liveaverage would this help? https://github.com/openshift/machine-config-operator/blob/8fa45661c50047de097db3ed7592e48910bfe401/docs/MachineConfiguration.md#kernelarguments

liveaverage commented 2 years ago

Let me retest with the day 1 kernel arg updates documented here: https://github.com/openshift/installer/blob/master/docs/user/customization.md#nodes-with-custom-kernel-arguments. Technically we're supposed to make the initial kargs "sticky", but that's still not happening, so I'll need to tweak the MachineConfigs after config generation and before install. It can be done as a day 2 activity as well, but it's easier for troubleshooting to do this on day 1!
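
As a rough illustration of the day 1 approach, here is a Python sketch that builds a MachineConfig manifest carrying extra kernel arguments for each node role. The apiVersion, kind, role label, and spec.kernelArguments field follow the installer customization doc linked above. The helper name kargs_machine_config, the manifest naming scheme, and the console value console=ttyS1,115200n8 are assumptions for illustration only; use whichever console parameter SOS actually needs on your hardware.

```python
import json


def kargs_machine_config(role, kernel_args):
    """Build a MachineConfig manifest (as a dict) that adds kernel arguments
    for one node role. Hypothetical helper, not part of the module."""
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfig",
        "metadata": {
            # Assumed naming scheme; any unique name should work.
            "name": f"99-{role}-kargs",
            "labels": {"machineconfiguration.openshift.io/role": role},
        },
        "spec": {"kernelArguments": list(kernel_args)},
    }


# Assumed serial console argument; adjust for the actual hardware.
console_args = ["console=ttyS1,115200n8"]

for role in ("master", "worker"):
    manifest = kargs_machine_config(role, console_args)
    # JSON is also valid YAML, so this output can be saved as a manifest file.
    print(json.dumps(manifest, indent=2))
```

The idea would be to save each manifest alongside the installer-generated manifests after openshift-install create manifests and before the ignition configs are created, so the console argument is applied from first boot rather than as a day 2 change.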