dwmkerr / terraform-aws-openshift

Create infrastructure with Terraform and AWS, install OpenShift. Party!
http://www.dwmkerr.com/get-up-and-running-with-openshift-on-aws
MIT License
170 stars · 174 forks

Support for OKD 3.10, first run fails, second run works #64

Closed arashkaffamanesh closed 6 years ago

arashkaffamanesh commented 6 years ago

Some minor changes need to be made to support OKD 3.10; the main changes in inventory.template.cfg are:


openshift_release=v3.10

# Changed for OpenShift 3.10 (filename not needed)
# https://bugzilla.redhat.com/show_bug.cgi?id=1565447

# openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/origin/master/htpasswd'}]

openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]

# Define node groups
openshift_node_groups=[{'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true']}, {'name': 'node-config-infra', 'labels': ['node-role.kubernetes.io/infra=true']}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true']}]

# host group for nodes, includes region info
[nodes]
${master_hostname} openshift_hostname=${master_hostname} openshift_node_group_name='node-config-master' openshift_schedulable=true
${node1_hostname} openshift_hostname=${node1_hostname} openshift_node_group_name='node-config-compute'
${node2_hostname} openshift_hostname=${node2_hostname} openshift_node_group_name='node-config-compute'
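Before kicking off openshift-ansible, it can be worth sanity-checking that the rendered inventory actually contains the new 3.10 node groups. The sketch below is not part of the repo; the heredoc merely stands in for the file Terraform renders from inventory.template.cfg, and the path is arbitrary:

```shell
# Pre-flight check (sketch): confirm all three 3.10 node groups are defined
# in the rendered inventory before running the playbooks.
INVENTORY=/tmp/inventory.cfg

# Stand-in for the Terraform-rendered inventory file.
cat > "$INVENTORY" <<'EOF'
openshift_release=v3.10
openshift_node_groups=[{'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true']}, {'name': 'node-config-infra', 'labels': ['node-role.kubernetes.io/infra=true']}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true']}]
EOF

for group in node-config-master node-config-infra node-config-compute; do
  grep -q "'name': '$group'" "$INVENTORY" || { echo "missing group: $group"; exit 1; }
done
echo "all 3.10 node groups defined"
```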

and in install-from-bastion.sh set the branch to release-3.10:

git clone -b release-3.10 https://github.com/openshift/openshift-ansible

After the first run, the following failure summary is shown; a second run then succeeds:

TASK [openshift_storage_glusterfs : load kernel modules] ***********************
fatal: [ip-10-0-1-154.eu-central-1.compute.internal]: FAILED! => {"changed": false, "msg": "Unable to restart service systemd-modules-load.service: Job for systemd-modules-load.service failed because the control process exited with error code. See \"systemctl status systemd-modules-load.service\" and \"journalctl -xe\" for details.\n"}
fatal: [ip-10-0-1-29.eu-central-1.compute.internal]: FAILED! => {"changed": false, "msg": "Unable to restart service systemd-modules-load.service: Job for systemd-modules-load.service failed because the control process exited with error code. See \"systemctl status systemd-modules-load.service\" and \"journalctl -xe\" for details.\n"}
fatal: [ip-10-0-1-123.eu-central-1.compute.internal]: FAILED! => {"changed": false, "msg": "Unable to restart service systemd-modules-load.service: Job for systemd-modules-load.service failed because the control process exited with error code. See \"systemctl status systemd-modules-load.service\" and \"journalctl -xe\" for details.\n"}

RUNNING HANDLER [openshift_node : reload systemd units] ************************
    to retry, use: --limit @/home/ec2-user/openshift-ansible/playbooks/deploy_cluster.retry

PLAY RECAP *********************************************************************
ip-10-0-1-123.eu-central-1.compute.internal : ok=103  changed=51   unreachable=0    failed=1
ip-10-0-1-154.eu-central-1.compute.internal : ok=128  changed=51   unreachable=0    failed=1
ip-10-0-1-29.eu-central-1.compute.internal : ok=103  changed=51   unreachable=0    failed=1
localhost                  : ok=12   changed=0    unreachable=0    failed=0

INSTALLER STATUS ***************************************************************
Initialization              : Complete (0:00:17)
Health Check                : Complete (0:00:38)
Node Bootstrap Preparation  : In Progress (0:02:18)
    This phase can be restarted by running: playbooks/openshift-node/bootstrap.yml

Failure summary:

  1. Hosts:    ip-10-0-1-123.eu-central-1.compute.internal, ip-10-0-1-154.eu-central-1.compute.internal, ip-10-0-1-29.eu-central-1.compute.internal
     Play:     Configure nodes
     Task:     load kernel modules
     Message:  Unable to restart service systemd-modules-load.service: Job for systemd-modules-load.service failed because the control process exited with error code. See "systemctl status systemd-modules-load.service" and "journalctl -xe" for details.

make: *** [openshift] Error 2
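The failing task is the glusterfs role trying to load kernel modules via systemd-modules-load. A quick way to see whether the modules are even present in the node's kernel is to probe them with modinfo before running the playbook. This is a sketch: the module names are the device-mapper ones the glusterfs role typically needs (an assumption; adjust to whatever `journalctl -u systemd-modules-load.service` actually reports on your nodes), and the output path is arbitrary:

```shell
# Sketch: check availability of the device-mapper modules (assumed names)
# on a node, writing results to a file so they can be collected per host.
OUT=/tmp/modcheck.txt
: > "$OUT"
for mod in dm_thin_pool dm_snapshot dm_mirror; do
  if modinfo "$mod" >/dev/null 2>&1; then
    echo "$mod available" >> "$OUT"
  else
    echo "$mod MISSING" >> "$OUT"
  fi
done
cat "$OUT"
```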

Can anyone confirm this behaviour on their side?

It seems this was already reported in https://github.com/dwmkerr/terraform-aws-openshift/issues/40.

arashkaffamanesh commented 6 years ago

OK, it seems there are other problems too: docker-registry-1-deploy and router-1-deploy stay Pending:

[ec2-user@ip-10-0-1-154 ~]$ oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-1-deploy   0/1       Pending   0          13m
registry-console-1-vq7w8   1/1       Running   1          13m
router-1-deploy            0/1       Pending   0          14m
arashkaffamanesh commented 6 years ago

The reason the docker registry and router are Pending is missing infra nodes: https://docs.openshift.com/container-platform/3.10/install/configuring_inventory_file.html

If there is not a node in the [nodes] section that matches the selector settings,
the default router and registry will be deployed as failed with Pending status.
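One way to satisfy the infra selector without adding nodes is to give the master the infra role as well. A sketch of the inventory change, using the node-group syntax above (the combined group name `node-config-master-infra` follows the stock openshift-ansible 3.10 naming, an assumption worth verifying against your checkout):

```ini
# Sketch (untested): label the master as infra too, so the default
# router/registry node selector (node-role.kubernetes.io/infra=true) matches.
openshift_node_groups=[{'name': 'node-config-master-infra', 'labels': ['node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/infra=true']}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true']}]

[nodes]
${master_hostname} openshift_hostname=${master_hostname} openshift_node_group_name='node-config-master-infra' openshift_schedulable=true
```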
dwmkerr commented 6 years ago

Hey @arashkaffamanesh - I got it working:

(screenshot attached in the original issue)

The key was to update the AMIs to RHEL 7.5 (apparently 7.4 upwards will do). This fixes the kernel module issue. I also updated the code to tag the master node as an infra node (thanks for your tips on this one!).