dwmkerr / terraform-aws-openshift

Create infrastructure with Terraform and AWS, install OpenShift. Party!
http://www.dwmkerr.com/get-up-and-running-with-openshift-on-aws
MIT License

Would like to get this working with OpenShift Origin 3.7 #27

Closed rberlind closed 6 years ago

rberlind commented 6 years ago

When I try to use release-3.7 or release-3.7.0 by specifying them in the git clone command inside install-from-bastion.sh, I end up getting an error at the end of the Ansible run:

TASK [template_service_broker : Reconcile with RBAC file] **
fatal: [master.openshift.local]: FAILED! => {"changed": true, "cmd": "oc process -f \"/tmp/tsb-ansible-keZijh/rbac-template.yaml\" | oc auth reconcile -f -", "delta": "0:00:00.285904", "end": "2017-11-29 12:45:42.125009", "failed": true, "rc": 1, "start": "2017-11-29 12:45:41.839105", "stderr": "Error: unknown shorthand flag: 'f' in -f\n\n\nUsage:\n oc auth [options]\n\nAvailable Commands:\n can-i Check whether an action is allowed\n\nUse \"oc --help\" for more information about a given command.\nUse \"oc options\" for a list of global command-line options (applies to all commands).", "stdout": "", "stdout_lines": []}
        to retry, use: --limit @/home/ec2-user/openshift-ansible/playbooks/byo/config.retry

This seems related to https://github.com/openshift/openshift-ansible/issues/6086.

I tried pinning to commit 56b529e (which someone on that ticket said fixed the problem) by running git checkout 56b529e after the git clone command, but I got the same error.
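
In other words, inside install-from-bastion.sh I had roughly this (a sketch; the clone target is the upstream openshift-ansible repo):

    git clone -b release-3.7 https://github.com/openshift/openshift-ansible
    cd openshift-ansible
    # pin to the commit that reportedly fixed the RBAC reconcile problem
    git checkout 56b529e
    cd ..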

Can anyone suggest a workaround to get this working with OpenShift Origin 3.7? The problem is not with Terraform itself, but with the openshift-ansible code.

rberlind commented 6 years ago

I should add that I don't know Ansible at all. I was unsure what the instruction at the end about retrying with --limit @/home/ec2-user/openshift-ansible/playbooks/byo/config.retry meant. Does it mean retrying the single oc process -f "/tmp/tsb-ansible-keZijh/rbac-template.yaml" | oc auth reconcile -f - command with the extra part added? Or does it mean retrying the entire make openshift command, or something else?

Also, the end of the Installer Status section has "This phase can be restarted by running: playbooks/byo/openshift-cluster/service-catalog.yml". Perhaps I should run ansible-playbook -i ./inventory.cfg ./openshift-ansible/playbooks/byo/openshift-cluster/service-catalog.yml?
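
If I'm reading those two hints right, the candidates would be something like this, run from the bastion (just my guess at the exact invocations, not verified):

    # retry the full playbook, limited to the hosts that failed last time
    ansible-playbook -i ./inventory.cfg ./openshift-ansible/playbooks/byo/config.yml \
        --limit @/home/ec2-user/openshift-ansible/playbooks/byo/config.retry

    # or restart only the service catalog phase
    ansible-playbook -i ./inventory.cfg ./openshift-ansible/playbooks/byo/openshift-cluster/service-catalog.yml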

rberlind commented 6 years ago

I tried adding openshift_repos_enable_testing=true to inventory.template.cfg file as suggested in response to https://github.com/openshift/openshift-ansible/issues/6086. That did get past the RBAC error, but I then saw:

FAILED - RETRYING: Verify that TSB is running (1 retries left).
fatal: [master.openshift.local]: FAILED! => {"attempts": 120, "changed": false, "cmd": ["curl", "-k", "https://apiserver.openshift-template-service-broker.svc/healthz"], "delta": "0:00:01.010827", "end": "2017-12-01 13:09:25.018259", "failed": true, "msg": "non-zero return code", "rc": 7, "start": "2017-12-01 13:09:24.007432", "stderr": " % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0curl: (7) Failed connect to apiserver.openshift-template-service-broker.svc:443; Connection refused", "stdout": "", "stdout_lines": []}

Several things occur to me: 1) the DNS name apiserver.openshift-template-service-broker.svc is not resolvable; 2) the port perhaps has to be 8443. Note that adding the DNS name to /etc/hosts and then using port 8443 from the master gave me back "ok".
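
For the record, the manual check that gave me back "ok" was roughly this, run on the master (the IP is a placeholder for wherever the broker actually runs; illustrative only):

    # work around the unresolvable service name, then probe the health endpoint on 8443
    echo "<broker-ip>  apiserver.openshift-template-service-broker.svc" | sudo tee -a /etc/hosts
    curl -k https://apiserver.openshift-template-service-broker.svc:8443/healthz
    # returned: ok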

Additionally, the advanced installation docs for OpenShift Origin indicate that one is supposed to set openshift_template_service_broker_namespaces if enabling the template service broker. See https://docs.openshift.org/latest/install_config/install/advanced_install.html#configuring-template-service-broker.

But I tried adding openshift_template_service_broker_namespaces=['openshift'] and still got the same error.

Now, I'm going to try disabling the service catalog and template service broker with:

openshift_enable_service_catalog=false
template_service_broker_install=false

I'm also going to explicitly set the ports with:

openshift_master_api_port=8443
openshift_master_console_port=8443

rberlind commented 6 years ago

That did not work either, but I now think I had added the new variables in the wrong part of the file, under [nodes] instead of under [OSEv3:vars]. I will retry.
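
For anyone following along, the relevant fragment of inventory.template.cfg should presumably look something like this (a sketch of just the vars section):

    [OSEv3:vars]
    openshift_enable_service_catalog=false
    template_service_broker_install=false
    openshift_master_api_port=8443
    openshift_master_console_port=8443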

rberlind commented 6 years ago

Unfortunately, even when I put the variables in the right place, the TSB could not be verified as running. However, the good news is that I was able to install OpenShift Origin 3.7 by disabling the service catalog and TSB with:

openshift_enable_service_catalog=false
template_service_broker_install=false

One other note for you: I think you should technically include "etcd" under [OSEv3:children] at the top of the inventory template.
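
In other words, something along these lines (the other groups are whatever the template already defines):

    [OSEv3:children]
    masters
    nodes
    etcd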

dwmkerr commented 6 years ago

Hmm OK cool I'll take a look at this, thanks for sharing @rberlind!

rberlind commented 6 years ago

No problem. Thanks for putting together this repo. It was very helpful to me.

ghost commented 6 years ago

Hi,

did anyone solve this issue so far? The service catalog is a valuable feature…

mtbvang commented 6 years ago

Hi,

Are you guys still having this problem? Release 3.7 worked for me.

rberlind commented 6 years ago

I have not used it recently and had worked around it. Roger

ghost commented 6 years ago

I'm on holiday at the moment and have no access to my company's git repository. But I've worked around this issue by making a small change to the playbook. Please ask me again in mid-March if you need more details.

dwmkerr commented 6 years ago

Looking at this as well at the moment, note to self:

https://docs.openshift.org/latest/install_config/configuring_aws.html#aws-cluster-labeling

May need to update the labelling logic introduced in #33

Also check this for notes on dynamic aws tag names (particularly the limitations for how we can manage this in terraform):

https://github.com/hashicorp/terraform/issues/14516#issuecomment-301630345

dwmkerr commented 6 years ago

Hi @yves-vogl @rberlind @mtbvang,

Can you let me know the changes you had to make to get this to work? At the moment, when I try to install 3.7, I always get this issue:

RUNNING HANDLER [openshift_master : restart master controllers] ****************
        to retry, use: --limit @/home/ec2-user/openshift-ansible/playbooks/byo/config.retry

PLAY RECAP *********************************************************************
ip-10-0-1-137.ec2.internal : ok=54   changed=8    unreachable=0    failed=0
ip-10-0-1-44.ec2.internal  : ok=294  changed=106  unreachable=0    failed=1
localhost                  : ok=11   changed=0    unreachable=0    failed=0

INSTALLER STATUS ***************************************************************
Initialization             : Complete
Health Check               : Complete
etcd Install               : Complete
Master Install             : In Progress
        This phase can be restarted by running: playbooks/byo/openshift-master/config.yml

Failure summary:

  1. Hosts:    ip-10-0-1-44.ec2.internal
     Play:     Configure masters
     Task:     restart master api
     Message:  Unable to restart service origin-master-api: Job for origin-master-api.service failed because the control process exited with error code. See "systemctl status origin-master-api.service" and "journalctl -xe" for details.

My current work-in-progress branch for this is here (I've opened a PR to make it easy to see the changes):

https://github.com/dwmkerr/terraform-aws-openshift/pull/43

The key changes so far are:

  1. Set openshift_clusterid=${cluster_id} in the playbook (see here)
  2. Set the new tags required for OC 3.7 (see here)

That's basically it. I've found a few issues upstream which seem to be potential causes.

I've attempted the following workarounds:

  1. Explicitly setting etcd_version=3.1.9
  2. Explicitly setting etcd_version=3.2.7
  3. Explicitly setting the SDN CIDR to one which will not overlap with the VPC CIDR (osm_cluster_network_cidr=11.0.0.0/16)
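
For reference, each of these went into the [OSEv3:vars] section of the inventory template, along these lines (a sketch; one variant at a time):

    [OSEv3:vars]
    etcd_version=3.2.7
    osm_cluster_network_cidr=11.0.0.0/16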

So far no luck. Any pointers would be super helpful!

rberlind commented 6 years ago

Well, I cheated by disabling some things. Specifically, I was able to install OpenShift Origin 3.7 by disabling the service catalog and TSB with:

openshift_enable_service_catalog=false
template_service_broker_install=false

I had also set openshift_repos_enable_testing=true in the inventory.template.cfg file.

sumitshatwara commented 6 years ago

I also faced the same issue around the service catalog. As mentioned by @rberlind, I disabled the parameters below in my host inventory file:

openshift_enable_service_catalog=false
template_service_broker_install=false

And the deployment of OpenShift Origin 3.7 was successful:

INSTALLER STATUS ***************************************************************
Initialization             : Complete
Health Check               : Complete
etcd Install               : Complete
Master Install             : Complete
Master Additional Install  : Complete
Node Install               : Complete
Hosted Install             : Complete

My question: is the service catalog a mandatory feature for an OpenShift environment? Use case: I want to test the FlexVolume driver of K8s only.

dwmkerr commented 6 years ago

@rberlind I've tried this just now but no joy! Any chance you can share your inventory so I can take a look?

@sumitshatwara You should be fine - the service catalog is an optional feature and you can test volumes without it, let me know how it goes!!

mtbvang commented 6 years ago

@dwmkerr

I just applied my playbook again for OpenShift 3.7, using the modified version that runs on CentOS instead of RHEL, and it worked. I did get it running on RHEL before making the changes to CentOS. I'm not hitting the issues that everyone else is, so I'm a bit confused. The code I'm working with is in the centos branch of my fork.

Lower down I list my development setup and the few changes that I made in commit https://github.com/mtbvang/terraform-aws-openshift/commit/8445f0867be372c7b6fafbad00e6588d50221ee7. Git did a few weird things with the image files and I ended up committing them again, maybe because I'm working in a Vagrant VM. Here's the output from the run, and the address of the cluster: https://54.93.200.118.xip.io:8443

The only other thing I can see is that I made some SSH key changes to name the key terraform-aws-openshift. I work in a Vagrant VM, and this key is copied into the guest VM where I run Terraform from. Below are more details about my Vagrant setup; these are the only differences I can see in what I've done:

PLAY RECAP *********************************************************************
localhost                  : ok=12   changed=0    unreachable=0    failed=0   
master.openshift.local     : ok=644  changed=265  unreachable=0    failed=0   
node1.openshift.local      : ok=191  changed=65   unreachable=0    failed=0   
node2.openshift.local      : ok=191  changed=65   unreachable=0    failed=0   

INSTALLER STATUS ***************************************************************
Initialization             : Complete
Health Check               : Complete
etcd Install               : Complete
Master Install             : Complete
Master Additional Install  : Complete
Node Install               : Complete
Hosted Install             : Complete
Service Catalog Install    : Complete

# Now the installer is done, run the postinstall steps on each host.
cat ./scripts/postinstall-master.sh | ssh -A ec2-user@$(terraform output bastion-public_dns) ssh centos@master.openshift.local
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added the RSA host key for IP address '10.0.1.185' to the list of known hosts.
Adding password for user admin
cat ./scripts/postinstall-node.sh | ssh -A ec2-user@$(terraform output bastion-public_dns) ssh centos@node1.openshift.local
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added the RSA host key for IP address '10.0.1.155' to the list of known hosts.
cat ./scripts/postinstall-node.sh | ssh -A ec2-user@$(terraform output bastion-public_dns) ssh centos@node2.openshift.local
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added the RSA host key for IP address '10.0.1.150' to the list of known hosts.

oc version on the master:

oc v3.7.0+7ed6862
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://master.openshift.local:8443
openshift v3.7.0+7ed6862
kubernetes v1.7.6+a08f5eeb62

In my fork, my current setup is in the centos branch. I've added a vagrant folder containing a Vagrantfile that spins up a CentOS development VM with the AWS CLI version 1.14.36, Terraform 0.11.3, and the OpenShift client v3.7.0. You'll need to build the Vagrant box using Packer. There's a packer task in the vagrant/build.gradle file that can be run with the Gradle wrapper from your host:

./gradlew packer

Once Packer finishes:

vagrant ssh
cd /vagrant
terraform init
terraform get
terraform plan
terraform apply
make openshift

I hope this helps in figuring this out, or ping me if you need any more information.

piyushkv1 commented 6 years ago

I'm also seeing the issue with the 3.7 version:

TASK [template_service_broker : Verify that TSB is running] **
FAILED - RETRYING: Verify that TSB is running (120 retries left).
FAILED - RETRYING: Verify that TSB is running (2 retries left).
FAILED - RETRYING: Verify that TSB is running (1 retries left).
fatal: [openshift.node.1]: FAILED! => {"attempts": 120, "changed": false, "cmd": ["curl", "-k", "https://apiserver.openshift-template-service-broker.svc/healthz"], "delta": "0:00:01.036646", "end": "2018-02-25 14:29:54.090730", "msg": "non-zero return code", "rc": 7, "start": "2018-02-25 14:29:53.054084", "stderr": " % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0curl: (7) Failed connect to apiserver.openshift-template-service-broker.svc:443; Connection refused", "stdout": "", "stdout_lines": []}

rberlind commented 6 years ago

Hi @dwmkerr, I've attached my inventory.template.cfg file and my install-from-bastion.sh script.

I did this back in December, but I believe the key changes I made were the following:

inventory.template.cfg:

#Enable use of testing repos so that 3.7 will be used
#Note that this was before 3.7 was released, so it might not be needed anymore.
openshift_repos_enable_testing=true

openshift_enable_service_catalog=false
template_service_broker_install=false

install-from-bastion.sh:

# I was cloning from my own fork of openshift-ansible, but I think you should be able to use
# https://github.com/openshift/openshift-ansible
git clone -b release-3.7 https://github.com/rberlind/openshift-ansible

inventory-and-script.zip

junsionzhang commented 6 years ago

@rberlind Hi, have you tried the latest version now? For me it's the same version, same problem.

rberlind commented 6 years ago

I have not, @junsionzhang. I have started working with this again, but have been creating a quite different version in which I trigger ansible-playbook and all other installation steps with Terraform remote-exec provisioners. I have not put any of this on GitHub yet.

stanvarlamov commented 6 years ago

I suggest bypassing 3.7 for anyone who is not tied to a particular version and just wants to build a working cluster on a supported version of OpenShift.

3.7 seems to have a number of packaging issues, and the 3.9 release created additional problems for the 3.7 install.

Things to change in the 3.7 branch here so that it can be used for the 3.9 install (amazingly, just a few):

  1. Fix 00-tags.tf: remove the obsolete "KubernetesCluster", "${var.cluster_id}" entry, or change it to "KubernetesCluster", "${var.cluster_name}" (see the sketch after this list).

  2. Update inventory.template.cfg:

    openshift_deployment_type=origin
    openshift_release=v3.9
  3. In install-from-bastion.sh, change the clone version to 3.9 and replace the ansible-playbook call with these two:

    ANSIBLE_HOST_KEY_CHECKING=False /usr/bin/ansible-playbook -i ./inventory.cfg ./openshift-ansible/playbooks/prerequisites.yml
    ANSIBLE_HOST_KEY_CHECKING=False /usr/bin/ansible-playbook -i ./inventory.cfg ./openshift-ansible/playbooks/deploy_cluster.yml
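
A minimal sketch of the 00-tags.tf change from step 1; only the tag entry changes, the surrounding Terraform stays as the repo has it:

    # before (3.7 branch)
    "KubernetesCluster", "${var.cluster_id}",

    # after (either delete the entry entirely, or tag with the cluster name)
    "KubernetesCluster", "${var.cluster_name}",
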
rberlind commented 6 years ago

Interesting that you mention the need to add "openshift_release=v3.9", @stanvarlamov. I just hit this the day before yesterday when I started getting errors about short_version 3.9 not being valid when using the 3.7 version of openshift-ansible. To keep using that version, I had to set openshift_release=v3.7. I also noticed that the documentation suggested using "openshift_deployment_type" instead of "deployment_type" and also made that change.

By the way, another change that should be made is that the provisioning of the aws_instance resources should use "vpc_security_group_ids" instead of "security_groups", so that subsequent applies will not trigger destroy/create against the EC2 instances. See https://www.terraform.io/docs/providers/aws/r/instance.html#security_groups.
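
A rough sketch of what that looks like (resource and attribute values here are illustrative, not the repo's actual ones):

    resource "aws_instance" "master" {
      ami           = "${var.ami_id}"
      instance_type = "${var.master_instance_type}"
      subnet_id     = "${aws_subnet.public.id}"

      # reference security groups by ID; using the name-based "security_groups"
      # attribute forces destroy/create on subsequent applies in a VPC
      vpc_security_group_ids = ["${aws_security_group.openshift.id}"]
    }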

stanvarlamov commented 6 years ago

I think that, in the spirit of this repo being an excellent source of procedures for getting a working OpenShift version up on AWS with a basic configuration, we should move the master branch to 3.9, as it was released recently. I find the 3.9 install process much improved compared to 3.6 and 3.7, and overall the 3.9 features and look and feel more appealing. That is, basically, my suggestion.

Spengreb commented 6 years ago

@stanvarlamov I think you're right on this one.

A few months back I forked this project to make a multi-master setup using CentOS instead of RHEL. I got all that working, then noticed you guys added dynamic PV support, so I merged your changes with mine; I should mention that before this I was using 3.7 just fine with few issues. After the merge a lot went wrong for me, which I discovered you guys had already been through, and some seemingly random stuff was going wrong too, but I ended up here in this thread. I switched to the 3.9 release instead and got a lot further.

I managed to get this project working with CentOS, 2 master nodes, 3 compute nodes (though not in an ideal setup due to budget constraints), and OpenShift 3.9 with metrics and dynamic PV support.

I'm still testing to make sure everything works OK, but I can do a PR if you want to take a look, though I may have changed too much, as it's very specific to my needs.

stanvarlamov commented 6 years ago

@Spengreb OpenShift 3.9 appears to be much faster than 3.6-7 and more stable; PV resize also seems to be an important feature now available. CentOS, metrics, and logging work as a one-click install, which is pretty amazing considering the amount of time it took to work through the 3.6 Ansible bugs and inconsistencies. Dynamic PV based on EBS is a sore point, though: it kind of assumes you are single-AZ, which defeats the purpose of HA in a multi-AZ setup, so I don't consider it a production-ready feature at this point. But overall, I think 3.9 is really a game changer. Highly recommended.

dwmkerr commented 6 years ago

@stanvarlamov @Spengreb Yep, agreed! Will start on the 3.9 setup shortly. 3.7 has been a total pain to get working, so I'm all in favour of skipping it for now. If for some reason someone really needs 3.7, we can always go back and try again when more of the Ansible issues are sorted, but for now 3.9 sounds like a sensible option!

dwmkerr commented 6 years ago

I've updated master to install 3.9, I've also raised:

https://github.com/dwmkerr/terraform-aws-openshift/issues/48

To track the issue @rberlind mentioned about the security_groups setting.

If this works then let me know guys and I'll close the issue!

rberlind commented 6 years ago

You can close.