IBM-Cloud / terraform-ibm-openshift

Provision IBM Cloud infrastructure with Terraform, and install Red Hat® OpenShift Container Platform 3.

Error when trying to provision OpenShift in IBM Cloud #8

Open trbryant2 opened 5 years ago

trbryant2 commented 5 years ago

I've opened this issue as directed by @Nadine2016. Please refer to the issue I reported in another GitHub repository: https://github.com/IBM-Bluemix-Docs/terraform/issues/5#issuecomment-460704601

I'm trying to provision a new OpenShift environment on IBM Cloud infrastructure using the tutorial that may be found at https://github.com/IBM-Bluemix-Docs/terraform/blob/master/tutorials/install_redhat_openshift.md

I receive the following errors when I run the make infrastructure command:

Error: Error running plan: 6 error(s) occurred:

module.bastion.output.bastion_domain: Resource 'ibm_compute_vm_instance.bastion' not found for variable 'ibm_compute_vm_instance.bastion.domain'
module.bastion.output.bastion_hostname: Resource 'ibm_compute_vm_instance.bastion' not found for variable 'ibm_compute_vm_instance.bastion.hostname'
module.infranode.output.infra_subnet_id: Resource 'ibm_compute_vm_instance.infranode' not found for variable 'ibm_compute_vm_instance.infranode.0.private_subnet_id'
module.appnode.output.app_subnet_id: Resource 'ibm_compute_vm_instance.appnode' not found for variable 'ibm_compute_vm_instance.appnode.0.private_subnet_id'
module.bastion.output.bastion_private_ip: Resource 'ibm_compute_vm_instance.bastion' not found for variable 'ibm_compute_vm_instance.bastion.ipv4_address_private'
module.bastion.output.bastion_ip_address: Resource 'ibm_compute_vm_instance.bastion' not found for variable 'ibm_compute_vm_instance.bastion.ipv4_address'

make: *** [makefile:3: infrastructure] Error 1

Attached is a screen shot of the errors. lesson2step2

A copy of the variables.tf file is included as well.

variables.tf.txt

@Nadine2016 had asked for the console messages that were displayed when the error occurred. Here is a copy of the messages: TutorialErrors.txt

@Nadine2016 had also asked for the cluster id. I'm not sure what she is referencing. The IBM Cloud servers that I'm trying to provision?

Nadine2016 commented 5 years ago

@trbryant2 no worries about the cluster ID. I was assuming you wanted to create a Kubernetes cluster. :-) thanks for posting everything here. The team will look into it.

hkantare commented 5 years ago

@trbryant2 Can you please enable tracing with export TF_LOG=debug and provide us the detailed log? If you have any state files (terraform.tfstate or terraform.tfstate.backup), can you remove them and retry the make infrastructure command with tracing enabled?
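
The request above could be scripted roughly as follows. This is only a sketch: TF_LOG and TF_LOG_PATH are standard Terraform environment variables, but the function name, the state-backup directory, and the log file name are hypothetical, and the state files are moved aside rather than deleted so they can be restored if needed.

```shell
#!/usr/bin/env bash
# Sketch: set aside stale Terraform state and turn on debug logging.
# prepare_debug_run, state-backup and terraform-debug.log are hypothetical names.

prepare_debug_run() {
  local workdir="${1:-.}"
  local backup_dir="$workdir/state-backup"

  mkdir -p "$backup_dir"
  # Move (rather than delete) any existing state files so they can be restored.
  for f in terraform.tfstate terraform.tfstate.backup; do
    if [ -f "$workdir/$f" ]; then
      mv "$workdir/$f" "$backup_dir/$f"
    fi
  done

  # Write Terraform's debug output to a file in addition to the console output.
  export TF_LOG=DEBUG
  export TF_LOG_PATH="$workdir/terraform-debug.log"
}

# Usage (from the repository checkout):
#   prepare_debug_run .
#   make infrastructure
```

Note that setting state aside makes Terraform forget any servers it already created, so this only makes sense when nothing has been provisioned yet or the existing resources have been cleaned up separately.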

trbryant2 commented 5 years ago

Attached are files that contain what was written to the screen as well as what was captured when I used the make infrastructure | tee /tmp/make.log command. make.log.txt

make_infrastructure_stdout.txt

hkantare commented 5 years ago

The issue is here:

2019/02/06 03:44:17 [TRACE] vertex 'root.templates.module.templates.data.template_file.infra_host_file_template': walking
2019/02/06 03:44:17 [ERROR] root.masternode: eval: terraform.EvalValidateResource, err: Warnings: []. Errors: [public_vlan_id: cannot parse '' as int: strconv.ParseInt: parsing "https://cloud.ibm.com/classic/network/vlans/2340305": invalid syntax private_vlan_id: cannot parse '' as int: strconv.ParseInt: parsing "https://cloud.ibm.com/classic/network/vlans/2340309": invalid syntax]
2019/02/06 03:44:17 [ERROR] root.masternode: eval: terraform.EvalSequence, err: Warnings: []. Errors: [public_vlan_id: cannot parse '' as int: strconv.ParseInt: parsing "https://cloud.ibm.com/classic/network/vlans/2340305": invalid syntax private_vlan_id: cannot parse '' as int: strconv.ParseInt: parsing "https://cloud.ibm.com/classic/network/vlans/2340309": invalid syntax]

variable vlan_count {
description = "Set to 0 if using existing and 1 if deploying a new VLAN"
default = "0"
}

variable private_vlanid {
description = "ID of existing private VLAN to connect VSIs"
default = "https://cloud.ibm.com/classic/network/vlans/2340309"
}

variable public_vlanid {
description = "ID of existing public VLAN to connect VSIs"
default = "https://cloud.ibm.com/classic/network/vlans/2340305"
}

The values of private_vlanid and public_vlanid should be "2340309" and "2340305", not the complete URL:

variable private_vlanid {
description = "ID of existing private VLAN to connect VSIs"
default = "2340309"
}

variable public_vlanid {
description = "ID of existing public VLAN to connect VSIs"
default = "2340305"
}
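
If the VLAN is copied from the browser address bar, the numeric ID is the last path segment of the URL. A small helper along these lines could be used to reduce either form to the bare ID before pasting it into variables.tf. The function name is hypothetical, and it assumes the ID is always the final segment, as in the two URLs above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reduce a VLAN console URL (or a bare ID) to the numeric ID.
vlan_id_from_url() {
  local id="${1##*/}"   # strip everything up to and including the last "/"
  case "$id" in
    ''|*[!0-9]*)
      # Empty or non-numeric: refuse rather than pass a bad value on.
      echo "not a numeric VLAN ID: $1" >&2
      return 1
      ;;
    *)
      printf '%s\n' "$id"
      ;;
  esac
}

# Usage:
#   vlan_id_from_url "https://cloud.ibm.com/classic/network/vlans/2340309"
```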
trbryant2 commented 5 years ago

I tried changing the variables back to that setting and reran the command. Still generated an error. I'm not a Terraform expert, but it looks to me like a security related issue, perhaps with my Softlayer account. Here is the output from when I changed the VLANs back to just the VLAN number.

TutorialErrors.txt

As an experiment, I tried changing the setting of the vlan_count variable to 1 instead of re-using the existing VLANs. I wanted to see if there was some security-related configuration with VLANs that I'm not aware of. I re-ran the test and still was not successful, although the error messages were different. Here's the result of that test. TutorialErrors_2.txt

trbryant2 commented 5 years ago

I've finally found the cause of my errors. I have been using my IBM email address as my Softlayer USERID. It turns out that I need to use the "User Name" value rather than the email address that I use to login to IBM Cloud. Also, I needed to use the "Classic Infrastructure API Key" rather than the "Platform API Key".

softlayerid

Nadine2016 commented 5 years ago

this is great feedback. I'll make sure to make that more clear in the instructions.

trbryant2 commented 5 years ago

I've been able to successfully complete the make infrastructure. I have since moved on to the make bastion step and have run into issues with packages not matching repository ID's. See the attached file for the errors.

I suspect this is due to the RedHat USERID that I'm logging on with. The reason I say that is that I have a RedHat ID associated with the IBM account number 6059103 as well as a trial developer account. Both cause errors, however the "Pool ID's" that are reported are different between each of the user ID's. Not being well versed with Red Hat and accessing various resources from their package servers, how can I recover from this error and continue with the installation?

TutorialResultsAfter_make_bastion.txt

hkantare commented 5 years ago

Yes, the pool IDs are different for each user. We are trying to move that parameter to variables.tf rather than hardcoding it in the script rhn_register.sh: https://github.com/IBM-Cloud/terraform-ibm-openshift/blob/master/scripts/rhn_register.sh. As a workaround, can you please replace the pool ID at https://github.com/IBM-Cloud/terraform-ibm-openshift/blob/2f60fa1c2a6a2d337e752061252ba34eda880592/scripts/rhn_register.sh#L17 with your pool ID?

Here are the steps to find the pool ID easily. Log in to the bastion node that was provisioned by make infrastructure:

[/go/bin/terraform-ibm-openshift #] ssh root@$(terraform output bastion_public_ip)

# Answer "yes" to security questions that are presented on first login

# Result: You should now be logged in as 'root@bastion-ose-<some-hexadecimal>'

[root@bastion-ose-6d6c9f1491 ~]#

# Register the bastion host with Red Hat servers

# You'll now register the bastion host with Red Hat. These next four commands are directly from the first part of the file terraform-ibm-openshift/scripts/rhn_register.sh.

subscription-manager unregister

rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release

 # Substitute your RH username and password for the two variables in the following line (the line may wrap):

subscription-manager register --serverurl subscription.rhsm.redhat.com:443/subscription --baseurl cdn.redhat.com --username $uid --password $pswd

subscription-manager list --available --matches '*OpenShift Container Platform*'

# A 'Pool ID' should be listed in the output. Record that value.
Sample output:

+-------------------------------------------+

    Available Subscriptions

+-------------------------------------------+

Subscription Name:   30 Day Self-Supported Red Hat OpenShift Container Platform, 2-Core Evaluation

Provides:            Red Hat Ansible Engine

                     Red Hat Software Collections (for RHEL Server for IBM Power LE)

                     Red Hat OpenShift Enterprise Infrastructure

                     Red Hat JBoss Core Services

                     Red Hat Enterprise Linux Fast Datapath

                     Red Hat OpenShift Container Platform for Power

                     JBoss Enterprise Application Platform

:

                     Red Hat OpenShift Container Platform Client Tools for Power

                     Red Hat Enterprise Linux Fast Datapath (for RHEL Server for IBM Power LE)

                     Red Hat OpenShift Enterprise JBoss EAP add-on

                     Red Hat OpenShift Container Platform

                     Red Hat Gluster Storage Management Console (for RHEL Server)

                     Red Hat OpenShift Enterprise JBoss A-MQ add-on

                     Red Hat Enterprise Linux for Power, little endian Beta

                     Red Hat OpenShift Enterprise Client Tools

:

                     Red Hat OpenShift Enterprise Application Node

:

                     Red Hat OpenShift Service Mesh

:

                     Red Hat OpenShift Enterprise JBoss FUSE add-on

SKU:                 SER0419

Contract:            11812284

Pool ID:             8a_______obfuscated_______89

Provides Management: Yes

Available:           10

Suggested:           1

Service Level:       Self-Support

Service Type:        L1-L3

Subscription Type:   Stackable

Starts:              12/03/2018

Ends:                01/02/2019

System Type:         Physical
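
To avoid copying the value by hand, the "Pool ID:" line can be filtered out of that listing. This is a sketch that assumes the output keeps the "Label: value" layout shown in the sample above; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every Pool ID found in `subscription-manager list` output.
# Reads the listing on stdin; assumes the "Pool ID:   <value>" layout of the sample above.
extract_pool_ids() {
  # Split each line at the first colon plus any following spaces,
  # then print the value of lines whose label is exactly "Pool ID".
  awk -F': *' '$1 == "Pool ID" { print $2 }'
}

# Usage on the bastion host:
#   subscription-manager list --available --matches '*OpenShift Container Platform*' | extract_pool_ids
```

If more than one subscription matches, one ID per line is printed, and you still have to pick the pool that belongs to the subscription you intend to use.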
trbryant2 commented 5 years ago

Thank you. I've been able to continue the installation, however I've run into another issue that I would appreciate assistance with.

I started to run the make openshift command. This appeared to make good progress, but after two hours of running I had to leave my workstation for a while and the SSH session timed out after the script paused to post a prompt for my ibm_sl_api_key. When I re-logged into my server and restarted the docker container that was running the command I wasn't sure how to proceed. I tried to re-run the make openshift command but I received the following error.

make_openshift_restart_errror

I'm not sure if I need to restart from scratch, or if there's something I can do to restart at the point where the timeout occurred.

Can you suggest an appropriate action?

hkantare commented 5 years ago

Since the session was terminated midway, the Terraform lock was not released. One option is to bypass the stale lock by adding the option --lock=false at the end of the two lines below, and then run the make openshift command again:

https://github.com/IBM-Cloud/terraform-ibm-openshift/blob/2f60fa1c2a6a2d337e752061252ba34eda880592/makefile#L18

https://github.com/IBM-Cloud/terraform-ibm-openshift/blob/2f60fa1c2a6a2d337e752061252ba34eda880592/makefile#L27

Like below:

terraform init && terraform get && terraform apply --target=module.pre_install --lock=false
-----
-----
terraform init && terraform get && terraform apply --target=module.post_install --lock=false
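
An alternative to disabling locking is to release the stale lock itself with terraform force-unlock, which takes the lock ID printed in the error message. The sketch below pulls that ID out of saved error output. It assumes the error contains a "Lock Info:" section with an "ID:" line, which may differ between Terraform versions; the function name and the saved file name are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the lock ID from a Terraform state-lock error.
# Reads the error text on stdin; assumes a line of the form "  ID:   <lock-id>".
lock_id_from_error() {
  # Print the second field of the first line whose first field is "ID:".
  awk '$1 == "ID:" { print $2; exit }'
}

# Usage, assuming the failed run's output was saved to a file such as lock-error.txt:
#   id=$(lock_id_from_error < lock-error.txt)
#   terraform force-unlock "$id"
```

Only release a lock this way when you are sure no other Terraform process is still running against the same state.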
trbryant2 commented 5 years ago

I ran into additional errors and decided to completely restart from scratch. I've been able to successfully reach the point where I was able to start Lesson 3 Step 2, "make openshift" however the installation has hung for over an hour at the point shown in the attached screen shot.

tutorial_error

I'm not sure how to recover from this point.

Also, FWIW, Lesson 3, Step 1 "make rhn_username= rhn_password= bastion" also required me to specify the pool_id in the command string.

trbryant2 commented 5 years ago

Sorry, I didn't mean to close the comment.

hkantare commented 5 years ago

Yes, we moved pool_id to an argument accepted from the user instead of hardcoding it in rhn_register.sh, so use the pool_id that you retrieved earlier by following the previous comment: https://github.com/IBM-Cloud/terraform-ibm-openshift/issues/8#issuecomment-461287008

trbryant2 commented 5 years ago

Yes, I was able to get past that point by including it on the command line. However I was never able to get past the never ending "still creating..." prompts. I tried to kill the process and restart it, but without success. At this point I'm tearing down all of the servers and plan to re-attempt from square one. Since I don't have experience with terraform and ansible, do you have recommendations on how I can troubleshoot issues like these so I don't have to keep bothering you and your staff?

trbryant2 commented 5 years ago

I've just finished recreating the environment and re-ran the tutorial. I've hit the same issue. That is, while running the make openshift command the message stream eventually got stuck for an extended period of time (I finally went to bed in the evening and in the morning found it had stopped) on the screen that you see in the attachment.

How can I troubleshoot and resolve this issue? It seems to be a repeatable error. tutorialerror

hkantare commented 5 years ago

Most likely it is an issue with the SSH session: https://www.a2hosting.in/kb/getting-started-guide/accessing-your-account/keeping-ssh-connections-alive

Can you try to increase the timeout in the container where you invoke this command?
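
Keepalive settings do not have to be tied to a named server: a wildcard Host entry in the OpenSSH client configuration applies to every host the client connects to. The sketch below appends such an entry. ServerAliveInterval and ServerAliveCountMax are standard OpenSSH client options, while the function name and the chosen values (60 seconds, 10 missed replies) are only illustrative. It covers sessions opened with the ssh client from the container; whether it also affects the connections Terraform opens on its own is not established here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add a keepalive default for all hosts to an OpenSSH client config.
add_ssh_keepalive() {
  config_file="${1:-$HOME/.ssh/config}"
  mkdir -p "$(dirname "$config_file")"

  # Do nothing if a keepalive interval is already configured in this file.
  if [ -f "$config_file" ] && grep -q '^[[:space:]]*ServerAliveInterval' "$config_file"; then
    return 0
  fi

  # "Host *" matches every server, so no host names need to be known in advance.
  cat >> "$config_file" <<'EOF'

Host *
    ServerAliveInterval 60
    ServerAliveCountMax 10
EOF
  chmod 600 "$config_file"
}

# Usage (inside the container, before starting the long-running command):
#   add_ssh_keepalive
```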

trbryant2 commented 5 years ago

I looked at trying to configure the SSH session, but the link that you provided doesn't appear to apply to what I'm trying to do. If I understand the link correctly, I'll need to know the name of the server that I plan to connect to. Since a number of different servers are created as part of the Terraform playbook, how do I configure the environment such that the timeout values are set for each server?

Nadine2016 commented 5 years ago

@trbryant2 We had another user report the same issue. For him, it was resolved by performing a soft reboot of the affected virtual server. Go to Infrastructure -> Devices -> Device List, select the affected server, and choose the "reboot" option. When the dialog opens, you can choose a "soft" reboot. The issue seems to be that the RHEL-level docker service fails to shut down, which leads to the failure that you see. Maybe try it out and see if that helps. Thanks.

trbryant2 commented 5 years ago

I retried creating the environment. The installation got hung on the "module.pre_install.null_resource.pre_install: Still creating..." step for almost 3 hours. I tried doing the "soft reboot" option on the server. It hung on rebooting and I was never able to ping or reach the server again.

I'm afraid that this tutorial simply isn't reliable enough in its current form. I cannot recommend this to any colleagues or customers without being able to successfully complete the installation.

Nadine2016 commented 5 years ago

@trbryant2 Could you try rebooting your VSIs again? We think that your issue might be caused by an issue on the IBM Cloud infrastructure side (aka Softlayer) that is not based on the components that you provision as part of this tutorial. If the VSI does not come up, could you raise a support ticket and provide the affected VSI so that our support team can look into it? Instructions are described here: https://cloud.ibm.com/docs/get-support?topic=get-support-getting-customer-support#getting-customer-support

We apologize for the frustration and are working on improving the configuration files and the documentation to make it a smoother experience. If your error is related to an issue that we are not (yet) aware of, we want to know about it to get that fixed as well.

trbryant2 commented 5 years ago

I've been trying to create an OpenShift configuration using two different approaches:

1. Creating the configuration "by hand" using directions from Red Hat.
2. Using the tutorial that I've been working with your team on.

I've finally had success building the configuration by hand.

I've never had luck using the tutorial. And, by attempting to use the tutorial a number of times (I've lost track; at least 5, maybe more), I've created a number of virtual servers and related assets that have been driving my IBM Cloud charges up to an extremely high level. I know that these are internal billings, but it has caught the eye of my management as they are wondering what I've been doing to cause such high charges.

Considering the time that it takes to attempt to use the tutorial to create the configuration (more than half a day for the scripts to run), and the internal charges from provisioning a number of IBM Cloud assets with no success, I really can't afford to spend any more time working on this. My ultimate goal is to document the process for installing IBM Cloud Private on top of Red Hat OpenShift. I'm past the first step of provisioning an OpenShift cluster. I now need to focus on installing IBM Cloud Private on that cluster.

Thank you for all of the assistance you've provided. I'm disappointed that I couldn't get things to work, but at least now I have an environment that I can continue my work with.

Regards, Timm R. Bryant Hybrid Cloud Integration Technical Sales Specialist IBM Global Markets

Phone: 1-402-320-9260 E-mail: trbryant@us.ibm.com 2930 Ridge Line Road Lincoln, NE 68516
