IBM-Cloud / terraform-provider-ibm

https://registry.terraform.io/providers/IBM-Cloud/ibm/latest/docs
Mozilla Public License 2.0

Terraform crashed while creating load balancers with protocols #295

Closed sshingarapu closed 5 years ago

sshingarapu commented 6 years ago

Hi there,

Affected Resource(s)

Panic Output

https://gist.github.com/sshingarapu/a9af5f01222d3ecda466e33a946c932e

We are seeing a Terraform crash (panic output) when passing protocols from modules to LB resources and attaching instances to the LB. This was tested with the latest Terraform IBM provider (v0.10.0).

The crash also occurs without the instance attachment.

module "instance" {
  // module for instance creation
}

module "infranodes01ragsr01-AppExt" {
  source = "lbaas"

  protocols = [
    {
      frontend_protocol     = "HTTPS"
      frontend_port         = 443
      backend_protocol      = "HTTP"
      backend_port          = 443
      load_balancing_method = "round_robin"
      tls_certificate_id    = 182195
    },
    {
      frontend_protocol     = "HTTP"
      frontend_port         = 80
      backend_protocol      = "HTTP"
      backend_port          = 80
      load_balancing_method = "round_robin"
    },
  ]
}

module "infranodes01ragsr01-AppInt" {
  source = "lbaas"

  protocols = [
    {
      frontend_protocol     = "TCP"
      frontend_port         = 80
      backend_protocol      = "TCP"
      backend_port          = 80
      load_balancing_method = "round_robin"
    },
  ]
}

lbaas/main.tf:

resource "ibm_lbaas" "lbaas" {
  name        = "terraformLBExample"
  description = "lbaas example"
  subnets     = ["1511875"]
  protocols   = ["${var.protocols}"]
}

resource "ibm_lbaas_server_instance_attachment" "server_attach" {
  count              = "${var.count}" // number of instances to attach to LB
  private_ip_address = "${element(var.private_ip_address,count.index)}" // IP addresses of instances
  lbaas_id           = "${ibm_lbaas.lbaas.id}"
}

Praveengostu commented 6 years ago

The issue is fixed as part of https://github.com/IBM-Cloud/terraform-provider-ibm/issues/286 and the fix will be available in the next release. You can get the latest code from the public repository and build the provider plugin with the fix.

sshingarapu commented 6 years ago

@Praveengostu, I have built the provider plugin from the public git repository and tried the LB creation with protocols as above. Now I don't see the crash, but the load balancers are not getting created. It takes more than 1 hr and finally throws the below error, and I also cannot destroy them.

Praveengostu commented 6 years ago

@sshingarapu The default timeout is 90m, so it would have taken more than 90m. Usually creation takes less than 30m. Please let us know if this occurs consistently, so that we can take it up with the API team.
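
For reference, a minimal sketch of how the wait could be shortened so that a stuck create fails sooner. It assumes ibm_lbaas accepts Terraform's standard timeouts block, which the 90m default suggests but which this thread does not confirm for v0.10.0. The 30m value is illustrative only.

```hcl
resource "ibm_lbaas" "lbaas" {
  name        = "terraformLBExample"
  description = "lbaas example"
  subnets     = ["1511875"]
  protocols   = ["${var.protocols}"]

  # Assumption: the resource supports the standard timeouts block.
  # 30m is an illustrative value, not a recommendation from this thread.
  timeouts {
    create = "30m"
  }
}
```

A shorter timeout does not make creation faster; it only surfaces the failure earlier than the 90m default would.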

sshingarapu commented 6 years ago

@Praveengostu Yes, it is happening consistently. Ideally it should not take more than 3-4 mins, but it is taking the maximum default timeout.

Praveengostu commented 6 years ago

@sshingarapu Could you please share the load balancer name and your account number so that we can check this further?

sshingarapu commented 6 years ago

@Praveengostu I have tried load balancers named Test3 and Test4. They now appear as online, but I got the error when running terraform apply and they were offline for a long time. Account number: 357686_shankar.shingarapu@ca.com

sshingarapu commented 6 years ago

@Praveengostu Also, health checks are not defined in the load balancers.

Praveengostu commented 6 years ago

Thanks @sshingarapu. For health checks, please check the resource ibm_lbaas_health_monitor: https://ibm-cloud.github.io/tf-ibm-docs/v0.10.0/r/lbaas_health_monitor.html

sshingarapu commented 6 years ago

@Praveengostu I have configured health monitors in the terraform file but they did not get created. This issue (health monitors not getting created) has happened only once, and I think it is because creation of the load balancers is taking so long.

Praveengostu commented 6 years ago

@sshingarapu Shared the details with the API team. Will get back to you once we get an update from them.

Praveengostu commented 6 years ago

@sshingarapu The issue is fixed by the API team. Please let us know if you see the issue again.

sshingarapu commented 6 years ago

@Praveengostu Thanks for the fix. I will build the provider again and let you know the status.

sshingarapu commented 6 years ago

@Praveengostu I have tested it with the new build and observed a couple of issues. Load balancers are getting created, but the health monitors cannot be added to the load balancers. I am getting the below error.

Below is my load balancer main.tf

resource "ibm_lbaas" "lbaas" {
  name      = "${var.name}"
  subnets   = ["1629415"]
  type      = "${var.type}"
  protocols = ["${var.protocols}"]
}

resource "ibm_lbaas_server_instance_attachment" "server_attach" {
  count              = "${var.count}"
  private_ip_address = "${element(var.private_ip_address,count.index)}"
  lbaas_id           = "${ibm_lbaas.lbaas.id}"
}

resource "ibm_lbaas_health_monitor" "lbaas_hm" {
  protocol    = "${ibm_lbaas.lbaas.health_monitors.0.protocol}"
  port        = "${ibm_lbaas.lbaas.health_monitors.0.port}"
  timeout     = 3
  interval    = 5
  max_retries = 6
  url_path    = "/"
  lbaas_id    = "${ibm_lbaas.lbaas.id}"
  monitor_id  = "${ibm_lbaas.lbaas.health_monitors.0.monitor_id}"
}

Praveengostu commented 6 years ago

The server_attach and lbaas_hm resources cannot run in parallel because both change the status of the LB. You can add depends_on to handle this:

resource "ibm_lbaas" "lbaas" {
  name      = "${var.name}"
  subnets   = ["1629415"]
  type      = "${var.type}"
  protocols = ["${var.protocols}"]
}

resource "ibm_lbaas_server_instance_attachment" "server_attach" {
  count              = "${var.count}"
  private_ip_address = "${element(var.private_ip_address,count.index)}"
  lbaas_id           = "${ibm_lbaas.lbaas.id}"
}

resource "ibm_lbaas_health_monitor" "lbaas_hm" {
  protocol    = "${ibm_lbaas.lbaas.health_monitors.0.protocol}"
  port        = "${ibm_lbaas.lbaas.health_monitors.0.port}"
  timeout     = 3
  interval    = 5
  max_retries = 6
  url_path    = "/"
  lbaas_id    = "${ibm_lbaas.lbaas.id}"
  monitor_id  = "${ibm_lbaas.lbaas.health_monitors.0.monitor_id}"
  depends_on  = ["ibm_lbaas_server_instance_attachment.server_attach"]
}

sshingarapu commented 6 years ago

@Praveengostu It worked. But the url_path given above ("/") is not applied to the health checks in the LB, and I am unable to set it manually in the console either. Currently there is no value for PATH.

sshingarapu commented 6 years ago

@Praveengostu One observation is that the url_path in health checks is not getting updated when the protocol is TCP.

Praveengostu commented 6 years ago

@sshingarapu There is no url_path in health checks when the protocol is TCP. Could you please help us understand your use case and how you are using the cloud components?

sshingarapu commented 6 years ago

@Praveengostu We are building an OpenShift environment on Bluemix virtual machines, creating external and internal load balancers with health checks for monitoring purposes. In the load balancers we currently use the TCP protocol. Basically we would like to monitor the health of our environment by defining the health monitors. If TCP has no PATH url, then what default path will be used for health checks?

sakshiag commented 6 years ago

The health checks against HTTP and TCP ports are conducted as follows:

HTTP: An HTTP GET request against a pre-specified URL is sent to the back-end server port. The server port is marked healthy upon receiving a 200 OK response. The default GET URL is “/” via the GUI, and it can be customized.

TCP: The Load Balancer attempts to open a TCP connection with the back-end server on a specified TCP port. The server port is marked healthy if the connection attempt is successful, and the connection is then closed.
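
A sketch of what this means for the configuration used earlier in this thread: when the monitored protocol is TCP, url_path does not apply and can be left out. The resource name lbaas_hm_tcp is hypothetical, and whether the provider rejects or silently ignores url_path for TCP is not stated here.

```hcl
# Hypothetical health monitor for a load balancer whose protocol is TCP.
resource "ibm_lbaas_health_monitor" "lbaas_hm_tcp" {
  # Protocol, port and monitor ID are read from the load balancer,
  # as in the configuration posted earlier in the thread.
  protocol    = "${ibm_lbaas.lbaas.health_monitors.0.protocol}"
  port        = "${ibm_lbaas.lbaas.health_monitors.0.port}"
  timeout     = 3
  interval    = 5
  max_retries = 6
  lbaas_id    = "${ibm_lbaas.lbaas.id}"
  monitor_id  = "${ibm_lbaas.lbaas.health_monitors.0.monitor_id}"

  # No url_path: a TCP check only opens and closes a connection.
  depends_on = ["ibm_lbaas_server_instance_attachment.server_attach"]
}
```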

You can refer to this doc for more details: https://console.bluemix.net/docs/infrastructure/loadbalancer-service/health-checks.html#health-checks

From your previous comment, I see you are trying to set up OpenShift on Bluemix VMs. We have also tried doing the same with a basic architecture (https://github.com/IBM-Cloud/terraform-ibm-openshift/) using virtual machines and security groups. Can you help us understand the approach you are following and the architecture you are trying to deploy?

sshingarapu commented 6 years ago

Thanks @sakshiag, @Praveengostu for your help on this. We still have an issue with load balancers, where we see the following error when destroying. Upon issuing another destroy the resource eventually gets deleted.

* ibm_lbaas_server_instance_attachment.server_attach.0: Error removing server instances: sl.Error{StatusCode:500, Exception:"SoftLayer_Exception_Network_LBaaS_ObjectInInvalidState", Message:"Load balancer uuid=3d658488-12c1-43da-89f4-dfdf700a6697 cannot be updated. The object is in state UPDATE_PENDING.", Wrapped:error(nil)}

Also there is an intermittent issue with LB creation and server attachment, with the following error:

* module.masternodes01ragsr01-MasterInt.ibm_lbaas_server_instance_attachment.server_attach[0]: 1 error(s) occurred:
* ibm_lbaas_server_instance_attachment.server_attach.0: Error adding server instances: sl.Error{StatusCode:500, Exception:"SoftLayer_Exception_Network_LBaaS_ObjectInInvalidState", Message:"Load balancer uuid=13d28131-8176-426a-bf41-30f26a7e2660 cannot be updated. The object is in state UPDATE_PENDING.", Wrapped:error(nil)}

Our architecture is pretty much the same except for the fact that we have multiple masters which are behind a load balancer for obvious reasons. Also, we have more SG's to support our various teams here.

Praveengostu commented 6 years ago

@sshingarapu Thanks for reporting the issue. Will check on this. I assume the issue is reproducible with the same config:

resource "ibm_lbaas" "lbaas" {
  name      = "${var.name}"
  subnets   = ["1629415"]
  type      = "${var.type}"
  protocols = ["${var.protocols}"]
}

resource "ibm_lbaas_server_instance_attachment" "server_attach" {
  count              = "${var.count}"
  private_ip_address = "${element(var.private_ip_address,count.index)}"
  lbaas_id           = "${ibm_lbaas.lbaas.id}"
}

resource "ibm_lbaas_health_monitor" "lbaas_hm" {
  protocol    = "${ibm_lbaas.lbaas.health_monitors.0.protocol}"
  port        = "${ibm_lbaas.lbaas.health_monitors.0.port}"
  timeout     = 3
  interval    = 5
  max_retries = 6
  url_path    = "/"
  lbaas_id    = "${ibm_lbaas.lbaas.id}"
  monitor_id  = "${ibm_lbaas.lbaas.health_monitors.0.monitor_id}"
  depends_on  = ["ibm_lbaas_server_instance_attachment.server_attach"]
}

sshingarapu commented 6 years ago

@Praveengostu Yes. In our case we have 6 load balancers and attaching 2-3 servers in each load balancer. I hope you can reproduce the issue with this configuration.

sshingarapu commented 6 years ago

@Praveengostu Could you please share an update on this? I am getting this issue frequently now.

Praveengostu commented 6 years ago

@sshingarapu Currently looking into this. Could you please share your configuration file? Please enable the debug log with export TF_LOG=debug and share the log the next time you encounter the issue.

Praveengostu commented 6 years ago

@sshingarapu I could not recreate the issue after adding the depends_on between the server attachment and the health monitor. Here is the tf configuration main.tf (https://gist.github.com/Praveengostu/bd732a547b251c120e41dff9ef00366a), which creates 6 lbaas with 2 server attachments each. Here is the terraform_output (https://gist.github.com/Praveengostu/6634847498fde1dae7e5a6ba82541364), which contains the output of apply, show and destroy. Could you please share your configuration and log so that we can understand the issue?

sshingarapu commented 6 years ago

@Praveengostu I have included depends_on in the server_attach resource and since then I don't see this error as frequently. Today I got it again, but I had not set debug. I will set debug next time and give you the logs if it fails.

resource "ibm_lbaas_server_instance_attachment" "server_attach" {
  count              = "${var.count}"
  private_ip_address = "${element(var.private_ip_address,count.index)}"
  lbaas_id           = "${ibm_lbaas.lbaas.id}"
  depends_on         = ["ibm_lbaas.lbaas"]
}

This is how we configure in our case.

main.tf:

  1. We have a module definition for each server (10 servers in total) and for each load balancer (6 load balancers in total), with different parameters, like below:

module "instance1" {
  source               = "./modules/casaas-terraform-modules/casaas-bmx-instance"
  private_network_only = "false"
  hostname             = "instance1"
}

module "instance2" {
  source               = "./modules/casaas-terraform-modules/casaas-bmx-instance"
  private_network_only = "true"
  hostname             = "instance2"
}

// Load Balancers
module "infranodes01ragsr01-AppExt" {
  source                = "./modules/casaas-terraform-modules/casaas-bmx-lb"
  name                  = "infranodes01ragsr01-AppExt"
  type                  = "PUBLIC"
  count                 = "2"
  health_check_interval = "10"
  health_check_path     = "/healthz"
  health_check_port     = "1936"
  health_check_timeout  = "5"
  health_check_protocol = "HTTP"

  protocols = [
    {
      frontend_protocol     = "TCP"
      frontend_port         = 443
      backend_protocol      = "TCP"
      backend_port          = 443
      session_stickiness    = "SOURCE_IP"
      load_balancing_method = "round_robin"
    },
  ]
}

module "infranodes01ragsr01-AppInt" {
  source                = "./modules/casaas-terraform-modules/casaas-bmx-lb"
  name                  = "infranodes01ragsr01-AppInt"
  type                  = "PRIVATE"
  count                 = "2"
  health_check_interval = "10"
  health_check_path     = "/healthz"
  health_check_port     = "1936"
  health_check_timeout  = "5"
  health_check_protocol = "HTTP"

  protocols = [
    {
      frontend_protocol     = "TCP"
      frontend_port         = 443
      backend_protocol      = "TCP"
      backend_port          = 443
      session_stickiness    = "SOURCE_IP"
      load_balancing_method = "round_robin"
    },
  ]
}

Could you please try it this way and check whether you can reproduce the issue?

sshingarapu commented 6 years ago

@Praveengostu We are getting the issue we mentioned earlier while destroying. Here is the terraform output with DEBUG enabled: https://gist.github.com/sshingarapu/e927f9b2882779e9c94781a8584db719. I will let you know in case I get the issue while applying.

Praveengostu commented 6 years ago

@sshingarapu I see that module.masternodes01ragsr01-MasterExt.ibm_lbaas_server_instance_attachment.server_attach[2] fails when it attempts to destroy the server attachment because the load balancer state is pending. Mostly this occurs if a dependency is missing. Could you please share your configuration so that we can help you with a permanent resolution, as we are not able to recreate this?

Praveengostu commented 6 years ago

@sshingarapu One more point: if there are multiple ibm_lbaas_server_instance_attachment resources, there should be a dependency declared between them, as each of them changes the state of the load balancer.
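
A minimal sketch of such a chain, assuming the attachments are written as separate resources rather than with count. The resource names server_attach_a and server_attach_b are hypothetical; var.private_ip_address is the list used elsewhere in this thread.

```hcl
# Hypothetical split of the attachments into two separate resources.
resource "ibm_lbaas_server_instance_attachment" "server_attach_a" {
  private_ip_address = "${element(var.private_ip_address, 0)}"
  lbaas_id           = "${ibm_lbaas.lbaas.id}"
}

resource "ibm_lbaas_server_instance_attachment" "server_attach_b" {
  private_ip_address = "${element(var.private_ip_address, 1)}"
  lbaas_id           = "${ibm_lbaas.lbaas.id}"

  # Wait for the first attachment so the two never update the LB at once.
  depends_on = ["ibm_lbaas_server_instance_attachment.server_attach_a"]
}
```

Instances created through count on a single resource cannot be ordered against each other with depends_on, so this only helps when the attachments are declared individually.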

sshingarapu commented 6 years ago

@Praveengostu I have given the configuration details in my earlier comments. Please let me know if that is not enough to debug the issue.

All load balancer module definitions are like the below but with different values for name, type, count etc..

// Load Balancers
module "masternodes01ragsr01-MasterExt" {
  source                = "./modules/casaas-terraform-modules/casaas-bmx-lb"
  name                  = "ragsr01-MasterExt"
  lbaas_subnet          = "${module.masternodes01.lbaas_subnet}"
  type                  = "PUBLIC"
  count                 = "3" // number of servers to attach
  private_ip_address    = "${module.masternodes01.private_ips}" // list of server IPs to attach to the load balancer
  health_check_interval = "30"
  health_check_path     = "/healthz"
  health_check_port     = "8443"
  health_check_timeout  = "10"
  health_check_protocol = "HTTP"

  protocols = [
    {
      frontend_protocol     = "TCP"
      frontend_port         = 443
      backend_protocol      = "TCP"
      backend_port          = 8443
      session_stickiness    = "SOURCE_IP"
      load_balancing_method = "round_robin"
    },
  ]
}

source:

resource "ibm_lbaas" "lbaas" {
  name      = "${var.name}"
  subnets   = ["1629415"]
  type      = "${var.type}"
  protocols = ["${var.protocols}"]
}

resource "ibm_lbaas_server_instance_attachment" "server_attach" {
  count              = "${var.count}" // this value is provided in the load balancer module definition to attach servers
  private_ip_address = "${element(var.private_ip_address,count.index)}" // this value is provided in the load balancer module definition
  lbaas_id           = "${ibm_lbaas.lbaas.id}"
  depends_on         = ["ibm_lbaas.lbaas"]
}

resource "ibm_lbaas_health_monitor" "lbaas_hm" {
  protocol    = "${ibm_lbaas.lbaas.health_monitors.0.protocol}"
  port        = "${ibm_lbaas.lbaas.health_monitors.0.port}"
  timeout     = 3
  interval    = 5
  max_retries = 6
  url_path    = "/"
  lbaas_id    = "${ibm_lbaas.lbaas.id}"
  monitor_id  = "${ibm_lbaas.lbaas.health_monitors.0.monitor_id}"
  depends_on  = ["ibm_lbaas_server_instance_attachment.server_attach"]
}
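
The module calls above pass health_check_interval, health_check_timeout and health_check_path, while the source shown hardcodes timeout, interval and url_path. A sketch of how those inputs could be wired through, assuming the module declares variables with these names (the declarations are not shown in this thread):

```hcl
# Assumed variable declarations inside the load balancer module.
variable "health_check_interval" {}
variable "health_check_timeout" {}
variable "health_check_path" {}

resource "ibm_lbaas_health_monitor" "lbaas_hm" {
  protocol    = "${ibm_lbaas.lbaas.health_monitors.0.protocol}"
  port        = "${ibm_lbaas.lbaas.health_monitors.0.port}"
  timeout     = "${var.health_check_timeout}"
  interval    = "${var.health_check_interval}"
  max_retries = 6

  # Only meaningful for HTTP checks; TCP checks have no path.
  url_path   = "${var.health_check_path}"
  lbaas_id   = "${ibm_lbaas.lbaas.id}"
  monitor_id = "${ibm_lbaas.lbaas.health_monitors.0.monitor_id}"
  depends_on = ["ibm_lbaas_server_instance_attachment.server_attach"]
}
```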

Praveengostu commented 6 years ago

@sshingarapu Sure, Will check and get back to you.

hkantare commented 6 years ago

@sshingarapu Since we are using count in ibm_lbaas_server_instance_attachment, the instances run in parallel, and sometimes two or more of them may call the delete API at the same time and fail with "UPDATE_PENDING". One way to solve this is to limit parallelism: terraform destroy -parallelism=1 destroys the resources one by one.

sshingarapu commented 6 years ago

@Praveengostu Can we use parallelism while applying as well? We are seeing the below intermittent issue during terraform apply. I have not seen it recently, but I am sure I will get this error again.

Also, when we use parallelism, will all the resources be created one by one? If yes, it may take up to 1 hr when spinning up more VMs (10-15).

hkantare commented 6 years ago

Yes, parallelism can be applied to plan and apply as well. With -parallelism=1, all resources are created one by one. Another approach is to break terraform apply into multiple steps:

1) terraform apply -target=module.vms -target=module.xxx (creates all resources that do not depend on lbaas; no parallelism limit is set, so they run in parallel)
2) terraform apply -target=module.lbaas -parallelism=1 (creates the lbaas resources one at a time)
3) terraform apply

sshingarapu commented 6 years ago

I have tried destroy with parallelism and I no longer see the load balancer issue, but below is the error we always get while destroying. It says no rule with an ID of 1733885 exists, but it does exist in the terraform tfstate file.

Debug output updated at https://gist.github.com/sshingarapu/41baad2302d8a21a4d586424e59d6992

module.security_groups.ibm_security_group_rule.outbound_AlertLogicSecurityGroup_80[0] (destroy): 1 error(s) occurred:

hkantare commented 6 years ago

@sshingarapu Thanks for testing with parallelism. Can you please close this issue and open a new one to track the security groups and rules? Please provide the sample configuration you are using for the security group and rules.

hkantare commented 5 years ago

Closing the issue. If the issue still exists, please reopen it.

ramba07 commented 5 years ago

I got an issue similar to the one posted above, in Bluemix cloud creating an LB and attaching the instances to it through Terraform:

Also the below error:

Requesting any insight into this.

Thanks in advance.