aws-samples / aws-refarch-wordpress

This reference architecture provides best practices and a set of YAML CloudFormation templates for deploying WordPress on AWS.
MIT No Attribution
1.08k stars 609 forks source link

ELB Health Check Issue #7

Closed michael-newman closed 6 years ago

michael-newman commented 6 years ago

When I run the Master Template, it fails and does a rollback at the point of the Web Template. As such, I ran the Templates individually successfully until I run the Web Template where it fails and does a rollback (of just the web template).

The issue is new instances fail the Health Check, and as a result, the ASG launches another EC2 instance, another fails, it launches a new instance, etc... hence, causing the Web Template to fail and rollback.

As a result, I changed the Web Template from "HealthCheckType: ELB" to "HealthCheckType: EC2" and the Web Template runs successfully to completion and the ASG does not continually launch instances. As such, it appears the Heath Check is not visible from the ELB as it should be.

Aside from the above modification, the only other modification I made to the Templates was changing the subnet IP ranges from 10.0.x.x/xx to 10.10.x.x/xx. in the VPC Template.

Is there some other configuration I need to make as a result of adjusting the CIDR ranges? or are there any other configuration I need to make in the AWS Console for these Templates to execute in totality?

Would appreciate any/all guidance and thoughts re how to get the Health Check from the ELB working.

Thank you, Mike

michael-newman commented 6 years ago

Since the Web Template only completed with "HealthCheckType: EC2" I decided to test/change the ASG configuration in the console to "HealthCheckType: ELB"... Interesting, I have no issues now--I can adjust the ASG instance settings up/down and instances are launching/terminating as expected.

darrylsosborne commented 6 years ago

I'll take a look at this over the next few days and see if a change to resolve this will be pushing during my next push in the next few days.

michael-newman commented 6 years ago

Darryl, Know I'm currently working with AWS Tech Support as it appears the issue I was having with "HealthCheckType: ELB" are timeout issues due to latency within the site. Not sure why, yet, but the site is slow and that was causing HealthChecks to timeout/fail. If the root cause is something related to the CloudFormation Template (versus our wordpress site) I'll post it here.

darrylsosborne commented 6 years ago

Michael. Have you been able to resolve your ELB healthcheck issue?

michael-newman commented 6 years ago

Darryl,

Not yet... we have AWS techs reviewing ALB log files and Web Server packet captures as we speak--should have an answer soon.

-Mike

michael-newman commented 6 years ago

Darryl,

FYI, AWS Techs informed me this morning they are going to now involve a CloudFormation engineer in my case--current thinking is still high latency is causing Health Checks to Fail, and they are trying to determine root cause.

Has anyone reported to you any latency issues with this setup that I could bring to the attention of the techs working the case? Particularly, once a site is migrated over or a site is heavily built out?

-Mike

darrylsosborne commented 6 years ago

Mike,

The only issue we're aware of is related to EFS file systems running out of burst credits which drops the permitted throughput to the baseline throughput of 50 MiB/s per TiB of storage (or 50 KiB/s per GiB). If the WordPress site has a small amount of file system storage on EFS, it doesn't earn enough credits to sustain the throughput it needs and eventually the burst credit balance drops to zero. File systems earn credits at the baseline throughput rate and consume credits at the throughput driven by the site. If you're using more than your earning, eventually it will drop to zero. Performance may be great initially but when the burst credit balance drops to zero and the permitted throughput changes to the baseline throughput, the site experiences performance issues when the file system is unable to achieve the necessary throughput it demands. That's one of the reasons I built the burst credit balance alarms and dashboard, so customers can monitor and get notified on this condition.

michael-newman commented 6 years ago

Darryl,

FYI, The AWS Techs started to go down the EFS burst credit path as well, although I must say, we had the latency issues before the burst credit issues which basically shut down the site. We started our effort on V1.0 and were deep into trouble shooting while you released v2.0.x--so with the holidays and an unexpected priority, we decided to delete v1.0 and will restart with v2.0.1 later this month. As such, I'll close this issue.

Very excited about what you are doing here, can't wait to pick this up again in a couple of weeks.

Regards, Mike