cloudcaptainsh / cloudcaptain

Issue Tracker for CloudCaptain

deploy didn't start enough instances #230

Open emilburzo opened 5 years ago

emilburzo commented 5 years ago

Hello,

We just had a situation where there were 5 old running instances, but the deploy only started 2 new instances (which wasn't enough to handle the current traffic).

What could have caused this?

08:19:45.609 Configuring main-api to auto-scale between 2 and 20 c5.xlarge instances based on average CPU load over 300 seconds, scaling in at 30% and below, scaling out at 50% and above ...
08:19:49.078 Using configured security group: sg-c387c7ba (main-api-production-instance)
08:19:49.078 Creating Launch Template boxlt-jtheuer-production-main-api-2359 ...
08:19:49.078 Creating Auto Scaling Group boxasg-jtheuer-production-main-api-2359 ...
08:19:51.416 Creating Auto Scaling Policy boxasg-jtheuer-production-main-api-2359-scaleout-policy ...
08:19:51.416 Creating Auto Scaling Policy boxasg-jtheuer-production-main-api-2359-scalein-policy ...
08:19:52.577 Creating Cloud Watch Alarm boxasg-jtheuer-production-main-api-2359-scaleout-policy-alarm ...
08:19:52.577 Creating Cloud Watch Alarm boxasg-jtheuer-production-main-api-2359-scalein-policy-alarm ...
08:19:52.577 Waiting for Auto Scaling Group boxasg-jtheuer-production-main-api-2359 to launch 2 c5.xlarge Instances ...
08:19:56.246 Auto Scaling Group: i-0151bd6b5ae32e116 [Pending]
08:19:56.246 Auto Scaling Group: i-09e1c4665cf25df16 [Pending]
08:20:29.201 Auto Scaling Group: i-0151bd6b5ae32e116 [InService]
08:20:29.202 Auto Scaling Group: i-09e1c4665cf25df16 [InService]
08:20:31.525 Waiting for ELB to put instances in service ...
08:20:32.688 ELB: i-0151bd6b5ae32e116 [OutOfService] => Instance registration is still in progress.
08:20:32.689 ELB: i-09e1c4665cf25df16 [OutOfService] => Instance registration is still in progress.
08:20:39.660 ELB: i-0151bd6b5ae32e116 [OutOfService] => Instance has not passed the configured HealthyThreshold number of health checks consecutively.
08:20:39.661 ELB: i-09e1c4665cf25df16 [OutOfService] => Instance has not passed the configured HealthyThreshold number of health checks consecutively.
08:20:58.348 ELB: i-09e1c4665cf25df16 [InService]
08:21:00.657 ELB: i-0151bd6b5ae32e116 [InService]
08:21:00.657 Destroying all Instances in Auto Scaling Group boxasg-jtheuer-production-main-api-2345 ...
08:21:02.979 Destroying Cloud Watch Alarm boxasg-jtheuer-production-main-api-2345-scalein-policy-alarm ...
08:21:02.979 Destroying Cloud Watch Alarm boxasg-jtheuer-production-main-api-2345-scaleout-policy-alarm ...
08:21:02.979 Destroying Auto Scaling Policy boxasg-jtheuer-production-main-api-2345-scalein-policy ...
08:21:02.980 Destroying Auto Scaling Policy boxasg-jtheuer-production-main-api-2345-scaleout-policy ...
08:21:02.980 Destroying Auto Scaling Group boxasg-jtheuer-production-main-api-2345 ...
08:21:35.832 Destroying Launch Template boxlt-jtheuer-production-main-api-2345 ...
08:21:35.833 Destroying all Instances in Auto Scaling Group boxasg-jtheuer-production-main-api-2355 ...
08:22:06.476 Auto Scaling Group: i-029ccf4504b976991 [Terminating]
08:22:07.643 Auto Scaling Group: i-043869960342b1d2d [Terminating]
08:22:07.643 Auto Scaling Group: i-0aecfb9b677899ff6 [Terminating]
08:22:07.643 Auto Scaling Group: i-035342cbcd323c063 [Terminating]
08:22:07.643 Auto Scaling Group: i-00df365781e56f0c3 [Terminating]
08:24:12.827 Destroying Cloud Watch Alarm boxasg-jtheuer-production-main-api-2355-scalein-policy-alarm ...
08:24:12.827 Destroying Cloud Watch Alarm boxasg-jtheuer-production-main-api-2355-scaleout-policy-alarm ...
08:24:12.827 Destroying Auto Scaling Policy boxasg-jtheuer-production-main-api-2355-scalein-policy ...
08:24:13.981 Destroying Auto Scaling Policy boxasg-jtheuer-production-main-api-2355-scaleout-policy ...
08:24:13.981 Destroying Auto Scaling Group boxasg-jtheuer-production-main-api-2355 ...
08:24:38.567 Destroying Launch Template boxlt-jtheuer-production-main-api-2355 ...
08:24:38.567 Destroying Launch Template boxlt-jtheuer-production-main-api-2355 ...
08:24:38.567 Destroying Launch Template boxlt-jtheuer-production-main-api-2345 ...
axelfontaine commented 5 years ago

Could it be that 3 of the previous instances were no longer in the running state and were perhaps already terminating due to a scale-in event?

emilburzo commented 5 years ago

That's very unlikely; traffic was on a (slowly) increasing trend.

axelfontaine commented 5 years ago

Take 2: for some reason both the 2355 and 2345 ASGs still existed, and our code took the capacity from the 2345 ASG instead of the 2355 ASG for the new 2359 ASG.

emilburzo commented 5 years ago

Very interesting. Is it possible to work around such situations?

emilburzo commented 5 years ago

This occurred again today. Any updates on this?

jtheuer commented 5 years ago

Yeah, I guess there was a failed earlier deploy. Any chance of getting that fixed? For example, by taking the capacity from the currently registered ASG (assuming that there is only one per ELB)? Alternatively, if both ASGs were still registered, use the sum of all their instances, which would be safer than just the instances of one ASG.
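
To make the second suggestion concrete, here is a minimal sketch of summing the in-service instances across every leftover ASG and clamping the result to the configured bounds. It is not CloudCaptain's actual code: `OldGroup` and `initial_capacity` are hypothetical names, and the instance count of 0 for the 2345 group is an assumption inferred from the log above, which lists no terminating instances for it.

```python
from dataclasses import dataclass


@dataclass
class OldGroup:
    """Hypothetical snapshot of a previous Auto Scaling Group."""
    name: str
    in_service: int  # number of instances currently InService in this group


def initial_capacity(old_groups, min_size, max_size):
    """Sum InService instances over all leftover groups, clamped to [min_size, max_size]."""
    total = sum(group.in_service for group in old_groups)
    return max(min_size, min(max_size, total))


# Assumed state at deploy time: the stale 2345 group is empty,
# while the 2355 group carries the 5 running instances.
groups = [
    OldGroup("boxasg-jtheuer-production-main-api-2345", 0),
    OldGroup("boxasg-jtheuer-production-main-api-2355", 5),
]

print(initial_capacity(groups, min_size=2, max_size=20))  # prints 5
```

With this approach a stale, empty group cannot drag the new group's starting capacity down to the minimum, because the populated group always contributes its instances to the sum.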