aws-solutions / instance-scheduler-on-aws

A cross-account and cross-region solution that allows customers to automatically start and stop EC2 and RDS Instances
https://aws.amazon.com/solutions/implementations/instance-scheduler-on-aws/
Apache License 2.0

Schedule stop working #559

Closed pmerisio closed 5 days ago

pmerisio commented 2 weeks ago

Describe the bug

Suddenly, without any change, Instance Scheduler stopped working: out of a total of 26 instances, only 3 are started correctly in the morning. All 26 are still shut down correctly in the evening. As a workaround (suggested by AWS Support), we removed the state table entry for one of the instances that was not being started, and the next day that instance was started correctly. The state table has more than 15,000 entries even though we have fewer than 100 EC2 instances running. The last entries of the state table start with: "purge_in_next_cleanup": { "SS": [ followed by more than 2,000 instance IDs.
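For anyone who wants to script the workaround described above, here is a minimal sketch in Python. It assumes that each instance's state is stored as an attribute named after the instance ID on a DynamoDB item; that layout, the table name, and the key attributes in the usage note are assumptions that must be checked against your own state table. The function names `build_remove_attribute_request` and `remove_instance_state` are hypothetical helpers, not part of the solution. If your table stores one item per instance, `delete_item` would be the call to use instead.

```python
def build_remove_attribute_request(key, instance_id):
    """Build keyword arguments for boto3's Table.update_item that drop
    one attribute (named after the instance ID) from a single item.

    `key` is the primary key of the item holding the state; its attribute
    names depend on your table and are NOT taken from the issue.
    """
    return {
        "Key": key,
        # Instance IDs contain hyphens, so the attribute name has to go
        # through an expression attribute name placeholder.
        "UpdateExpression": "REMOVE #iid",
        "ExpressionAttributeNames": {"#iid": instance_id},
        # Return the removed value so it can be logged or restored.
        "ReturnValues": "UPDATED_OLD",
    }


def remove_instance_state(table_name, key, instance_id):
    """Apply the removal to a real table (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works offline

    table = boto3.resource("dynamodb").Table(table_name)
    return table.update_item(**build_remove_attribute_request(key, instance_id))
```

A call might look like `remove_instance_state("my-state-table", {"service": "ec2", "account-region": "111111111111:eu-west-1"}, "i-0123456789abcdef0")`, where the table name and key are placeholders. Inspect the item in the DynamoDB console first and back up the table before removing anything.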

To Reproduce

no idea

Expected behavior

all 26 instances must be started at 8:00 AM

Please complete the following information about the solution:

To get the version of the solution, you can look at the description of the created CloudFormation stack. For example, "(SO0030) instance-scheduler-on-aws v1.5.1". You can also find the version from releases

Screenshots If applicable, add screenshots to help explain your problem (please DO NOT include sensitive information).

Additional context

In the log Scheduler-ec2-961138174181-eu-west-1-20240704 (for example), we can see that the line

"Listing instance EC2:i-***** (instance-name) in region eu-west-1 with instance type m5.large to be started by scheduler"

is missing for the EC2 instances that are not started (see the log excerpt below)

---> 2024-07-04T08:00:22.377+02:00 DEBUG : [ Instance EC2:i-07f359ee045727796 (eai-test-ec2-integration-server) ]
2024-07-04T08:00:22.377+02:00 DEBUG : Current state is stopped, instance type is i4i.xlarge, schedule is "it-office-hours-ec2"
2024-07-04T08:00:22.377+02:00 INFO : Maintenance window "patch-policies-noprod" used as running period found for instance i-07f359ee045727796
2024-07-04T08:00:22.377+02:00 DEBUG : Time used to determine desired for instance is Thu Jul 4 08:00:22 2024
2024-07-04T08:00:22.377+02:00 DEBUG : Checking conditions for period "patch-policies-noprod-period-1"
2024-07-04T08:00:22.377+02:00 DEBUG : [running] Month "jul" in months (jul)
2024-07-04T08:00:22.377+02:00 DEBUG : [stopped] Day of month 4 not in month days (17)
2024-07-04T08:00:22.377+02:00 DEBUG : Checking conditions for period "patch-policies-noprod-period-2"
2024-07-04T08:00:22.377+02:00 DEBUG : [running] Month "jul" in months (jul)
2024-07-04T08:00:22.377+02:00 DEBUG : [stopped] Day of month 4 not in month days (18)
2024-07-04T08:00:22.377+02:00 DEBUG : Checking for adjacent running periods at current time
2024-07-04T08:00:22.377+02:00 DEBUG : Checking states for previous minute
2024-07-04T08:00:22.377+02:00 DEBUG : Checking conditions for period "patch-policies-noprod-period-1"
2024-07-04T08:00:22.378+02:00 DEBUG : [running] Month "jul" in months (jul)
2024-07-04T08:00:22.378+02:00 DEBUG : [stopped] Day of month 4 not in month days (17)
2024-07-04T08:00:22.378+02:00 DEBUG : Checking conditions for period "patch-policies-noprod-period-2"
2024-07-04T08:00:22.378+02:00 DEBUG : [running] Month "jul" in months (jul)
2024-07-04T08:00:22.378+02:00 DEBUG : [stopped] Day of month 4 not in month days (18)
2024-07-04T08:00:22.378+02:00 DEBUG : Running period(s) for previous minute
2024-07-04T08:00:22.378+02:00 DEBUG : No running periods at this time found in schedule "patch-policies-noprod" for this time, desired state is stopped
2024-07-04T08:00:22.378+02:00 DEBUG : Time used to determine desired for instance is Thu Jul 4 08:00:05 2024
2024-07-04T08:00:22.378+02:00 DEBUG : Checking conditions for period "it-office-hours-ec2"
2024-07-04T08:00:22.456+02:00 DEBUG : [running] Weekday "thu" in weekdays (mon-fri)
2024-07-04T08:00:22.456+02:00 DEBUG : [running] Time 08:00:05 is within 08:00:00-20:00:00, returned state is running
2024-07-04T08:00:22.456+02:00 DEBUG : Active period in schedule "it-office-hours-ec2": "it-office-hours-ec2"
2024-07-04T08:00:22.456+02:00 DEBUG : Desired state for instance from schedule "it-office-hours-ec2" is running, last desired state was running, actual state is stopped
---> 2024-07-04T08:00:22.456+02:00 DEBUG : [ Instance EC2:i-0795830b7bd145911 (observability-poc-eks-node-group-linux) ]
2024-07-04T08:00:22.456+02:00 DEBUG : Current state is stopped, instance type is m5.large, schedule is "it-office-hours-ec2"
2024-07-04T08:00:22.456+02:00 INFO : Maintenance window "patch-policies-noprod" used as running period found for instance i-0795830b7bd145911
2024-07-04T08:00:22.456+02:00 DEBUG : Time used to determine desired for instance is Thu Jul 4 08:00:22 2024
2024-07-04T08:00:22.456+02:00 DEBUG : Checking conditions for period "patch-policies-noprod-period-1"
2024-07-04T08:00:22.456+02:00 DEBUG : [running] Month "jul" in months (jul)
2024-07-04T08:00:22.456+02:00 DEBUG : [stopped] Day of month 4 not in month days (17)
2024-07-04T08:00:22.456+02:00 DEBUG : Checking conditions for period "patch-policies-noprod-period-2"
2024-07-04T08:00:22.456+02:00 DEBUG : [running] Month "jul" in months (jul)
2024-07-04T08:00:22.456+02:00 DEBUG : [stopped] Day of month 4 not in month days (18)
2024-07-04T08:00:22.456+02:00 DEBUG : Checking for adjacent running periods at current time
2024-07-04T08:00:22.456+02:00 DEBUG : Checking states for previous minute
2024-07-04T08:00:22.456+02:00 DEBUG : Checking conditions for period "patch-policies-noprod-period-1"
2024-07-04T08:00:22.456+02:00 DEBUG : [running] Month "jul" in months (jul)
2024-07-04T08:00:22.456+02:00 DEBUG : [stopped] Day of month 4 not in month days (17)
2024-07-04T08:00:22.456+02:00 DEBUG : Checking conditions for period "patch-policies-noprod-period-2"
2024-07-04T08:00:22.456+02:00 DEBUG : [running] Month "jul" in months (jul)
2024-07-04T08:00:22.456+02:00 DEBUG : [stopped] Day of month 4 not in month days (18)
2024-07-04T08:00:22.456+02:00 DEBUG : Running period(s) for previous minute
2024-07-04T08:00:22.456+02:00 DEBUG : No running periods at this time found in schedule "patch-policies-noprod" for this time, desired state is stopped
2024-07-04T08:00:22.456+02:00 DEBUG : Time used to determine desired for instance is Thu Jul 4 08:00:05 2024
2024-07-04T08:00:22.457+02:00 DEBUG : Checking conditions for period "it-office-hours-ec2"
2024-07-04T08:00:22.457+02:00 DEBUG : [running] Weekday "thu" in weekdays (mon-fri)
2024-07-04T08:00:22.457+02:00 DEBUG : [running] Time 08:00:05 is within 08:00:00-20:00:00, returned state is running
2024-07-04T08:00:22.457+02:00 DEBUG : Active period in schedule "it-office-hours-ec2": "it-office-hours-ec2"
2024-07-04T08:00:22.457+02:00 DEBUG : Desired state for instance from schedule "it-office-hours-ec2" is running, last desired state was unknown, actual state is stopped
---> 2024-07-04T08:00:22.457+02:00 DEBUG : Listing instance EC2:i-0795830b7bd145911 (observability-poc-eks-node-group-linux) in region eu-west-1 with instance type m5.large to be started by scheduler

2024-07-04T08:00:22.457+02:00   INFO : Starting instances EC2:i-074b37b2c9f14ae24 (observability-poc-eks-node-group-linux), EC2:i-019e2cdeea2241ea7 (observability-poc-eks-node-group-linux), EC2:i-0d12885f5384fec21 (observability-poc-eks-node-group-linux), EC2:i-05456a291c2addef4 (observability-poc-eks-node-group-linux), EC2:i-029ff52eb27593a82 (observability-poc-eks-node-group-linux), EC2:i-0795830b7bd145911 (observability-poc-eks-node-group-linux) in region eu-west-1
2024-07-04T08:00:24.388+02:00   INFO : Cleaning up instance registry.
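
With 26 instances, checking by eye for the missing "Listing instance ... to be started by scheduler" line is tedious. The following Python sketch scans a log export for instances whose desired state is running and actual state is stopped but that were never listed for start. The regular expressions are modelled only on the 1.5.0 log lines quoted above and may need adjusting for other versions; the function names `summarize` and `missed_starts` are hypothetical. In the excerpt, the skipped instance reports "last desired state was running" while the started one reports "last desired state was unknown", so the sketch records that field too.

```python
import re

# Patterns modelled on the log lines quoted in this issue (assumption:
# other scheduler versions may word these messages differently).
HEADER = re.compile(r"\[ Instance EC2:(i-[0-9a-f]+) \(([^)]*)\) \]")
DESIRED = re.compile(
    r'Desired state for instance from schedule "[^"]+" is (\w+), '
    r"last desired state was (\w+), actual state is (\w+)"
)
LISTED = re.compile(
    r"Listing instance EC2:(i-[0-9a-f]+) \([^)]*\) in region \S+ "
    r"with instance type \S+ to be started by scheduler"
)


def _blank(name=None):
    return {"name": name, "desired": None, "last_desired": None,
            "actual": None, "listed_for_start": False}


def summarize(log_text):
    """Return {instance_id: info} for every instance seen in the log.

    Works on line-separated logs and on logs pasted as one long string,
    because events are ordered by their position in the text.
    """
    events = []
    for kind, pattern in (("header", HEADER), ("desired", DESIRED),
                          ("listed", LISTED)):
        events.extend((m.start(), kind, m) for m in pattern.finditer(log_text))
    events.sort(key=lambda event: event[0])

    result = {}
    current = None  # instance whose block is currently being read
    for _, kind, match in events:
        if kind == "header":
            current = match.group(1)
            result[current] = _blank(match.group(2))
        elif kind == "desired" and current is not None:
            result[current].update(desired=match.group(1),
                                   last_desired=match.group(2),
                                   actual=match.group(3))
        elif kind == "listed":
            result.setdefault(match.group(1), _blank())["listed_for_start"] = True
    return result


def missed_starts(log_text):
    """IDs that should be running, are stopped, and were not listed."""
    return [iid for iid, info in summarize(log_text).items()
            if info["desired"] == "running" and info["actual"] == "stopped"
            and not info["listed_for_start"]]
```

Passing the exported log text to `missed_starts` gives the instance IDs to look up in the state table, and `summarize(...)[instance_id]["last_desired"]` shows the last desired state the scheduler logged for each one.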
CrypticCabub commented 2 weeks ago

The logs provided don't appear to indicate anything out of the ordinary, but I do see that you are running Instance Scheduler 1.5.0 when the latest release is 3.0.1. If possible, I would recommend updating to the latest version and seeing whether one of the many bug fixes that have occurred since 1.5.0 resolves the issue for you (at the very least, logging is improved in 3.x and should provide much clearer indications of what exactly is going on).

Update instructions for the solution are available here: https://docs.aws.amazon.com/solutions/latest/instance-scheduler-on-aws/update-the-solution.html

CrypticCabub commented 5 days ago

Closing this. Please reopen if you are still having issues and the solution provided above does not work for you.