aws-solutions / instance-scheduler-on-aws

A cross-account and cross-region solution that allows customers to automatically start and stop EC2 and RDS Instances
https://aws.amazon.com/solutions/implementations/instance-scheduler-on-aws/
Apache License 2.0

Instance Scheduler doesn’t stop RDS instance after MaintenanceWindow is completed. #384

Open natalia-1702 opened 1 year ago

natalia-1702 commented 1 year ago

Instance Scheduler doesn’t stop instance after MaintenanceWindow is completed.

I have deployed the main CloudFormation stack into a central account and the remote stack into all other (workload) accounts in my AWS Organisation. All RDS instances that I am scheduling are hosted in workload accounts. The regular schedules I create work perfectly.

According to the documentation, Instance Scheduler can start and stop RDS instances for maintenance windows (MW) automatically. It starts instances at the start of the MW and stops them at the end of the MW.

But in my case this feature doesn’t work in 50% of cases.

During the MW, RDS instances are started, but for some reason they are not stopped by Instance Scheduler. I don’t see any errors in the log files. There is simply nothing, no event at all.

For example, the MW for the TEST-RDS-INSTANCE instance was 25/02/2023 00:00 - 00:30. The RDS instance was started at 00:00, but it was not stopped at 00:30.


25/02/2023  00:00 INFO    : Maintenance window "RDS preferred Maintenance Window Schedule" used as running period found for instance TEST-RDS-INSTANCE

25/02/2023  00:00 DEBUG   : Listing instance RDS:TEST-RDS-INSTANCE (TEST-RDS-INSTANCE) in region ap-southeast-2 with instance type db.t3.small to be started by scheduler

25/02/2023  00:00 INFO    : Adding start tags [{'Key': 'status', 'Value': 'started'}] to instance arn:aws:rds:ap-southeast-2:<accountnumber>:db:TEST-RDS-INSTANCE

25/02/2023  00:00 INFO    : Starting instances RDS:TEST-RDS-INSTANCE (TEST-RDS-INSTANCE) in region ap-southeast-2

25/02/2023  00:00 INFO    : Scheduler result {'<accountnumber>': {'started': {'ap-southeast-2': [{'TEST-RDS-INSTANCE': {'schedule': 'rds-mon-fri0800-2300-sat-sunxxxx-2000-Brisbane'}}..........]}, 'stopped': {}}}

#DEBUG_SKIPPING_INSTANCE message
#https://github.com/aws-solutions/aws-instance-scheduler/blob/main/source/lambda/schedulers/rds_service.py
#it looks like for some reason "IS" was trying to stop it at 00:12 instead of 00:30
25/02/2023  00:12 DEBUG   : Skipping rds instance TEST-RDS-INSTANCE because it is not in a start or stop-able state (maintenance)

#25/02/2023  00:30 
#there is no "Stopping instances" event for this instance at 25/02/2023 00:30

Worth mentioning:

natalia-1702 commented 1 year ago

More information. Link to the documentation: https://s3.amazonaws.com/solutions-reference/aws-instance-scheduler/latest/instance-scheduler.pdf


Example of a schedule that I use:

 RdsMonFri08002300SatSunxxxx2000Brisbane:
    Type: 'Custom::ServiceInstanceSchedule'
    Properties:
      Name: 'rds-mon-fri0800-2300-sat-sunxxxx-2000-Brisbane'
      NoStackPrefix: 'True'
      Description: "Some description"
      ServiceToken: >-
        arn:aws:lambda:ap-southeast-2:<accountnumber>:function:infrastructure-instance-scheduler-InstanceSchedulerMain
      Timezone: Australia/Brisbane
      UseMaintenanceWindow: 'True'
      Periods:
        - Description: mon-fri0800-2300
          BeginTime: '08:00'
          EndTime: '23:00'
          WeekDays: Mon-Fri
        - Description: sat-sunxxxx-2000
          EndTime: '20:00'
          WeekDays: Sat-Sun 

Region: ap-southeast-2 (the only region I use)

natalia-1702 commented 1 year ago

What we noticed:

It happens only with instances whose schedules contain a period without a start time. For example, this RDS instance has the following settings:

tag: schedule rds-mon-fri0800-2300-sat-sunxxxx-2000-Brisbane

RdsMonFri08002300SatSunxxxx2000Brisbane:
   Type: 'Custom::ServiceInstanceSchedule'
   Properties:
     Name: 'rds-mon-fri0800-2300-sat-sunxxxx-2000-Brisbane'
     NoStackPrefix: 'True'
     Description: "Some description"
     ServiceToken: >-
       arn:aws:lambda:ap-southeast-2:<accountnumber>:function:infrastructure-instance-scheduler-InstanceSchedulerMain
     Timezone: Australia/Brisbane
     UseMaintenanceWindow: 'True'
     Periods:
       - Description: mon-fri0800-2300
         BeginTime: '08:00'
         EndTime: '23:00'
         WeekDays: Mon-Fri
       - Description: sat-sunxxxx-2000
         EndTime: '20:00'
         WeekDays: Sat-Sun 

Maintenance window: Every Saturday 02:00 - 02:30 UTC+11

What we expect on Saturday: the instance is started at 02:00 (MW) and stopped at 02:30 (MW). After that, it is stopped at 20:00 (Brisbane time) if someone starts it after 02:30.

Instead, we get this: the instance is started at 02:00 (MW start) and is stopped at 20:00 (Brisbane time) by the period, so it runs for 18 hours.

If we move the MW so that it falls after the stop time, it works properly.

Is this behaviour a bug or a feature?

CrypticCabub commented 1 year ago

Hi @natalia-1702. The behavior you describe would be expected behavior that emerges from the current scheduling logic.

Instance Scheduler looks at each day individually, and a 1-sided stop schedule has the effect of splitting a day into two parts: an "any" period (in which the instance can be in any state) and a "stopped" period (in which the instance is expected to be off). Adding a maintenance window injects an extra running period into the middle of the existing schedule, which results in a schedule that looks something like this:

any -- running -- any -- stopped

The result is exactly the behavior you describe. Whether this is a bug or a feature is open to some interpretation, but I will bring it up with the rest of the team and add it as a backlog item for evaluation.
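For illustration, the per-day evaluation described above can be sketched roughly as follows. This is a simplified model, not the solution's actual code: the names `Period`, `period_state`, `desired_state` and `timeline` are hypothetical, and the rule that "running" wins over "any", which in turn wins over "stopped", is an assumption made to reproduce the timeline shown above. Time zone handling is ignored.

```python
from datetime import time
from typing import NamedTuple, Optional


class Period(NamedTuple):
    """Hypothetical model of a period; begin or end may be missing (1-sided)."""
    begin: Optional[time]
    end: Optional[time]


def period_state(p: Period, t: time) -> str:
    """State a single period asks for at time t (assumed rules)."""
    if p.begin is not None and p.end is not None:
        return "running" if p.begin <= t < p.end else "stopped"
    if p.begin is not None:  # start only
        return "running" if t >= p.begin else "stopped"
    if p.end is not None:    # stop only: no opinion before the stop time
        return "stopped" if t >= p.end else "any"
    return "running"         # no bounds: the whole day


# Assumption: "running" beats "any", which beats "stopped".
PRIORITY = {"running": 2, "any": 1, "stopped": 0}


def desired_state(periods: list, t: time) -> str:
    """Combine all periods of the day (needs at least one period)."""
    return max((period_state(p, t) for p in periods), key=PRIORITY.__getitem__)


def timeline(periods: list, step_minutes: int = 30) -> list:
    """Sample one day and collapse consecutive identical states."""
    states = []
    for minute in range(0, 24 * 60, step_minutes):
        state = desired_state(periods, time(minute // 60, minute % 60))
        if not states or states[-1] != state:
            states.append(state)
    return states


# Saturday from the issue: stop-only period at 20:00 plus a 02:00-02:30 MW.
saturday = [Period(None, time(20, 0)), Period(time(2, 0), time(2, 30))]
print(" -- ".join(timeline(saturday)))  # any -- running -- any -- stopped

# Same stop-only period, but with the MW moved after the stop time.
moved = [Period(None, time(20, 0)), Period(time(21, 0), time(21, 30))]
print(" -- ".join(timeline(moved)))     # any -- stopped -- running -- stopped
```

In this model, the desired state after the maintenance window falls back to "any", so nothing requests a stop until 20:00. When the window falls after the stop time, the state after it is "stopped", which is consistent with the observation that moving the MW makes the instance stop as expected.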

natalia-1702 commented 1 year ago

@CrypticCabub I get it, thank you.

To be honest, it would be really good to be able to have schedules like this: Monday-Sunday, 1st period: starts at 6am, stops at 6pm; 2nd period: no start time, but stops at 10pm. I believe it is a very useful feature. If someone from the team powers on the instance between 6pm and 10pm, it will be stopped anyway at 10pm to save costs.

Thank you.

matt-it-guy commented 5 months ago

Any updates on this issue? We are seeing the same behaviour described in this ticket. We are on version 1.5.6. For example, we have a database that turns on for maintenance, and it stays on until the end time instead of turning off after maintenance.

CrypticCabub commented 5 months ago

Hi @matt-it-guy -- do you also have the same scenario as I discussed with natalia above? (a 1-sided start/stop period on the same day as the maintenance window which causes an overlap).

That particular issue is a fundamental issue with the current implementation of 1-sided periods. The team is considering potential fixes but has not settled on a satisfactory approach at present.

However, if your scenario is different from the overlapping 1-sided period scenario, please let us know the specific configuration of your schedule so that we can dig into the issue further.

matt-it-guy commented 5 months ago

Thanks for your response. We have the same scenario as Natalia.