aws-solutions / instance-scheduler-on-aws

A cross-account and cross-region solution that allows customers to automatically start and stop EC2 and RDS Instances
https://aws.amazon.com/solutions/implementations/instance-scheduler-on-aws/
Apache License 2.0

General questions regarding version 3.0.0 #550

Closed — shudhanshu03 closed this issue 2 months ago

shudhanshu03 commented 2 months ago

Hi Team,

I have been evaluating the solution by upgrading the stacks from version 1.5.X to 3.0.0. I have a few questions:

  1. Where can I find the logs for the scheduling mechanism for ASG, similar to how I can see the logs for EC2 and RDS? For EC2 and RDS, I see the logs under SchedulerLogGroup, formatted as <Stack-Name>-logs with the stream format Scheduler-<service>-<account>-<region>-yyyymmdd.

  2. I noticed three log groups related to ASG resources: <stack-name>-ASGHandler, <stack-name>-ASGSchedulerASGOrchestra, and <stack-name>-ASGSchedulerScheduleUpda. Could you please provide insights into these log groups? This information will help me troubleshoot any scheduling issues for ASG resources.

  3. I conducted scenario-based testing for ASG resources by manually updating the configuration (min, max, desired) to observe how the solution handles these changes during scale-out. The test worked fine; during the next scale-out the ASG was configured with the latest settings and the scheduled action was updated. However, I couldn't find any logs detailing these configuration changes and the internal workings of ASG resources in my hub account. Could you help me locate these logs?

CrypticCabub commented 2 months ago

Hi @shudhanshu03, I believe all the logs you are looking for are under the ASGHandler's Lambda log group and the activity history on the ASG itself. Because scheduling is delegated to ASG Scheduled Scaling Rules, there are no scheduling-decision logs in CloudWatch like you would see for EC2/RDS, but the ASG service itself provides a fairly detailed audit trail for changes made to the ASG.
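
For example, that audit trail can be pulled programmatically with the standard Auto Scaling API (a minimal boto3 sketch; the group name and region are placeholders you would replace with your own):

```python
import boto3

# Placeholder values -- substitute your own ASG name and region.
ASG_NAME = "my-scheduled-asg"

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# The ASG's activity history records every scaling event, including those
# driven by the scheduled scaling actions that Instance Scheduler creates.
response = autoscaling.describe_scaling_activities(
    AutoScalingGroupName=ASG_NAME,
    MaxRecords=20,
)

for activity in response["Activities"]:
    print(activity["StartTime"], activity["StatusCode"], activity["Description"])
    print("  cause:", activity["Cause"])
```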

For the three log groups you mentioned, the ASGHandler logs are where you will find the actions taken for scheduled scaling rules. The orchestrator and schedule-update handlers are responsible for kicking off ASGHandler calls for the purpose of ASG scheduling.

shudhanshu03 commented 2 months ago

Hi @CrypticCabub, thanks for the reply!

I now understand the purpose of each log group. I have one quick question: As you mentioned, most of the logs are under the ASGHandler's Lambda log group. I checked it and found logs that describe all the ASGs of the spoke accounts for each region, along with the actions taken based on the configured schedules. However, I don't see any logs detailing the min, max, and desired configurations of the ASGs.

For example, if a user from a spoke account manually changes the ASG configuration (min, max, desired), the solution will use the last running configuration in the next scheduled period, which I have tested and confirmed. But I don't see logs for these configuration changes in my hub account under the ASGHandler's log group. Is this expected behavior? If so, that's fine. I hope you understand my concern.

CrypticCabub commented 2 months ago

Instance Scheduler does not currently have a mechanism to detect out-of-band changes to the ASG's scheduled scaling rules. Manual changes to these would need to be tracked through the activity log on the ASG itself.
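
If it helps, the scheduled scaling rules that are currently configured on an ASG can be listed directly with the Auto Scaling API (a rough boto3 sketch; the group name is a placeholder):

```python
import boto3

ASG_NAME = "my-scheduled-asg"  # placeholder

autoscaling = boto3.client("autoscaling")

# Lists the scheduled actions currently attached to the ASG, including the
# min/max/desired sizes and recurrence that each action will apply.
response = autoscaling.describe_scheduled_actions(AutoScalingGroupName=ASG_NAME)

for action in response["ScheduledUpdateGroupActions"]:
    print(
        action["ScheduledActionName"],
        action.get("Recurrence"),
        action.get("MinSize"),
        action.get("DesiredCapacity"),
        action.get("MaxSize"),
    )
```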

It does look like we missed including the min-desired-max values in the logs when configuring the schedule. I'll add this to the backlog for our next patch release. Please let us know if there's any other log information you would like to be able to see in the ASG schedule logs.

shudhanshu03 commented 2 months ago

Thank you @CrypticCabub for considering the request for ASG configuration logging during scheduling.

Upon further evaluation of v3.0.0, I have identified the following:

1. Scheduled actions for an ASG are created when the resource is scheduled. However, when the 'Schedule' tag is removed from the ASG, those scheduled actions are not deleted. This causes the resource to start even though the 'Schedule' key is no longer present. Ideally, removing the 'Schedule' tag from an ASG should also delete the scheduled actions associated with that ASG.

2. As mentioned in my previous comment, if a user manually changes the configuration during the scheduling period, the EC2 changes are reflected immediately based on the min, max, and desired values. However, once the scheduling period ends, the latest configuration changes made by the user in the console are not captured. It is unclear whether this is a bug or a potential future enhancement. Ideally, the system should capture the latest configuration, and the next scheduling period should reflect these changes.

If you require any scenario-based examples to elaborate on point 2, please let me know, and I will provide further explanation.

CrypticCabub commented 2 months ago

Point 2 was an intentional design decision due to the fundamental problem of determining when a configuration change should be considered permanent versus ephemeral. This can be revisited in the future with a clearer definition of when state transitions should occur, but Instance Scheduler typically errs on the side of not interfering with manual customer action rather than responding proactively.

The same logic was applied to point 1: we were not sure whether the desired behavior would be to auto-purge all configured actions or to leave that decision up to the operator, so we elected to go with the simpler architecture for the first implementation. If auto-purging these actions on tag deletion is the universally desired behavior, we can add it to the backlog as a feature extension for ASG scheduling.
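
In the meantime, if you do want to clean up the actions after untagging an ASG, they can be removed manually (a hedged sketch using the public Auto Scaling API; it deletes every scheduled action on the group, so filter the names first if you also have actions that Instance Scheduler did not create):

```python
import boto3

ASG_NAME = "my-untagged-asg"  # placeholder

autoscaling = boto3.client("autoscaling")

# Collect the names of all scheduled actions currently on the ASG.
actions = autoscaling.describe_scheduled_actions(AutoScalingGroupName=ASG_NAME)
names = [a["ScheduledActionName"] for a in actions["ScheduledUpdateGroupActions"]]

if names:
    # Remove them in one call; inspect FailedScheduledActions for any misses.
    result = autoscaling.batch_delete_scheduled_action(
        AutoScalingGroupName=ASG_NAME,
        ScheduledActionNames=names,
    )
    print("failed:", result.get("FailedScheduledActions", []))
```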

CrypticCabub commented 2 months ago

ASG logging has been updated in v3.0.1

To clarify a little more on the way the ASG scheduler interacts with the min-desired-max values on your Autoscaling Groups:

Once an hour, Instance Scheduler will scan your accounts for newly tagged ASGs and for ASGs whose current schedule configurations have expired and need updating (scheduled-scaling actions are considered accurate for one month). When an ASG that needs its schedules updated is identified, Instance Scheduler will then determine the Scheduled Scaling Actions to modify and the min-desired-max values to use for those actions. This determination involves two parts:

If the current min-desired-max value of the ASG is valid (not 0-0-0), Instance Scheduler will default to using this value for all running periods in the schedule. If, however, this value is invalid and a meta tag is present, Instance Scheduler will fall back on the min-desired-max values that were used the last time schedules were configured for this ASG. If both of these are unavailable, Instance Scheduler will throw an error that the ASG is not in a schedulable state.
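
In rough pseudocode, that selection logic looks something like the following (a paraphrase of the behavior described above, not the actual implementation):

```python
def resolve_running_size(current_min, current_desired, current_max, last_used_size=None):
    """Pick the min-desired-max to apply to a schedule's running periods.

    current_* come from the ASG itself; last_used_size is the value recorded
    the last time schedules were configured (e.g. via the meta tag), or None.
    """
    if (current_min, current_desired, current_max) != (0, 0, 0):
        # Current configuration is valid -- use it for all running periods.
        return current_min, current_desired, current_max
    if last_used_size is not None:
        # Fall back on the previously recorded min-desired-max values.
        return last_used_size
    # Neither source is usable: the ASG is not in a schedulable state.
    raise ValueError("ASG is not in a schedulable state")
```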

The majority of the control flow is managed by the "scheduled" metadata tag. As such, you can force Instance Scheduler to update the scheduled actions on an ASG outside of its normal control flow by deleting the scheduled tag from the ASG (this tag is different from the schedule tag that specifies the schedule to use). This will make the ASG look like it was newly tagged and will trigger an update of the ASG's Scheduled Scaling Rules at the next hourly scan.
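
For reference, removing that tag can be done from the console or with a call like the following (a sketch; the exact tag key may vary with how your stack is configured, so check the tags on a managed ASG first):

```python
import boto3

ASG_NAME = "my-scheduled-asg"  # placeholder

autoscaling = boto3.client("autoscaling")

# Deleting the "scheduled" metadata tag (not the "Schedule" tag) makes the ASG
# look newly tagged, so its Scheduled Scaling Rules are rebuilt on the next
# hourly scan.
autoscaling.delete_tags(
    Tags=[
        {
            "ResourceId": ASG_NAME,
            "ResourceType": "auto-scaling-group",
            "Key": "scheduled",
        }
    ]
)
```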

Outside of this hourly scan for updates, Instance Scheduler will also automatically trigger an update when schedules are modified in your schedule config table to ensure that your tagged ASGs always match the schedules you have configured.

Critically, Instance Scheduler does not attempt to monitor manual alterations to your ASG's min-desired-max values or its Scheduled Scaling Rules that happen outside of its normal control flow.