aws-solutions / instance-scheduler-on-aws

A cross-account and cross-region solution that allows customers to automatically start and stop EC2 and RDS Instances
https://aws.amazon.com/solutions/implementations/instance-scheduler-on-aws/
Apache License 2.0

Error in logs after enabling maintenance windows in v3.0.0 #553

Closed grahamwright19 closed 2 months ago

grahamwright19 commented 2 months ago

I have just upgraded the solution to v3.0.0 now that there is support for multiple maintenance windows in the same schedule, but I am seeing a lot of errors in the Lambda logs and there is no data in the Maintenance Windows DynamoDB table. Here is the error from the Lambda logs:

('NextExecutionTime')
Traceback (most recent call last):
  File "/var/task/instance_scheduler/handler/scheduling_request.py", line 128, in handle_scheduling_request
    return handler.handle_request()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/instance_scheduler/handler/scheduling_request.py", line 225, in handle_request
    result: Final = {scheduling_context.account_id: scheduler.run()}
                                                    ^^^^^^^^^^^^^^^
  File "/var/task/instance_scheduler/schedulers/instance_scheduler.py", line 59, in run
    result = self._run_scheduler()
             ^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/instance_scheduler/schedulers/instance_scheduler.py", line 121, in _run_scheduler
    for decision in self.make_scheduling_decisions(
  File "/var/task/instance_scheduler/schedulers/instance_scheduler.py", line 163, in make_scheduling_decisions
    for instance in instances:
  File "/var/task/instance_scheduler/service/ec2.py", line 124, in describe_tagged_instances
    ec2_instance = self._select_instance_data(instance)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/instance_scheduler/service/ec2.py", line 144, in _select_instance_data
    maint_windows = self._fetch_mw_schedules_for(schedule)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/instance_scheduler/service/ec2.py", line 166, in _fetch_mw_schedules_for
    maint_windows.extend(self.mw_context.find_by_name(requested_mw_name))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/instance_scheduler/maint_win/maintenance_window_context.py", line 140, in find_by_name
    self.reconcile_ssm_with_dynamodb()
  File "/var/task/instance_scheduler/maint_win/maintenance_window_context.py", line 76, in reconcile_ssm_with_dynamodb
    filtered_ssm_data = _collect_by_nameid(
                        ^^^^^^^^^^^^^^^^^^^
  File "/var/task/instance_scheduler/maint_win/maintenance_window_context.py", line 196, in _collect_by_nameid
    return {mw.name_id: mw for mw in maintenance_windows}
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/instance_scheduler/maint_win/maintenance_window_context.py", line 196, in
    return {mw.name_id: mw for mw in maintenance_windows}
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/instance_scheduler/maint_win/maintenance_window_context.py", line 157, in filter_by_windows_defined_in_schedules
    for window in raw_windows:
  File "/var/task/instance_scheduler/maint_win/ssm_mw_client.py", line 32, in get_mws_from_ssm
    yield EC2SSMMaintenanceWindow.from_identity(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/instance_scheduler/model/maint_win.py", line 278, in from_identity
    next_execution_time=isoparse(identity["NextExecutionTime"]),
KeyError: 'NextExecutionTime'

Here is an example of one of the schedules:

"name": "infosec-cert-weekends",
"periods": "infosec-cert-weekends-period-0003,infosec-cert-weekends-period-0001,infosec-cert-weekends-period-0002",
"timezone": "UTC",
"description": "start instances at 1pm utc on fridaye and stop them at 1pm utc on monday",
"ssm_maintenance_window": [
"np-lsc-host-ssm-us-east-1-az1-patch-window",
"np-lsc-host-ssm-us-east-1-az2-patch-window"
]

I am not sure where the issue lies, but I have only started seeing errors since enabling the maintenance windows feature.
Thanks

grahamwright19 commented 2 months ago

Some more information I have just found: the role that is configured to get the maintenance window data from SSM hasn't tried accessing SSM, as IAM shows no activity.

aws-khargita commented 2 months ago

Hi @grahamwright19 thank you for opening this issue.

I have identified the cause: the describe_maintenance_windows endpoint called in the SSMMWClient does not return a NextExecutionTime under certain circumstances. This can occur if the window's end date has passed or if the configuration of the maintenance window results in it never becoming active.

As for:

Some more information I have just found: the role that is configured to get the maintenance window data from SSM hasn't tried accessing SSM, as IAM shows no activity.

There may be a delay of up to 4 hours before the usage shows up there.

We are working on a 3.0.1 patch and will include the fix to this issue then. Let me know if you have any other questions!

grahamwright19 commented 2 months ago

Thanks for the information. I have been doing some more testing and have been able to get the maintenance windows to load into the DynamoDB table for all regions except us-east-1 (the home region where the solution is deployed), so the issue appears to be limited to that region.

grahamwright19 commented 2 months ago

I have found out what the issue was. Someone had created a couple of maintenance windows in one of the accounts in us-east-1 whose schedules had already passed. As soon as I removed these, all my errors went away. Thanks for the assistance.
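For anyone hitting the same error on v3.0.0, the offending windows can be located before deleting anything. The following is a minimal sketch, not part of the solution: the function name `windows_without_next_execution` is hypothetical, and it assumes it is handed a boto3 SSM client (or any object with the same paginator interface) for the affected account and region.

```python
from typing import Any, Iterator


def windows_without_next_execution(ssm_client: Any) -> Iterator[dict]:
    """Yield maintenance window identities that report no NextExecutionTime.

    `ssm_client` is assumed to behave like boto3.client("ssm"); the helper
    itself is a hypothetical diagnostic, not code from the solution.
    """
    paginator = ssm_client.get_paginator("describe_maintenance_windows")
    for page in paginator.paginate():
        for identity in page.get("WindowIdentities", []):
            # Expired or never-active windows come back without this key,
            # which is what triggers the KeyError in the traceback above.
            if "NextExecutionTime" not in identity:
                yield identity
```

With boto3 installed and credentials for the affected account, this could be called as `list(windows_without_next_execution(boto3.client("ssm", region_name="us-east-1")))` and the `Name` and `WindowId` of each result inspected.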

aws-khargita commented 2 months ago

Reopening this issue as the behavior should be to skip these problematic maintenance windows rather than cause cascading failures. Will close once patch is released.
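The skip behavior described above could look roughly like the sketch below. This is an illustration only and not the actual v3.0.1 change: `usable_window_identities` is a hypothetical helper, and it assumes each identity is a plain dict shaped like the entries returned by describe_maintenance_windows.

```python
import logging
from typing import Iterable, Iterator

logger = logging.getLogger(__name__)


def usable_window_identities(identities: Iterable[dict]) -> Iterator[dict]:
    """Yield only identities that carry a NextExecutionTime.

    Hypothetical helper: windows without the field are logged and skipped
    instead of raising KeyError and failing the whole scheduling run.
    """
    for identity in identities:
        if "NextExecutionTime" not in identity:
            logger.warning(
                "Skipping maintenance window %s (%s): no NextExecutionTime",
                identity.get("Name"),
                identity.get("WindowId"),
            )
            continue
        yield identity
```

Filtering before parsing keeps one expired window from affecting every other schedule in the same account and region.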

CrypticCabub commented 2 months ago

Fixed in v3.0.1