Describe the Setup & the Bug
We are running an association via State Manager that uses the AWS-managed document AWS-GatherSoftwareInventory. The association has a rate schedule expression set to every 12 hours.
We use an EventBridge rule that listens for the pattern source => aws.ssm + detail-status => Failed. Each time an association state is reported as failed, we are notified via SNS/mail.
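For readers who want to reproduce the setup, here is a minimal sketch of what such a rule pattern might look like. It assumes that `status` sits directly under `detail`, which our description above only implies. The `matches` helper is a hypothetical, simplified local matcher for illustration (exact-value lists and nested objects only); it is not how EventBridge itself evaluates patterns.

```python
import json

# Event pattern as described above: source aws.ssm, detail.status Failed.
# Assumption: "status" is a direct child of "detail" in the delivered event.
EVENT_PATTERN = {
    "source": ["aws.ssm"],
    "detail": {"status": ["Failed"]},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified local matcher: supports exact-value lists and nested objects only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern object: the event must have a matching nested object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: the event value must be one of the listed values.
            return False
    return True


if __name__ == "__main__":
    # Print the pattern as JSON, e.g. to paste into the rule definition.
    print(json.dumps(EVENT_PATTERN, indent=2))
```

With this pattern, every event that reports `Failed` triggers a notification, including the ones the Console later shows as successful.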
Looking through the UpdateInstanceAssociationStatus CloudTrail events, we found that the association was failing on a variety of instances (Windows and Linux); for details see "Current Behavior". This triggers the EventBridge rule.
In the AWS Console, the association shows the status "Success" in the execution history, which does not match the status "Failed" from the event. Furthermore, we don't see any errors in applying the document.
Current Behavior
Although the association state currently shows 'Success' in the Console on each instance, the events suggest that the association is failing intermittently.
Checking the UpdateInstanceAssociationStatus events (occurring multiple times per instance at the same point in time), we see error details with StuckAtInProgress / Association stuck at InProgress for longer than 2 hours for at least one failed event for AWS-GatherSoftwareInventory.
Checking the local errors.log of SSM Agent at the same time of the event, we see:
(Windows - Server 2016 Datacenter Build 14393, SSM Agent Version: 3.3.551.0)
2024-08-23 17:23:17 ERROR [runScheduledAssociation @ processor.go.305] [ssm-agent-worker] [MessageService] [Association] Association stuck at InProgress for longer than 2 Hours
(Linux - Amazon Linux 2, SSM Agent Version: 3.3.380.0)
2024-08-22 04:57:28 ERROR [runScheduledAssociation @ processor.go.313] [ssm-agent-worker] [MessageService] [Association] Association stuck at InProgress for longer than 2 Hours
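To correlate these agent log entries with the event timestamps, something like the following sketch could be used. It assumes the line layout seen in the two samples above (timestamp, level, `[function @ location]`, then free text); the names `StuckEntry` and `parse_stuck_entry` are our own and not part of SSM Agent.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Layout assumed from the sample lines: "<timestamp> <LEVEL> [<func> @ <location>] <rest>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"\[(?P<func>\S+) @ (?P<loc>[^\]]+)\] "
    r"(?P<rest>.*)$"
)


class StuckEntry(NamedTuple):
    timestamp: datetime
    location: str


def parse_stuck_entry(line: str) -> Optional[StuckEntry]:
    """Return a StuckEntry for 'stuck at InProgress' error lines, else None."""
    m = LOG_LINE.match(line.strip())
    if m is None or m["level"] != "ERROR":
        return None
    if "stuck at InProgress" not in m["rest"]:
        return None
    ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
    return StuckEntry(ts, m["loc"])
```

Running this over errors.log yields the timestamps that can then be compared against the UpdateInstanceAssociationStatus events for the same instance.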
We discussed this behavior with the SSM team (Ethan) and uploaded the full local logs and the full events to case 172380074600709.
Expected Behavior:
The SSM team should investigate this issue and fix it. We heard that the SSM team is aware of this bug. Please update this GitHub issue #584 (and feel free to add more details) so that we (and others) can follow the progress.
Results shown in the Console and in the events should match. We would like to avoid being notified about failed associations that are not actually failed or stuck (false positives).
Workaround:
For now, whenever we receive a notification that the AWS-GatherSoftwareInventory association has failed, we first verify its status in the State Manager Console. If the association shows as successful, we ignore the notification. This is annoying, repetitive daily work.
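The manual Console check could in principle be scripted. The following is only a sketch, assuming a boto3 SSM client is passed in and that the current status reported by `describe_instance_associations_status` reflects what the Console shows. The helper names `association_status` and `is_false_positive` are hypothetical.

```python
from typing import Any, Optional


def association_status(ssm_client: Any, instance_id: str, association_id: str) -> Optional[str]:
    """Return the current status of one association on one instance, or None if not found."""
    kwargs = {"InstanceId": instance_id}
    while True:
        page = ssm_client.describe_instance_associations_status(**kwargs)
        for info in page.get("InstanceAssociationStatusInfos", []):
            if info.get("AssociationId") == association_id:
                return info.get("Status")
        token = page.get("NextToken")
        if not token:
            return None
        # Follow pagination until the association is found or pages run out.
        kwargs["NextToken"] = token


def is_false_positive(ssm_client: Any, instance_id: str, association_id: str) -> bool:
    """True if a 'Failed' notification arrived but the association currently reports Success."""
    return association_status(ssm_client, instance_id, association_id) == "Success"
```

In practice the client would come from `boto3.client("ssm")`, and the instance and association IDs would be taken from the notification itself.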
Furthermore, we are trying to use anything-but matching within the rule to filter out the associationId that belongs to AWS-GatherSoftwareInventory. In effect, we have "disabled" event-based notifications for this particular association completely for now.
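A rough sketch of the filtered pattern is shown here. The exact key path of the association ID inside `detail` depends on the event shape and is an assumption (we use `associationId` as named above), and the ID in the usage line is a placeholder, not our real association ID.

```python
import json
from typing import List


def failed_pattern_excluding(association_ids: List[str]) -> dict:
    """Build a pattern matching failed aws.ssm events except for the given associations."""
    return {
        "source": ["aws.ssm"],
        "detail": {
            "status": ["Failed"],
            # Assumption: the association ID is exposed as detail.associationId.
            "associationId": [{"anything-but": association_ids}],
        },
    }


if __name__ == "__main__":
    # "EXAMPLE-ASSOCIATION-ID" is a placeholder for the inventory association's ID.
    print(json.dumps(failed_pattern_excluding(["EXAMPLE-ASSOCIATION-ID"]), indent=2))
```

The downside of this approach is that genuine failures of the inventory association are suppressed as well, which is why we consider it a workaround only.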