aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0
1.04k stars 323 forks source link

Crash and reboot loop on custom SSM Document #392

Closed dorinclisu closed 2 years ago

dorinclisu commented 3 years ago

OS version: Ubuntu 20.04 Agent version: 3.0.1295.0 Installation: snap channel candidate

I have a custom reboot document:

---
schemaVersion: "2.2"
description: "Reboot instance"
mainSteps:
- action: "aws:runShellScript"
  name: "reboot"
  inputs:
    runCommand:
    - |
      UPTIME_SECONDS=`awk '{print $1}' /proc/uptime`
      UPTIME_SECONDS=${UPTIME_SECONDS%.*}
      echo "Uptime: $UPTIME_SECONDS seconds"
      if [ $UPTIME_SECONDS -gt 60 ]; then
        echo "rebooting ..."
        exit 194
      else
        echo "rebooted"
      fi

Running it as a standalone command works fine. Problem is that I also have another custom document, which in turn has some aws:runDocument steps and one of the documents in turn calls the reboot document. The instance gets into a reboot loop from which it only recovers after canceling the command from the console and then restarting the ssm agent service a couple of times (from ssh).

Here are some relevant logs from the agent showing the go stacktrace:

2021-07-09 15:07:35 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runDocument] Validating SSM parameters
2021-07-09 15:07:35 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runDocument] Running sub-document
2021-07-09 15:07:35 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runDocument] [BasicExecuter] Running plugin aws:runShellScript reboot
2021-07-09 15:07:35 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runShellScript] aws:runShellScript started with configuration {0xc00035be60 map[runCommand:[UPTIME_SECONDS=`awk '{print $1}' /proc/uptime`
UPTIME_SECONDS=${UPTIME_SECONDS%.*}
echo "Uptime: $UPTIME_SECONDS seconds"
if [ $UPTIME_SECONDS -gt 60 ]; then
  echo "rebooting ..."
  exit 194
else
  echo "rebooted"
fi]] 556660fe-a2de-4fc8-9452-fbd95f940cf6/mi-02cd7937012d59b34/awsrunDocument/awsrunDocument/awsrunShellScript  false  false false /var/lib/amazon/ssm/mi-02cd7937012d59b34/document/orchestration/556660fe-a2de-4fc8-9452-fbd95f940cf6/driver/reboot/reboot aws.ssm.556660fe-a2de-4fc8-9452-fbd95f940cf6.mi-02cd7937012d59b34 reboot aws:runShellScript reboot  map[] true []    false  { } }
2021-07-09 15:07:35 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runShellScript] Unexpected 'TimeoutSeconds' value <nil> received. Setting 'TimeoutSeconds' to default value 3600
2021-07-09 15:07:35 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runShellScript] 'TimeoutSeconds' value should be between 5 and 172800. Setting 'TimeoutSeconds' to default value 3600
2021-07-09 15:07:35 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runShellScript] The execution of command returned Exit Status: 194
2021-07-09 15:07:35 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runDocument] [BasicExecuter] Sending plugin reboot completion message
2021-07-09 15:07:35 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] Sending plugin reboot completion message
2021-07-09 15:07:35 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] Sending plugin driver completion message
2021-07-09 15:07:35 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] document execution complete
2021-07-09 15:07:35 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] sending document complete response...
2021-07-09 15:07:35 ERROR [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] write file /var/lib/amazon/ssm/mi-02cd7937012d59b34/channels/556660fe-a2de-4fc8-9452-fbd95f940cf6/tmp/worker-20210709150713-000 encountered error: open /var/lib/amazon/ssm/mi-02cd7937012d59b34/channels/556660fe-a2de-4fc8-9452-fbd95f940cf6/tmp/worker-20210709150713-000: no such file or directory

2021-07-09 15:07:35 ERROR [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] failed to send message to ipc channel: open /var/lib/amazon/ssm/mi-02cd7937012d59b34/channels/556660fe-a2de-4fc8-9452-fbd95f940cf6/tmp/worker-20210709150713-000: no such file or directory
2021-07-09 15:07:35 ERROR [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] messaging worker encountered error: open /var/lib/amazon/ssm/mi-02cd7937012d59b34/channels/556660fe-a2de-4fc8-9452-fbd95f940cf6/tmp/worker-20210709150713-000: no such file or directory
2021-07-09 15:07:28 INFO [ssm-agent-worker] [OfflineService] All workers have finished and pool has been put into shutdown
2021-07-09 15:07:28 INFO [ssm-agent-worker] [OfflineService] JobQueue has been closed
2021-07-09 15:07:28 INFO [ssm-agent-worker] [OfflineService] All workers have finished and pool has been put into shutdown
2021-07-09 15:07:36 INFO [ssm-agent-worker] [MessagingDeliveryService] [Association] No associations on boot. Requerying for associations after 30 seconds.
2021-07-09 15:07:45 INFO [ssm-agent-worker] [MessagingDeliveryService] JobQueue has been closed
2021-07-09 15:07:45 INFO [ssm-agent-worker] [MessagingDeliveryService] JobQueue has been closed
2021-07-09 15:07:45 INFO [ssm-agent-worker] [MessagingDeliveryService] All workers have finished and pool has been put into shutdown
2021-07-09 15:07:45 INFO [ssm-agent-worker] [MessagingDeliveryService] All workers have finished and pool has been put into shutdown
2021-07-09 15:07:46 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] Sending plugin disableNouveau completion message
2021-07-09 15:07:45 INFO [ssm-agent-worker] [MessagingDeliveryService] [Association] JobQueue has been closed
2021-07-09 15:07:46 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] Running plugin aws:runDocument reboot
2021-07-09 15:07:45 INFO [ssm-agent-worker] [MessagingDeliveryService] [Association] All workers have finished and pool has been put into shutdown
2021-07-09 15:07:46 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runDocument] Plugin aws:runDocument started with configuration{0xc0005e0348 map[documentPath:Reboot documentType:SSMDocument] 556660fe-a2de-4fc8-9452-fbd95f940cf6/mi-02cd7937012d59b34/awsrunDocument/awsrunDocument  false  false false /var/lib/amazon/ssm/mi-02cd7937012d59b34/document/orchestration/556660fe-a2de-4fc8-9452-fbd95f940cf6/driver/reboot aws.ssm.556660fe-a2de-4fc8-9452-fbd95f940cf6.mi-02cd7937012d59b34 driver aws:runDocument reboot  map[] true []    false  { } }
2021-07-09 15:07:46 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runDocument] Depth of execution - 2
2021-07-09 15:07:46 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runDocument] Sending the document received for parsing - {
  "schemaVersion" : "2.2",
  "description" : "Reboot instance",
  "mainSteps" : [ {
    "action" : "aws:runShellScript",
    "name" : "reboot",
    "inputs" : {
      "runCommand" : [ "UPTIME_SECONDS=`awk '{print $1}' /proc/uptime`\nUPTIME_SECONDS=${UPTIME_SECONDS%.*}\necho \"Uptime: $UPTIME_SECONDS seconds\"\nif [ $UPTIME_SECONDS -gt 60 ]; then\n  echo \"rebooting ...\"\n  exit 194\nelse\n  echo \"rebooted\"\nfi" ]
    }
  } ]
}
2021-07-09 15:07:46 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runDocument] Validating SSM parameters
2021-07-09 15:07:46 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runDocument] Running sub-document
2021-07-09 15:07:46 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runDocument] [BasicExecuter] Running plugin aws:runShellScript reboot
2021-07-09 15:07:46 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runShellScript] aws:runShellScript started with configuration {0xc0005e0fc0 map[runCommand:[UPTIME_SECONDS=`awk '{print $1}' /proc/uptime`
UPTIME_SECONDS=${UPTIME_SECONDS%.*}
echo "Uptime: $UPTIME_SECONDS seconds"
if [ $UPTIME_SECONDS -gt 60 ]; then
  echo "rebooting ..."
  exit 194
else
  echo "rebooted"
fi]] 556660fe-a2de-4fc8-9452-fbd95f940cf6/mi-02cd7937012d59b34/awsrunDocument/awsrunDocument/awsrunShellScript  false  false false /var/lib/amazon/ssm/mi-02cd7937012d59b34/document/orchestration/556660fe-a2de-4fc8-9452-fbd95f940cf6/driver/reboot/reboot aws.ssm.556660fe-a2de-4fc8-9452-fbd95f940cf6.mi-02cd7937012d59b34 reboot aws:runShellScript reboot  map[] true []    false  { } }
2021-07-09 15:07:46 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runShellScript] Unexpected 'TimeoutSeconds' value <nil> received. Setting 'TimeoutSeconds' to default value 3600
2021-07-09 15:07:46 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runShellScript] 'TimeoutSeconds' value should be between 5 and 172800. Setting 'TimeoutSeconds' to default value 3600
2021-07-09 15:07:46 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runShellScript] The execution of command returned Exit Status: 194
2021-07-09 15:07:46 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] [pluginName=aws:runDocument] [BasicExecuter] Sending plugin reboot completion message
2021-07-09 15:07:46 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] [pluginName=aws:runDocument] [BasicExecuter] Sending plugin reboot completion message
2021-07-09 15:07:46 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] Sending plugin driver completion message
2021-07-09 15:07:46 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] document execution complete
2021-07-09 15:07:46 INFO [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] [DataBackend] sending document complete response...
2021-07-09 15:07:46 ERROR [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] write file /var/lib/amazon/ssm/mi-02cd7937012d59b34/channels/556660fe-a2de-4fc8-9452-fbd95f940cf6/tmp/worker-20210709150725-000 encountered error: open /var/lib/amazon/ssm/mi-02cd7937012d59b34/channels/556660fe-a2de-4fc8-9452-fbd95f940cf6/tmp/worker-20210709150725-000: no such file or directory

2021-07-09 15:07:46 ERROR [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] failed to send message to ipc channel: open /var/lib/amazon/ssm/mi-02cd7937012d59b34/channels/556660fe-a2de-4fc8-9452-fbd95f940cf6/tmp/worker-20210709150725-000: no such file or directory
2021-07-09 15:07:46 ERROR [ssm-document-worker] [556660fe-a2de-4fc8-9452-fbd95f940cf6] messaging worker encountered error: open /var/lib/amazon/ssm/mi-02cd7937012d59b34/channels/556660fe-a2de-4fc8-9452-fbd95f940cf6/tmp/worker-20210709150725-000: no such file or directory
2021-07-09 15:07:45 INFO [ssm-agent-worker] [MessagingDeliveryService] [Association] JobQueue has been closed
2021-07-09 15:07:45 INFO [ssm-agent-worker] [MessagingDeliveryService] [Association] All workers have finished and pool has been put into shutdown
2021-07-09 15:08:07 INFO [ssm-agent-worker] [MessagingDeliveryService] [Association] Schedule manager refreshed with 0 associations, 0 newassociations associated
2021-07-09 15:08:26 INFO [amazon-ssm-agent] Got signal:terminated value:0x555adff132a0
2021-07-09 15:08:26 INFO [amazon-ssm-agent] Stopping Core Agent
2021-07-09 15:08:26 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] Receiving stop signal, stop worker monitor
2021-07-09 15:08:26 INFO [ssm-agent-worker] Received Core Agent termination request {"SchemaVersion":1,"Topic":"TerminateWorkerRequest","Payload":null}
2021-07-09 15:08:26 INFO [ssm-agent-worker] Received Core Agent termination signal, terminating ssm-agent-worker
2021-07-09 15:08:26 INFO [ssm-agent-worker] Received core agent reboot signal
2021-07-09 15:08:26 INFO [ssm-agent-worker] Stopping ssm agent worker
2021-07-09 15:08:26 INFO [ssm-agent-worker] [instanceID=mi-02cd7937012d59b34] core manager stop requested. Stop type: HardStop
2021-07-09 15:08:26 INFO [ssm-agent-worker] [MessageGatewayService] Stopping MessageGatewayService.
2021-07-09 15:08:26 INFO [ssm-agent-worker] [MessageGatewayService] Closing controlchannel with channel Id mi-02cd7937012d59b34
2021-07-09 15:08:26 INFO [ssm-agent-worker] [MessageGatewayService] Closing websocket channel connection to: wss://ssmmessages.eu-central-1.amazonaws.com/v1/control-channel/mi-02cd7937012d59b34?role=subscribe&stream=input
2021-07-09 15:08:26 ERROR [ssm-agent-worker] [instanceID=mi-02cd7937012d59b34] Core module stop request panic: close of closed channel
2021-07-09 15:08:26 ERROR [ssm-agent-worker] [instanceID=mi-02cd7937012d59b34] Stacktrace:
goroutine 280 [running]:
runtime/debug.Stack(0xc000459140, 0x5559312541dd, 0x22)
        runtime/debug/stack.go:24 +0x9f
github.com/aws/amazon-ssm-agent/agent/framework/coremanager.(*CoreManager).stopCoreModules.func1.1(0x55593165e720, 0xc000459140)
        github.com/aws/amazon-ssm-agent/agent/framework/coremanager/coremanager.go:249 +0xda
panic(0x55593149dbc0, 0x55593160de80)
        runtime/panic.go:969 +0x1b9
github.com/aws/amazon-ssm-agent/agent/framework/processor.(*EngineProcessor).Stop(0xc00033e310, 0x555931233fc2, 0x8)
        github.com/aws/amazon-ssm-agent/agent/framework/processor/processor.go:227 +0x165
github.com/aws/amazon-ssm-agent/agent/runcommand.(*RunCommandService).ModuleRequestStop(0xc000342000, 0x555931233fc2, 0x8, 0x5559314f43e0,0x0)
        github.com/aws/amazon-ssm-agent/agent/runcommand/coremodule.go:100 +0x62
github.com/aws/amazon-ssm-agent/agent/framework/coremanager.(*CoreManager).stopCoreModules.func1(0x55593165e720, 0xc000459140, 0x555931233fc2, 0x8, 0xc0005e4040, 0xc000611990, 0x1)
        github.com/aws/amazon-ssm-agent/agent/framework/coremanager/coremanager.go:258 +0x107
created by github.com/aws/amazon-ssm-agent/agent/framework/coremanager.(*CoreManager).stopCoreModules
        github.com/aws/amazon-ssm-agent/agent/framework/coremanager/coremanager.go:245 +0x17e

2021-07-09 15:08:26 ERROR [ssm-agent-worker] [MessageGatewayService] MessageGatewayService ModuleRequestStop run panic: close of closed channel
2021-07-09 15:08:26 ERROR [ssm-agent-worker] [MessageGatewayService] Stacktrace:
goroutine 281 [running]:
runtime/debug.Stack(0xc0001afda0, 0xc00019de00, 0x60)
        runtime/debug/stack.go:24 +0x9f
github.com/aws/amazon-ssm-agent/agent/session.(*Session).ModuleRequestStop.func1(0x55593165e720, 0xc0001afda0)
        github.com/aws/amazon-ssm-agent/agent/session/session.go:222 +0xda
panic(0x55593149dbc0, 0x55593160de80)
        runtime/panic.go:969 +0x1b9
github.com/aws/amazon-ssm-agent/agent/framework/processor.(*EngineProcessor).Stop(0xc00033e380, 0x555931233fc2, 0x8)
        github.com/aws/amazon-ssm-agent/agent/framework/processor/processor.go:227 +0x165
github.com/aws/amazon-ssm-agent/agent/session.(*Session).ModuleRequestStop(0xc0004707e0, 0x555931233fc2, 0x8, 0x0, 0x0)
        github.com/aws/amazon-ssm-agent/agent/session/session.go:239 +0x1e6
github.com/aws/amazon-ssm-agent/agent/framework/coremanager.(*CoreManager).stopCoreModules.func1(0x55593165e720, 0xc000459140, 0x555931233fc2, 0x8, 0xc0005e4040, 0xc000611990, 0x2)
        github.com/aws/amazon-ssm-agent/agent/framework/coremanager/coremanager.go:258 +0x107
created by github.com/aws/amazon-ssm-agent/agent/framework/coremanager.(*CoreManager).stopCoreModules
        github.com/aws/amazon-ssm-agent/agent/framework/coremanager/coremanager.go:245 +0x17e

2021-07-09 15:08:26 ERROR [ssm-agent-worker] [instanceID=mi-02cd7937012d59b34] Core module stop request panic: close of closed channel
2021-07-09 15:08:26 ERROR [ssm-agent-worker] [instanceID=mi-02cd7937012d59b34] Stacktrace:
goroutine 282 [running]:
runtime/debug.Stack(0xc000459140, 0xc0008a4000, 0x57)
        runtime/debug/stack.go:24 +0x9f
github.com/aws/amazon-ssm-agent/agent/framework/coremanager.(*CoreManager).stopCoreModules.func1.1(0x55593165e720, 0xc000459140)
        github.com/aws/amazon-ssm-agent/agent/framework/coremanager/coremanager.go:249 +0xda
panic(0x55593149dbc0, 0x55593160de80)
        runtime/panic.go:969 +0x1b9
github.com/aws/amazon-ssm-agent/agent/framework/processor.(*EngineProcessor).Stop(0xc00033e460, 0x555931233fc2, 0x8)
        github.com/aws/amazon-ssm-agent/agent/framework/processor/processor.go:227 +0x165
github.com/aws/amazon-ssm-agent/agent/runcommand.(*RunCommandService).ModuleRequestStop(0xc0003420f0, 0x555931233fc2, 0x8, 0x61, 0x36e)
        github.com/aws/amazon-ssm-agent/agent/runcommand/coremodule.go:100 +0x62
github.com/aws/amazon-ssm-agent/agent/framework/coremanager.(*CoreManager).stopCoreModules.func1(0x55593165e720, 0xc000459140, 0x555931233fc2, 0x8, 0xc0005e4040, 0xc000611990, 0x3)
        github.com/aws/amazon-ssm-agent/agent/framework/coremanager/coremanager.go:258 +0x107
created by github.com/aws/amazon-ssm-agent/agent/framework/coremanager.(*CoreManager).stopCoreModules
        github.com/aws/amazon-ssm-agent/agent/framework/coremanager/coremanager.go:245 +0x17e

2021-07-09 15:08:31 INFO [ssm-agent-worker] Bye.
2021-07-09 15:08:33 INFO [amazon-ssm-agent] Bye.
2021-07-09 15:08:34 INFO [ssm-session-worker] [root-071ee3bf5fdf1b05c] [DataBackend] [pluginName=Standard_Stream] caught signal to terminate: terminated
2021-07-09 15:08:34 INFO [ssm-session-worker] [root-071ee3bf5fdf1b05c] [DataBackend] [pluginName=Standard_Stream] Stopping pty
2021-07-09 15:08:34 INFO [ssm-session-worker] [root-071ee3bf5fdf1b05c] [DataBackend] [pluginName=Standard_Stream] The session was cancelled
2021-07-09 15:08:34 ERROR [ssm-session-worker] [root-071ee3bf5fdf1b05c] [DataBackend] [pluginName=Standard_Stream] unable to wait pty: signal: hangup
2021-07-09 15:08:34 INFO [ssm-session-worker] [root-071ee3bf5fdf1b05c] [DataBackend] [pluginName=Standard_Stream] Stopping pty
2021-07-09 15:08:34 INFO [ssm-session-worker] [root-071ee3bf5fdf1b05c] [DataBackend] [pluginName=Standard_Stream] Closing datachannel withchannel Id root-071ee3bf5fdf1b05c
2021-07-09 15:08:34 INFO [ssm-session-worker] [root-071ee3bf5fdf1b05c] [DataBackend] [pluginName=Standard_Stream] Closing websocket channel connection to: wss://ssmmessages.eu-central-1.amazonaws.com/v1/data-channel/root-071ee3bf5fdf1b05c?role=publish_subscribe
2021-07-09 15:08:34 INFO [ssm-session-worker] [root-071ee3bf5fdf1b05c] [DataBackend] [pluginName=Standard_Stream] Ending websocket pinger
2021-07-09 15:08:34 INFO [ssm-session-worker] [root-071ee3bf5fdf1b05c] [DataBackend] [pluginName=Standard_Stream] Ending the channel listening routine since the channel is closed
2021-07-09 15:08:34 INFO [ssm-session-worker] [root-071ee3bf5fdf1b05c] [DataBackend] [pluginName=Standard_Stream] Ending websocket listener
2021-07-09 15:08:34 INFO [ssm-session-worker] [root-071ee3bf5fdf1b05c] [DataBackend] [pluginName=Standard_Stream] Successfully closed websocket connection to: 52.94.205.120:443
2021-07-09 15:08:34 INFO [ssm-session-worker] [root-071ee3bf5fdf1b05c] [DataBackend] Sending plugin Standard_Stream completion message
2021-07-09 15:08:34 INFO [ssm-session-worker] [root-071ee3bf5fdf1b05c] [DataBackend] document execution complete
2021-07-09 15:08:34 INFO [ssm-session-worker] [root-071ee3bf5fdf1b05c] [DataBackend] sending document complete response...
2021-07-09 15:08:34 INFO [ssm-session-worker] [root-071ee3bf5fdf1b05c] [DataBackend] stopping ipc worker...
2021-07-09 15:08:34 INFO [ssm-session-worker] [root-071ee3bf5fdf1b05c] requested shutdown, prepare to stop messaging
2021-07-09 15:08:34 INFO [ssm-session-worker] [root-071ee3bf5fdf1b05c] channel /var/lib/amazon/ssm/mi-02cd7937012d59b34/channels/root-071ee3bf5fdf1b05c requested close
2021-07-09 15:08:34 INFO [ssm-session-worker] [root-071ee3bf5fdf1b05c] channel /var/lib/amazon/ssm/mi-02cd7937012d59b34/channels/root-071ee3bf5fdf1b05c closed
2021-07-09 15:08:34 INFO [ssm-session-worker] [root-071ee3bf5fdf1b05c] ipc channel closed, stop messaging worker
2021-07-09 15:08:34 INFO [ssm-session-worker] [root-071ee3bf5fdf1b05c] Session worker closed
ferkhat-aws commented 3 years ago

Hello. I'm interested in the other custom document that has the aws:runDocument step. Can you share that with us to that we can try to reproduce the issue on our end and investigate?

dorinclisu commented 3 years ago

OK, I found a minimum replicable setup:


Reboot

---
schemaVersion: "2.2"
description: "Reboot instance"
mainSteps:
- action: "aws:runShellScript"
  name: "reboot"
  inputs:
    runCommand:
    - |
      UPTIME_SECONDS=`awk '{print $1}' /proc/uptime`
      UPTIME_SECONDS=${UPTIME_SECONDS%.*}
      echo "Uptime: $UPTIME_SECONDS seconds"
      if [ $UPTIME_SECONDS -gt 60 ]; then
        echo "rebooting ..."
        exit 194
      else
        echo "rebooted"
      fi

TestShell

---
schemaVersion: "2.2"
description: "Test shell"
mainSteps:
- action: "aws:runShellScript"
  name: "test"
  inputs:
    runCommand:
    - "echo 'test'"
    - "sleep 10"
    - "echo 'test end'"

TestSteps

---
schemaVersion: "2.2"
description: "Run a series of steps."
mainSteps:
- action: "aws:runShellScript"
  name: "s1"
  inputs:
    runCommand:
    - "echo 'step 1'"
    - "sleep 10"
    - "echo 'step 1 end'"

- action: "aws:runDocument"
  name: "reboot"
  inputs:
    documentType: "SSMDocument"
    documentPath: "Reboot"

- action: "aws:runShellScript"
  name: "s2"
  inputs:
    runCommand:
    - |
      echo 'step 2'
      sleep 60
      echo 'step 2 end'

- action: "aws:runDocument"
  name: "reboot2"
  inputs:
    documentType: "SSMDocument"
    documentPath: "Reboot"

TestDocuments

---
schemaVersion: "2.2"
description: "Run a series of documents"
mainSteps:
- action: "aws:runDocument"
  name: "steps"
  inputs:
    documentType: "SSMDocument"
    documentPath: "TestSteps"

- action: "aws:runDocument"
  name: "shell"
  inputs:
    documentType: "SSMDocument"
    documentPath: "TestShell"

Each of these documents runs ok individually, except TestDocuments the one that results in a crash and reboot loop. Notice the shell sleep statements which are key to reproduction, as they mimic package installs or other processes taking a considerable time to complete.

One thing I forgot to mention about the crash - the command output never appears (not even after 30 minutes):

Screenshot 2021-07-16 at 10 39 07
dorinclisu commented 3 years ago

Another strange thing that may or may not have something to do with the problem, is the output from the TestSteps document, which although runs ok in the end, looks like this on the reboot steps:

Uptime: 78 seconds
rebooting ...
Uptime: 78 seconds
rebooting ...
Uptime: 19 seconds
rebooted

The fact that "Uptime: 78 seconds" shows up twice worries me and suggests that the document runs the same script concurrently prior to the actual OS reboot, as if triggered from different threads.

ferkhat-aws commented 3 years ago

It looks like the agent doesn't know how to behave correctly when a restart step is part of a "nested" document. Our team will investigate the issue further and try to find a fix. Thank you for raising this issue.

ferkhat-aws commented 3 years ago

Is it possible for you to include the reboot step as part of the head document? The Agent should be able to handle it without any issue. Also, if you can flatten all of the documents into one single document, that could work as well. These are just suggestions to mitigate the issue for you asap.

dorinclisu commented 3 years ago

Thanks, I can workaround with your suggestion. The nesting is needed merely to stay DRY as the documents are re-used.

dorinclisu commented 2 years ago

So which agent version fixes it?