Closed dorinclisu closed 2 years ago
Hello. I'm interested in the other custom document that has the aws:runDocument step. Can you share that with us so that we can try to reproduce the issue on our end and investigate?
OK, I found a minimal reproducible setup:
Reboot
---
schemaVersion: "2.2"
description: "Reboot instance"
mainSteps:
  - action: "aws:runShellScript"
    name: "reboot"
    inputs:
      runCommand:
        - |
          UPTIME_SECONDS=`awk '{print $1}' /proc/uptime`
          UPTIME_SECONDS=${UPTIME_SECONDS%.*}
          echo "Uptime: $UPTIME_SECONDS seconds"
          if [ $UPTIME_SECONDS -gt 60 ]; then
            echo "rebooting ..."
            exit 194
          else
            echo "rebooted"
          fi
TestShell
---
schemaVersion: "2.2"
description: "Test shell"
mainSteps:
  - action: "aws:runShellScript"
    name: "test"
    inputs:
      runCommand:
        - "echo 'test'"
        - "sleep 10"
        - "echo 'test end'"
TestSteps
---
schemaVersion: "2.2"
description: "Run a series of steps."
mainSteps:
  - action: "aws:runShellScript"
    name: "s1"
    inputs:
      runCommand:
        - "echo 'step 1'"
        - "sleep 10"
        - "echo 'step 1 end'"
  - action: "aws:runDocument"
    name: "reboot"
    inputs:
      documentType: "SSMDocument"
      documentPath: "Reboot"
  - action: "aws:runShellScript"
    name: "s2"
    inputs:
      runCommand:
        - |
          echo 'step 2'
          sleep 60
          echo 'step 2 end'
  - action: "aws:runDocument"
    name: "reboot2"
    inputs:
      documentType: "SSMDocument"
      documentPath: "Reboot"
TestDocuments
---
schemaVersion: "2.2"
description: "Run a series of documents"
mainSteps:
  - action: "aws:runDocument"
    name: "steps"
    inputs:
      documentType: "SSMDocument"
      documentPath: "TestSteps"
  - action: "aws:runDocument"
    name: "shell"
    inputs:
      documentType: "SSMDocument"
      documentPath: "TestShell"
Each of these documents runs OK individually, except TestDocuments, which is the one that results in a crash and reboot loop.
Notice the shell sleep statements, which are key to reproduction: they mimic package installs or other processes that take a considerable time to complete.
One thing I forgot to mention about the crash: the command output never appears (not even after 30 minutes).
Another strange thing, which may or may not be related to the problem, is the output from the TestSteps document, which, although it runs OK in the end, looks like this on the reboot steps:
Uptime: 78 seconds
rebooting ...
Uptime: 78 seconds
rebooting ...
Uptime: 19 seconds
rebooted
The fact that "Uptime: 78 seconds" shows up twice worries me and suggests that the document runs the same script concurrently prior to the actual OS reboot, as if triggered from different threads.
It looks like the agent doesn't know how to behave correctly when a restart step is part of a "nested" document. Our team will investigate the issue further and try to find a fix. Thank you for raising this issue.
Is it possible for you to include the reboot step as part of the head document? The Agent should be able to handle it without any issue. Also, if you can flatten all of the documents into one single document, that could work as well. These are just suggestions to mitigate the issue for you asap.
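As an illustration of the second suggestion, a flattened version of the documents above might look like the sketch below. It is untested against the agent; the step names `reboot`, `reboot2`, and `test` are simply carried over from the originals, and the reboot script has to be repeated in each reboot step because the shared Reboot document is no longer referenced.

```yaml
---
schemaVersion: "2.2"
description: "Flattened sketch of TestDocuments (assumption: untested)"
mainSteps:
  # Steps formerly in TestSteps
  - action: "aws:runShellScript"
    name: "s1"
    inputs:
      runCommand:
        - "echo 'step 1'"
        - "sleep 10"
        - "echo 'step 1 end'"
  # Inlined copy of the Reboot document; exit code 194 asks the agent to reboot
  - action: "aws:runShellScript"
    name: "reboot"
    inputs:
      runCommand:
        - |
          UPTIME_SECONDS=`awk '{print $1}' /proc/uptime`
          UPTIME_SECONDS=${UPTIME_SECONDS%.*}
          echo "Uptime: $UPTIME_SECONDS seconds"
          if [ $UPTIME_SECONDS -gt 60 ]; then
            echo "rebooting ..."
            exit 194
          else
            echo "rebooted"
          fi
  - action: "aws:runShellScript"
    name: "s2"
    inputs:
      runCommand:
        - |
          echo 'step 2'
          sleep 60
          echo 'step 2 end'
  # Second inlined copy of the Reboot document
  - action: "aws:runShellScript"
    name: "reboot2"
    inputs:
      runCommand:
        - |
          UPTIME_SECONDS=`awk '{print $1}' /proc/uptime`
          UPTIME_SECONDS=${UPTIME_SECONDS%.*}
          echo "Uptime: $UPTIME_SECONDS seconds"
          if [ $UPTIME_SECONDS -gt 60 ]; then
            echo "rebooting ..."
            exit 194
          else
            echo "rebooted"
          fi
  # Step formerly in TestShell
  - action: "aws:runShellScript"
    name: "test"
    inputs:
      runCommand:
        - "echo 'test'"
        - "sleep 10"
        - "echo 'test end'"
```

The trade-off is duplication: any change to the reboot logic must be made in every inlined copy.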
Thanks, I can work around it with your suggestion. The nesting is needed merely to stay DRY, as the documents are re-used.
So which agent version fixes it?
OS version: Ubuntu 20.04
Agent version: 3.0.1295.0
Installation: snap channel candidate
I have a custom reboot document:
Running it as a standalone command works fine. The problem is that I also have another custom document, which in turn has some aws:runDocument steps, and one of those documents in turn calls the reboot document. The instance gets into a reboot loop from which it only recovers after canceling the command from the console and then restarting the SSM agent service a couple of times (from ssh). Here are some relevant logs from the agent showing the Go stack trace: