aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0

The `runShellScript` plugin does not handle script failure correctly #252

Open | BRMatt opened this issue 4 years ago

BRMatt commented 4 years ago

We have an SSM document for run command that executes chef on our machines, then runs goss to do some smoke tests that check critical things on the machine are still running. We noticed today that the SSM agent ignores chef's exit code and only considers goss's exit code.

It seems the issue is that the agent executes these commands by combining them all into a single shell script file, with each command separated by a newline, and then executing that script. By default bash uses the exit code of the last command as the exit code of the whole script, and ignores the exit codes of the earlier commands.
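You can reproduce the underlying shell behaviour outside of SSM entirely. The snippet below is only an illustration (the file path and contents are made up, not the agent's actual generated file): a script containing a failing command followed by a passing one still exits 0.

# Illustrative stand-in for the combined script the agent generates.
cat > /tmp/combined-demo.sh <<'EOF'
false   # fails with exit code 1, but nothing checks it...
true    # ...and the script's exit code is taken from this last command
EOF

bash /tmp/combined-demo.sh
echo $?   # prints 0, even though `false` failed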

This creates a confusing situation if you have a minimal run command document and then expand it later with more commands. For example, if you run this document:

{
  "schemaVersion": "2.2",
  "description": "Test script.",
  "mainSteps": [
    {
      "action": "aws:runShellScript",
      "name": "runChefProvision",
      "inputs": {
        "runCommand": [
          "set -x",
          "false"
        ],
        "workingDirectory": "/var/chef-solo"
      }
    }
  ]
}

Then the run will fail with status code 1:

[screenshot: the invocation is reported as failed with status code 1]

But if you then add an extra command after false that always passes:

{
  "schemaVersion": "2.2",
  "description": "Test script.",
  "mainSteps": [
    {
      "action": "aws:runShellScript",
      "name": "runChefProvision",
      "inputs": {
        "runCommand": [
          "set -x",
          "false",
          "true"
        ],
        "workingDirectory": "/tmp"
      }
    }
  ]
}

The command as a whole will pass:

[screenshot: the invocation is reported as succeeded]

The fact that something that previously caused the command to fail suddenly no longer does is quite worrying. I've not been able to find any documentation that describes this pitfall, so I'm guessing it's a bug.

It seems like one way to fix this would be for `CreateScriptFile` to inject `set -e` at the top of the script file, which would cause bash to exit as soon as any command in the script fails?
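To make that concrete, here's a rough sketch of what the combined script for the second example above might look like with the proposed injection (I'm assuming the generated file is literally the runCommand entries joined by newlines, as described earlier; the injected first line is the proposal, not current behaviour):

set -e   # proposed: injected by CreateScriptFile
set -x
false    # with set -e, bash stops here and the script exits with code 1
true     # never reached, so it can no longer mask the earlier failure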

sumohite commented 4 years ago

Hello,

Thank you for reaching out. This is by design, since we let the customer decide when to fail or exit within their scripts.

In the above example, you could make use of shell conditional statements to exit the script if the command fails. For example:

date
if [ $? == 0 ]
then
  echo "succeeded"
else
  echo "failed"
  exit 1
fi

BRMatt commented 4 years ago

Hi @sumohite thanks for replying.

"This is by design"

Is there any documentation that describes this behaviour or states that it is by design? When I mentioned this to an AWS user group back in January, several users of run command were surprised by this behaviour and had to go off and check their documents.

I would suggest that accepting a list of commands to execute as a JSON array is so far removed from how those commands are actually executed that it's unreasonable to expect someone to know they need to check the exit code of each command individually, especially as I haven't been able to find any documentation that mentions this caveat.

I could understand this perspective if the plugin only accepted an entire bash script as a string, but the preferred approach seems to be to define the commands in an array, which doesn't make it easy to use the if/then/else checks you reference.

In our situation, we had a run command document with a single command that ran chef against a machine. If chef failed, the run command execution failed. This was great, and worked as expected.

We wanted to run some smoke tests after chef finished to verify the health of the machine. When we added the second command, our chef command was no longer being checked for failure. We actually had an incident because chef failed and SSM continued applying the broken chef change to our entire fleet, rather than halting once the error threshold we had configured was reached.

justinc1 commented 3 years ago

Documentation: https://docs.aws.amazon.com/systems-manager/latest/userguide/troubleshooting-remote-commands.html, section "A step in my script failed, but the overall status is 'succeeded'". It is misleading that the AWS examples do not indicate this behaviour. I guess bash `set -e` would be helpful here.
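In case it helps anyone else, a possible workaround today (an untested sketch on my part, assuming the agent just joins the runCommand entries into one script as described above) is to make `set -e` the first entry of runCommand, so the combined script aborts on the first failing command. Using the chef/goss scenario from above with made-up invocations, the combined script would effectively be:

set -e                                # first runCommand entry: abort on the first failure
chef-solo -c /var/chef-solo/solo.rb   # placeholder chef invocation; if it fails, the step fails here
goss validate                         # placeholder smoke test; only runs if chef succeeded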