ansible-collections / community.aws

Ansible Collection for Community AWS
GNU General Public License v3.0

ssm connection plugin fails at gathering facts #113

Closed. MVJosh closed this issue 1 year ago.

MVJosh commented 4 years ago
SUMMARY

Using the ssm plugin, playbooks fail at the gathering facts stage, with:

An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ValueError: invalid literal for int() with base 10: "echo $'\\n'$?"
fatal: [i-xxxxx]: FAILED! => {"msg": "Unexpected failure during module execution.", "stdout": ""}

With gather_facts: false, the same error occurs at the first task instead.

As far as I can tell, SSM is configured correctly: I can run aws ssm start-session --target i-xxxxx from my ansible host and successfully get a shell on the target host.

ISSUE TYPE

Bug Report

COMPONENT NAME

ssm connection plugin

ANSIBLE VERSION
ansible 2.9.9
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/home/ubuntu/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/ubuntu/.local/lib/python3.8/site-packages/ansible
  executable location = /home/ubuntu/.local/bin/ansible
  python version = 3.8.2 (default, Apr 27 2020, 15:53:34) [GCC 9.3.0]
CONFIGURATION
INVENTORY_ENABLED(/etc/ansible/ansible.cfg) = ['aws_ec2']
OS / ENVIRONMENT

Both the instance running Ansible and the target (SSM-managed) instance are regular Ubuntu 20.04 AMIs.

STEPS TO REPRODUCE
# Playbook

- hosts: all
  vars:
    ansible_connection: community.aws.aws_ssm
    ansible_aws_ssm_region: eu-west-2 # substitute for your region
  tasks:
    - name: test
      command:
        cmd: ls -l
# aws_ec2.yaml inventory

plugin: aws_ec2
regions:
  - eu-west-2
keyed_groups:
  - prefix: tag
    key: tags
  - prefix: aws_region
    key: placement.region
hostnames:
  - instance-id
compose:
  ansible_host: instance-id

Run:

ansible-playbook -i aws_ec2.yaml -c community.aws.aws_ssm test-playbook.yaml
EXPECTED RESULTS

Playbook runs successfully with no errors

ACTUAL RESULTS

Playbook errors on gathering facts stage, with:

An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ValueError: invalid literal for int() with base 10: "echo $'\\n'$?"
fatal: [i-xxxxx]: FAILED! => {"msg": "Unexpected failure during module execution.", "stdout": ""}

with -vvvv verbosity:

ansible-playbook 2.9.9
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/home/ubuntu/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/ubuntu/.local/lib/python3.8/site-packages/ansible
  executable location = /home/ubuntu/.local/bin/ansible-playbook
  python version = 3.8.2 (default, Apr 27 2020, 15:53:34) [GCC 9.3.0]
Using /etc/ansible/ansible.cfg as config file
setting up inventory plugins
Parsed /home/ubuntu/aws_ec2.yaml inventory source with aws_ec2 plugin
Loading callback plugin default of type stdout, v2.0 from /home/ubuntu/.local/lib/python3.8/site-packages/ansible/plugins/callback/default.py

PLAYBOOK: test_ssm.yaml ****************************************************************************************************************************************************************************************
Positional arguments: test_ssm.yaml
verbosity: 4
connection: community.aws.aws_ssm
timeout: 10
become_method: sudo
tags: ('all',)
inventory: ('/home/ubuntu/aws_ec2.yaml',)
forks: 5
1 plays in test_ssm.yaml

PLAY [all] *****************************************************************************************************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************************************************************************************
task path: /home/ubuntu/test_ssm.yaml:1
<i-xxxxx> ESTABLISH SSM CONNECTION TO: i-xxxxx
<i-xxxxx> SSM COMMAND: ['/usr/local/bin/session-manager-plugin', '{"SessionId": "i-xxxxx-yyyyy", "TokenValue": "XXXXX", "StreamUrl": "wss://ssmmessages.eu-west-2.amazonaws.com/v1/data-channel/i-yyyyy-zzzzz?role=publish_subscribe", "ResponseMetadata": {"RequestId": "xxxx-xxxx-xxxx-xxxx", "HTTPStatusCode": 200, "HTTPHeaders": {"x-amzn-requestid": "xxxx-xxxx-xxxx-xxxx-xxxx", "content-type": "application/x-amz-json-1.1", "content-length": "626", "date": "Fri, 19 Jun 2020 14:41:52 GMT"}, "RetryAttempts": 0}}', 'eu-west-2', 'StartSession', '', '{"Target": "i-xxxxx"}', 'https://ssm.eu-west-2.amazonaws.com']
<i-xxxxx> SSM CONNECTION ID: i-xxxx-xxxxx
<i-xxxxx> EXEC echo ~
<i-xxxxx> _wrap_command: 'echo XXXXX
echo ~
echo $'\n'$?
echo YYYYY
'
<i-xxxxx> EXEC stdout line: 
<i-xxxxx> EXEC stdout line: Starting session with SessionId: i-xxxxx-yyyyy
<i-xxxxx> EXEC remaining: 60
<i-xxxxx> EXEC stdout line: $ stty -echo
<i-xxxxx> EXEC stdout line: PS1=''
<i-xxxxx> EXEC stdout line: echo XXXXX
<i-xxxxx> EXEC stdout line: echo ~
<i-xxxxx> EXEC stdout line: echo $'\n'$?
<i-xxxxx> EXEC stdout line: echo YYYYY
<i-xxxxx> POST_PROCESS: echo ~
echo $'\n'$?
<i-xxxxx> ssm_retry: attempt: 0, caught exception(invalid literal for int() with base 10: "echo $'\\n'$?") from cmd (echo ~...), pausing for 0 seconds
<i-xxxxx> CLOSING SSM CONNECTION TO: i-xxxxx
<i-xxxxx> TERMINATE SSM SESSION: i-xxxxx-yyyyy
<i-xxxxx> ESTABLISH SSM CONNECTION TO: i-xxxxx
<i-xxxxx> SSM COMMAND: ['/usr/local/bin/session-manager-plugin', '{"SessionId": "i-xxxxx-yyyyy", "TokenValue": "XXXXX", "StreamUrl": "wss://ssmmessages.eu-west-2.amazonaws.com/v1/data-channel/i-yyyyy-zzzzz?role=publish_subscribe", "ResponseMetadata": {"RequestId": "xxxx-xxxx-xxxx-xxxx-xxxx", "HTTPStatusCode": 200, "HTTPHeaders": {"x-amzn-requestid": "xxxx-xxxx-xxxx-xxxx-xxxx", "content-type": "application/x-amz-json-1.1", "content-length": "626", "date": "Fri, 19 Jun 2020 14:41:53 GMT"}, "RetryAttempts": 0}}', 'eu-west-2', 'StartSession', '', '{"Target": "i-xxxxx"}', 'https://ssm.eu-west-2.amazonaws.com']
<i-xxxxx> SSM CONNECTION ID: i-xxxx-xxxx
<i-xxxxx> EXEC echo ~
<i-xxxxx> _wrap_command: 'echo YYYYY
echo ~
echo $'\n'$?
echo ZZZZZ
'
<i-xxxxx> EXEC stdout line: 
<i-xxxxx> EXEC stdout line: Starting session with SessionId: i-xxxx-yyyy
<i-xxxxx> EXEC remaining: 60
<i-xxxxx> EXEC stdout line: $ stty -echo
<i-xxxxx> EXEC stdout line: PS1=''
<i-xxxxx> EXEC stdout line: echo YYYYY
<i-xxxxx> EXEC stdout line: echo ~
<i-xxxxx> EXEC stdout line: echo $'\n'$?
<i-xxxxx> EXEC stdout line: echo ZZZZZ
<i-xxxxx> POST_PROCESS: echo ~
echo $'\n'$?
<i-xxxxx> CLOSING SSM CONNECTION TO: i-xxxxx
<i-xxxxx> TERMINATE SSM SESSION: i-xxxxx-yyyyy
The full traceback is:
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ansible/executor/task_executor.py", line 146, in run
    res = self._execute()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ansible/executor/task_executor.py", line 645, in _execute
    result = self._handler.run(task_vars=variables)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ansible/plugins/action/gather_facts.py", line 79, in run
    res = self._execute_module(module_name=fact_module, module_args=mod_args, task_vars=task_vars, wrap_async=False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ansible/plugins/action/__init__.py", line 780, in _execute_module
    self._make_tmp_path()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ansible/plugins/action/__init__.py", line 343, in _make_tmp_path
    tmpdir = self._remote_expand_user(self.get_shell_option('remote_tmp', default='~/.ansible/tmp'), sudoable=False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ansible/plugins/action/__init__.py", line 664, in _remote_expand_user
    data = self._low_level_execute_command(cmd, sudoable=False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/ansible/plugins/action/__init__.py", line 1075, in _low_level_execute_command
    rc, stdout, stderr = self._connection.exec_command(cmd, in_data=in_data, sudoable=sudoable)
  File "/home/ubuntu/.ansible/collections/ansible_collections/community/aws/plugins/connection/aws_ssm.py", line 197, in wrapped
    return_tuple = func(self, *args, **kwargs)
  File "/home/ubuntu/.ansible/collections/ansible_collections/community/aws/plugins/connection/aws_ssm.py", line 389, in exec_command
    returncode, stdout = self._post_process(stdout, mark_begin)
  File "/home/ubuntu/.ansible/collections/ansible_collections/community/aws/plugins/connection/aws_ssm.py", line 442, in _post_process
    returncode = int(stdout.splitlines()[-2])
ValueError: invalid literal for int() with base 10: "echo $'\\n'$?"
fatal: [i-xxxxx]: FAILED! => {
    "msg": "Unexpected failure during module execution.",
    "stdout": ""
}
ilyavaiser commented 4 years ago

Same issue; the plugin is unusable here.

abeluck commented 4 years ago

I looked into this, and I think there are several things wrong.

See this line in _wrap_command:

            cmd = "echo " + mark_start + "\n" + cmd + "\necho $'\\n'$?\n" + "echo " + mark_end + "\n"

The shell used by Session Manager is sh, which does not support bashisms like $'\n' for printing a newline. See the difference:

sh

$ echo $'\n'$?
$
0

bash

$ echo $'\n'$?

0
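For reference, $'...' expansion is a bash/ksh feature; the portable way to get the same output under sh is printf. Below is a minimal sketch (a hypothetical helper, not the plugin's actual code) of how the wrapper string could be built with POSIX printf instead:

# Hypothetical, POSIX-compatible version of the wrapper string built in
# _wrap_command: printf behaves the same under sh and bash, unlike $'\n'.
def wrap_command(cmd, mark_start, mark_end):
    # Emit the start marker, run the command, then print the exit code and
    # the end marker on their own lines so _post_process can parse them.
    return (
        "printf '%s\\n' '" + mark_start + "'\n"
        + cmd + "\n"
        + "printf '\\n%s\\n%s\\n' \"$?\" '" + mark_end + "'\n"
    )

print(wrap_command("echo ~", "XXXXX", "YYYYY"))

(The patch further down in this thread takes essentially this approach.)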

However, that might be a red herring. Notice the output:

<i-xxxxx> EXEC stdout line: $ stty -echo
<i-xxxxx> EXEC stdout line: PS1=''
<i-xxxxx> EXEC stdout line: echo YYYYY
<i-xxxxx> EXEC stdout line: echo ~
<i-xxxxx> EXEC stdout line: echo $'\n'$?
<i-xxxxx> EXEC stdout line: echo ZZZZZ

Those are the input commands, not the stdout. Where is the output?

abeluck commented 4 years ago

Here is a workaround that avoids sh entirely by starting a bash shell.

--- a/plugins/connection/aws_ssm.py 2020-09-03 18:43:43.818000000 +0200
+++ b/plugins/connection/aws_ssm.py 2020-09-03 18:43:19.805000000 +0200
@@ -288,11 +288,17 @@

         profile_name = ''
         region_name = self.get_option('region')
-        ssm_parameters = dict()

         client = boto3.client('ssm', region_name=region_name)
         self._client = client
-        response = client.start_session(Target=self.instance_id, Parameters=ssm_parameters)
+
+        if self.is_windows:
+            ssm_parameters = dict()
+            response = client.start_session(Target=self.instance_id, Parameters=ssm_parameters)
+        else:
+            ssm_parameters = {"command": ["bash -l"]}
+            response = client.start_session(Target=self.instance_id, DocumentName="AWS-StartInteractiveCommand", Parameters=ssm_parameters)
+
         self._session_id = response['SessionId']

         cmd = [
DanielFallon commented 4 years ago

I looked at this for a bit today and found that you were right to be skeptical, @abeluck.

The issue actually appears to be in _prepare_terminal. The current code does not wait for stty -echo to return before sending additional bytes, so all characters sent before stty -echo takes effect are echoed back, and any characters written afterwards are not.

Since the commands are sent so quickly, the terminal echoes all of the lines. We then break out of the loop because <start_mark> is, of course, present in echo <start_mark>. You can quickly fix the behavior by adding a delay (and probably also applying the fix for the bashism). Here are some logs to demonstrate:

After removing the loop breakout:

Loading callback plugin minimal of type stdout, v2.0 from /usr/lib/python3.8/site-packages/ansible/plugins/callback/minimal.py
META: ran handlers
<ip-___-__-_-___.ec2.internal> ESTABLISH SSM CONNECTION TO: i-0cc44a1e2f7995c53
<ip-___-__-_-___.ec2.internal> SSM COMMAND: ['/usr/local/bin/session-manager-plugin',...]
<ip-___-__-_-___.ec2.internal> SSM CONNECTION ID: DanFallon-095d5d09e63ce624c
<ip-___-__-_-___.ec2.internal> EXEC echo ~ubuntu
<ip-___-__-_-___.ec2.internal> _wrap_command: 'echo PKjewgTIHJcLjaXeCjlasDjfZw
echo ~ubuntu
echo $'\n'$?
echo uZmXXhFRmkCzCEZEPjWMQlNiIF
'
<ip-___-__-_-___.ec2.internal> EXEC stdout line: 
<ip-___-__-_-___.ec2.internal> EXEC stdout line: Starting session with SessionId: DanFallon-095d5d09e63ce624c
<ip-___-__-_-___.ec2.internal> EXEC remaining: 60
<ip-___-__-_-___.ec2.internal> EXEC stdout line: $ stty -echo ; 
<ip-___-__-_-___.ec2.internal> EXEC stdout line: PS1='' ; 
<ip-___-__-_-___.ec2.internal> EXEC stdout line: echo PKjewgTIHJcLjaXeCjlasDjfZw
<ip-___-__-_-___.ec2.internal> EXEC stdout line: echo ~ubuntu
<ip-___-__-_-___.ec2.internal> EXEC stdout line: echo $'\n'$?
<ip-___-__-_-___.ec2.internal> EXEC stdout line: echo uZmXXhFRmkCzCEZEPjWMQlNiIF
<ip-___-__-_-___.ec2.internal> EXEC stdout line: $ PKjewgTIHJcLjaXeCjlasDjfZw
<ip-___-__-_-___.ec2.internal> EXEC stdout line: /home/ubuntu
<ip-___-__-_-___.ec2.internal> EXEC stdout line: $
<ip-___-__-_-___.ec2.internal> EXEC stdout line: 0
<ip-___-__-_-___.ec2.internal> EXEC stdout line: uZmXXhFRmkCzCEZEPjWMQlNiIF
<ip-___-__-_-___.ec2.internal> EXEC remaining: 59
<ip-___-__-_-___.ec2.internal> EXEC remaining: 58
<ip-___-__-_-___.ec2.internal> EXEC remaining: 57
<ip-___-__-_-___.ec2.internal> EXEC remaining: 56
<ip-___-__-_-___.ec2.internal> EXEC remaining: 55
<ip-___-__-_-___.ec2.internal> EXEC remaining: 54
<ip-___-__-_-___.ec2.internal> EXEC remaining: 53
...

After adding a 5 second delay

Loading callback plugin minimal of type stdout, v2.0 from /usr/lib/python3.8/site-packages/ansible/plugins/callback/minimal.py
META: ran handlers
<ip-___-__-_-___.ec2.internal> ESTABLISH SSM CONNECTION TO: i-0cc44a1e2f7995c53
<ip-___-__-_-___.ec2.internal> SSM COMMAND: ['/usr/local/bin/session-manager-plugin', ...]
Sleeping for 5 seconds.
<ip-___-__-_-___.ec2.internal> SSM CONNECTION ID: DanFallon-05193c016e933c229
<ip-___-__-_-___.ec2.internal> EXEC echo ~ubuntu
<ip-___-__-_-___.ec2.internal> _wrap_command: 'echo KylNOeAyCbfmTzlCZGrbHiYnSm
echo ~ubuntu
echo $'\n'$?
echo PujmuNjsUIsmGZzoUwNAUvowBk
'
<ip-___-__-_-___.ec2.internal> EXEC stdout line: 
<ip-___-__-_-___.ec2.internal> EXEC stdout line: Starting session with SessionId: DanFallon-05193c016e933c229
<ip-___-__-_-___.ec2.internal> EXEC stdout line: $ stty -echo ; 
<ip-___-__-_-___.ec2.internal> EXEC stdout line: PS1='' ; 
<ip-___-__-_-___.ec2.internal> EXEC stdout line: $ KylNOeAyCbfmTzlCZGrbHiYnSm
<ip-___-__-_-___.ec2.internal> EXEC stdout line: /home/ubuntu
<ip-___-__-_-___.ec2.internal> EXEC stdout line: $
<ip-___-__-_-___.ec2.internal> EXEC stdout line: 0
<ip-___-__-_-___.ec2.internal> EXEC stdout line: PujmuNjsUIsmGZzoUwNAUvowBk
<ip-___-__-_-___.ec2.internal> EXEC remaining: 60
<ip-___-__-_-___.ec2.internal> EXEC remaining: 59
<ip-___-__-_-___.ec2.internal> EXEC remaining: 58
<ip-___-__-_-___.ec2.internal> EXEC remaining: 57
<ip-___-__-_-___.ec2.internal> EXEC remaining: 56
<ip-___-__-_-___.ec2.internal> EXEC remaining: 55
abeluck commented 4 years ago

Would love to see these fixes land, and maybe get some tests around this plugin (not sure how that would work).

We bailed on the SSM plugin for now and are running Ansible over SSH over SSM, which means we still have to manage SSH keys on instances (though we no longer need a bastion).

mikeneiderhauser commented 4 years ago

Another strange issue I'm seeing related to this during fact gathering: the AnsiballZ_setup.py that gets run does something to mess with the shell, so execution always hits EXEC remaining and eventually times out without ever echoing the newline and the end mark to stdout. My remote host is running an older version of Python, so my initial suspicion is that this is related. I'm trying to stand up a venv on the remote host to use as the environment.

DanielFallon commented 4 years ago

Here's a patch that I'm willing to contribute (haven't gotten a chance to set up a fork yet)

Problems I still see:

Anyway, I freely release the code below under a GPL v3.0+ license, like the files from which it was derived. I can look at figuring out the contributor agreement etc. later this week.

diff --git a/plugins/connection/aws_ssm.py b/plugins/connection/aws_ssm.py
index 7f7d692..4dc3965 100644
--- a/plugins/connection/aws_ssm.py
+++ b/plugins/connection/aws_ssm.py
@@ -412,11 +412,97 @@ class Connection(ConnectionBase):
     def _prepare_terminal(self):
         ''' perform any one-time terminal settings '''

-        if not self.is_windows:
-            cmd = "stty -echo\n" + "PS1=''\n"
-            cmd = to_bytes(cmd, errors='surrogate_or_strict')
-            self._session.stdin.write(cmd)
+        # No windows setup for now
+        if self.is_windows:
+            return
+
+        # *_complete variables are 3 valued:
+        #   - None: not started
+        #   - False: started
+        #   - True: complete
+

+        startup_complete = False
+        startup_reply = re.compile(
+            r"Starting session with SessionId:\W+" +
+            re.escape(self._session_id) +
+            r"\r\n\$ ", re.MULTILINE)
+
+        disable_echo_complete = None
+        disable_echo_cmd = to_bytes("stty -echo\n", errors='surrogate_or_strict')
+        disable_echo_reply = re.compile(
+            r"stty \-echo" +
+            r"\r\r\n\$", re.MULTILINE
+        )
+
+        disable_prompt_complete = None
+        end_mark = "".join([random.choice(string.ascii_letters) for i in xrange(self.MARK_LENGTH)])
+        disable_prompt_cmd = to_bytes(
+            "PS1='' ; printf '\\n%s\\n' '" + end_mark + "'\n",
+            errors='surrogate_or_strict')
+        disable_prompt_reply = re.compile(
+            r"\r\r\n" +
+            re.escape(end_mark) +
+            r"\r\r\n", re.MULTILINE
+        )
+
+        stdout = ""
+        cursor = 0
+        # Custom command execution for when we're waiting for startup
+        stop_time = int(round(time.time())) + self.get_option('ssm_timeout')
+        while (not disable_prompt_complete) and (self._session.poll() is None):
+            remaining = stop_time - int(round(time.time()))
+            if remaining < 1:
+                self._timeout = True
+                display.vvvv(u"PRE timeout stdout: {0}".format(to_bytes(stdout)), host=self.host)
+                raise AnsibleConnectionFailure("SSM start_session timeout on host: %s"
+                                               % self.instance_id)
+            if self._poll_stdout.poll(1000):
+                stdout += to_text(self._stdout.read(1024))
+                display.vvvv(u"PRE stdout line: {0}".format(to_bytes(stdout)), host=self.host)
+            else:
+                display.vvvv(u"PRE remaining: {0}".format(remaining), host=self.host)
+
+            # wait til prompt is ready
+            if startup_complete is False:
+                match = startup_reply.search(stdout,cursor)
+                if match:
+                    display.vvvv(u"PRE startup output received", host=self.host)
+                    cursor = match.end()
+                    startup_complete = True
+                    
+
+            # disable echo
+            if startup_complete and (disable_echo_complete is None):
+                display.vvvv(u"PRE Disabling Echo: {0}".format(disable_echo_cmd), host=self.host)
+                self._session.stdin.write(disable_echo_cmd)
+                disable_echo_complete = False
+
+            if disable_echo_complete is False:
+                match = disable_echo_reply.search(stdout)
+                if match:
+                    stdout = stdout[match.end():]
+                    disable_echo_complete = True
+            
+
+            # disable prompt
+            if disable_echo_complete and disable_prompt_complete is None:
+                display.vvvv(u"PRE Disabling Prompt: {0}".format(disable_prompt_cmd), host=self.host)
+                self._session.stdin.write(disable_prompt_cmd)
+                disable_prompt_complete = False
+
+            if disable_prompt_complete is False:
+                match = disable_prompt_reply.search(stdout)
+                if match:
+                    stdout = stdout[match.end():]
+                    disable_prompt_complete = True
+        
+        if not disable_prompt_complete:
+            raise AnsibleConnectionFailure("SSM process closed during _prepare_terminal on host: %s"
+                                               % self.instance_id)
+        else:
+            display.vvv(u"PRE Terminal configured", host=self.host)
+            
     def _wrap_command(self, cmd, sudoable, mark_start, mark_end):
         ''' wrap command so stdout and status can be extracted '''

@@ -427,7 +513,9 @@ class Connection(ConnectionBase):
         else:
             if sudoable:
                 cmd = "sudo " + cmd
-            cmd = "echo " + mark_start + "\n" + cmd + "\necho $'\\n'$?\n" + "echo " + mark_end + "\n"
+            cmd = ("printf '%s\\n' '{0}' ;\n".format(mark_start) +
+                cmd + 
+                " ;\nprintf '\\n%s\\n%s\\n' \"$?\" '{0}' ;\n".format(mark_end))

         display.vvvv(u"_wrap_command: '{0}'".format(to_text(cmd)), host=self.host)
         return cmd
mikeneiderhauser commented 4 years ago

I'm willing to try it because I'm stuck. My executing node is running Python 3; my endpoint is only running 2.7 (legacy OS). So the plugin being Python 3-only should be fine. The Python 2 issues may just be with what Ansible packages up and sends over to the endpoint.

mikeneiderhauser commented 4 years ago

@DanielFallon thanks for the patch. I had to change your ssm_timeout to timeout in the self.get_option call.

Upon executing, I'm getting stuck in a PRE remaining loop (i.e. the prompt never becomes ready) during fact gathering.

DanielFallon commented 4 years ago

The change in the get_option call is because my patch was written against the development version rather than the currently released version (the option was recently renamed from timeout to ssm_timeout).

Can you post a redacted log with debug level at least 4 (-vvvv)? I'd like to see which part of the setup loop it's getting stuck in.

There are roughly three phases: first, it waits to receive the string:

Starting session with SessionId:

second, it waits to receive the string:

stty -echo
$

and third, it waits to receive a custom-generated end string:

<end_mark>

I expected this code to be a bit fragile but fail-safe. Maybe I should add some more detail to the timeout exception.

mikeneiderhauser commented 4 years ago

I can grab you a log, but it may take some time; my inventory script, which sets up the S3 bucket name and instance ID, isn't populating as expected right now.

DanielFallon commented 4 years ago

It's also worth noting that I only tested this with a sort of hello-world example, so I wouldn't be surprised if there are still things that don't work. I do think some sort of scripted regression testing would be very valuable here, as this is fragile and could be broken by:

all things to enumerate and test

mikeneiderhauser commented 4 years ago

Does your patch apply to the 1.2 tagged release or to the main branch?

mikeneiderhauser commented 4 years ago

Please see the sanitized output below. I took stock 1.2, added just your patch, and got the following:

$ ansible-playbook $ANS_VAULT_FLAGS --limit TEST-GRP --user testuser --tags debug-vars common_aws_node.yml -vvvvvv
ansible-playbook 2.9.9
  config file = /home/user/ansibletest/ansible.cfg
  configured module search path = ['/home/user/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/user/venvs/ansible/lib64/python3.6/site-packages/ansible
  executable location = /home/user/venvs/ansible/bin/ansible-playbook
  python version = 3.6.8 (default, Apr  2 2020, 13:34:55) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
Using /home/user/ansibletest/ansible.cfg as config file
Reading vault password file: XXXX
Reading vault password file: XXXX
Reading vault password file: XXXX
setting up inventory plugins
host_list declined parsing /home/user/ansibletest/inventories/dynamic_inv.yml as it did not pass its verify_file() method
Start parsing DYN inventory
Blacklisted hosts
Done parsing DYN inventory
Parsed /home/user/ansibletest/inventories/dynamic_inv.yml inventory source with auto plugin
statically imported: /home/user/ansibletest/roles/basic-provisioning/tasks/10_install_start_aws_ssm.yml
Loading callback plugin default of type stdout, v2.0 from /home/user/venvs/ansible/lib64/python3.6/site-packages/ansible/plugins/callback/default.py

PLAYBOOK: common_aws_node.yml ************************************************************************************************************************************************************************************************************************
Positional arguments: common_aws_node.yml
verbosity: 6
remote_user: testuser
connection: smart
timeout: 10
become_method: sudo
tags: ('debug-vars',)
inventory: ('/home/user/ansibletest/inventories/dynamic_inv.yml',)
subset: TEST-GRP
vault_ids: XXXX
forks: 5
7 plays in common_aws_node.yml

PLAY [provision] *************************************************************************************************************************************************************************************************************************************
META: ran handlers
META: ran handlers
META: ran handlers

--- CUT --- (Skipped roles)

PLAY [debug-vars] ************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] *******************************************************************************************************************************************************************************************************************************
task path: /home/user/ansibletest/common_aws_node.yml:59
<10.x.x.x> ESTABLISH SSM CONNECTION TO: i-0AWS_INSTANCEID
<10.x.x.x> SSM COMMAND: ['/usr/local/bin/session-manager-plugin', '{"SessionId": "user@domain.com-xxxx", "TokenValue": "XXXCUTXXX", "StreamUrl": "wss://ssmmessages.us-east-1.amazonaws.com/v1/data-channel/user@domain.com-xxxx?role=publish_subscribe", "ResponseMetadata": {"RequestId": "XXX REQ ID XXX", "HTTPStatusCode": 200, "HTTPHeaders": {"server": "Server", "date": "Fri, 02 Oct 2020 14:49:59 GMT", "content-type": "application/x-amz-json-1.1", "content-length": "668", "connection": "keep-alive", "x-amzn-requestid": "XXX amzn id XXX"}, "RetryAttempts": 0}}', 'us-east-1', 'StartSession', '', '{"Target": "i-0AWS_INSTANCEID"}', 'https://ssm.us-east-1.amazonaws.com']
<10.x.x.x> PRE stdout line: b'\r\nStarting session with SessionId: user@domain.com-xxxx\r\n'
<10.x.x.x> PRE remaining: 60
<10.x.x.x> PRE stdout line: b'\r\nStarting session with SessionId: user@domain.com-xxxx\r\n\x1b[?1034hsh-4.2$ '
<10.x.x.x> PRE remaining: 59
<10.x.x.x> PRE remaining: 58
--- CUT ---
<10.x.x.x> PRE remaining: 3
<10.x.x.x> PRE remaining: 2
<10.x.x.x> PRE remaining: 1
<10.x.x.x> PRE timeout stdout: b'\r\nStarting session with SessionId: user@domain.com-xxxx\r\n\x1b[?1034hsh-4.2$ '
<10.x.x.x> ssm_retry: attempt: 0, cmd (echo ~testuser...), pausing for 0 seconds
<10.x.x.x> CLOSING SSM CONNECTION TO: i-0AWS_INSTANCEID
<10.x.x.x> TERMINATE SSM SESSION: user@domain.com-xxxx
<10.x.x.x> ESTABLISH SSM CONNECTION TO: i-0AWS_INSTANCEID
<10.x.x.x> SSM COMMAND: ['/usr/local/bin/session-manager-plugin', '{"SessionId": "user@domain.com-xxxx", "TokenValue": "XXXCUTXXX", "StreamUrl": "wss://ssmmessages.us-east-1.amazonaws.com/v1/data-channel/user@domain.com-xxxx?role=publish_subscribe", "ResponseMetadata": {"RequestId": "XXX REQ ID XXX", "HTTPStatusCode": 200, "HTTPHeaders": {"server": "Server", "date": "Fri, 02 Oct 2020 14:51:00 GMT", "content-type": "application/x-amz-json-1.1", "content-length": "668", "connection": "keep-alive", "x-amzn-requestid": "XXX amzn id XXX"}, "RetryAttempts": 0}}', 'us-east-1', 'StartSession', '', '{"Target": "i-0AWS_INSTANCEID"}', 'https://ssm.us-east-1.amazonaws.com']
<10.x.x.x> PRE stdout line: b'\r\nStarting session with SessionId: user@domain.com-xxxx\r\n'
<10.x.x.x> PRE remaining: 60
<10.x.x.x> PRE stdout line: b'\r\nStarting session with SessionId: user@domain.com-xxxx\r\n\x1b[?1034hsh-4.2$ '
<10.x.x.x> PRE remaining: 59
<10.x.x.x> PRE remaining: 58

I also added the bash shell patch (but changed bash -l to bash --noprofile -l), with a similar result.

mikeneiderhauser commented 4 years ago

I modified some of the regexes in the patch (mainly removing the \r\n's). Now I'm hitting the same issue: the AnsiballZ_setup.py gets transferred and executed, and it prints out the JSON for the facts, but I never get the end mark, i.e. I never see the output of the second printf:

sudo sudo -H -S -n  -u root /bin/sh -c 'echo BECOME-SUCCESS-lzrbkhycxnugzysliwrhdukemyhhtyde ; /usr/bin/python /home/maintuser/.ansible/tmp/ansible-tmp-1601664494.6311371-10330-41749016017816/AnsiballZ_setup.py' ;
printf '\n%s\n%s\n' "$?" 'icjRpqBJJHvCxBxLeqZMYeJKhQ' ;

Then it goes back to the retry loop.

DanielFallon commented 4 years ago

The bash patch is definitely incompatible with mine (because the regex would have to change).

Does python ansible-tmp-1601664494.6311371-10330-41749016017816/AnsiballZ_setup.py open stdin? If so, it might be eating the printf command (meaning the shell would never receive it).

This is one of the things I was worried about in my comment above. I'd like to look at some other connection plugins and see how they handle this, because a lot of the pieces here are pretty fragile and I want to see others' ideas.

Maybe instead of clearing the prompt, we could set it to an end_mark so that the shell will print it.
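A tiny sketch of that idea (hypothetical, not part of any patch in this thread): generate a random marker and set it as the prompt, so the shell itself emits the marker whenever a command finishes:

import random
import string

# Hypothetical variation on the patches above: rather than clearing PS1, set
# it to a random marker so the prompt itself signals command completion.
MARK_LENGTH = 26
end_mark = "".join(random.choice(string.ascii_letters) for _ in range(MARK_LENGTH))

# This line would be written to the session's stdin instead of "PS1=''\n".
prompt_as_marker_cmd = ("PS1='" + end_mark + "'\n").encode("utf-8")
print(prompt_as_marker_cmd)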

mikeneiderhauser commented 4 years ago

I tried both sh and bash (with and without modifying the regexes) and both behave similarly. I do think AnsiballZ_setup.py is doing something to stdin.

DanielFallon commented 4 years ago

This is not an SSM-specific solution, but I'm tempted to try using SendSSHPublicKey from the EC2 Instance Connect API and then wrapping the SSH connection library instead. It feels like it might be generally more stable (and would support SFTP as well).

SSM might still be necessary for the tunnel, but I'm not sure piping commands to stdin is ever really going to be particularly stable.

Is there any reason why SSM StartSession is superior that I should know about?

DanielFallon commented 4 years ago

Looking at this a bit closer, it looks like the right way to address this, if SSM is to be used, is to add a parameter like ssm_exec_document_name. I hadn't had a chance to use SSM much before now and wasn't very familiar with its API. In the current version of the API, the document name can define most of the behavior of how an interactive session starts, although it unfortunately still seems to start with a raw shell (see here: https://github.com/aws/amazon-ssm-agent/blob/b9654b268afcb7e70a9cc6c6d9b7d2a676f5b468/agent/session/plugins/shell/shell_unix.go#L53 )

Because customization of the shell is so fragile, it's probably worthwhile to provide a sane default and then allow users to override the --document-name provided to the session to configure things further. To facilitate this, the important part is to establish a protocol for communicating stdout and the exit code back to the client (preferably one that is slightly more robust than what is present now, so that a user can always produce working results for their system).
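As an illustration only (the document name, region, instance ID, and command below are placeholders, not plugin behaviour), passing a caller-chosen document to start_session looks roughly like this with boto3:

import boto3

# Sketch: start an SSM session with a custom document instead of the default
# shell document; something like an ssm_exec_document_name option could feed
# the DocumentName argument.
ssm = boto3.client("ssm", region_name="us-east-1")

response = ssm.start_session(
    Target="i-0123456789abcdef0",
    DocumentName="AWS-StartInteractiveCommand",
    Parameters={"command": ["bash --noprofile --norc"]},
)
print(response["SessionId"])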

I'll give it another half an hour today and post results

mikeneiderhauser commented 4 years ago

@DanielFallon I'd be interested to see your results. This seems to be a step in the right direction.

jiba21 commented 4 years ago

Here is a workaround that avoids sh entirely by starting a bash shell.

[...]

I think this solution only works partially, because it does not close the sessions in Session Manager, and the system opens one session for every task in Ansible. So if I execute a playbook with 10 tasks, 10 sessions are opened and never closed...

mikeneiderhauser commented 4 years ago

The above won't work if there is anything custom about the system's (or user's) bash profile that changes what is presented on stdout. I've tried this even with --noprofile and it still hangs for me when gathering facts.

DanielFallon commented 4 years ago

@mikeneiderhauser Hey, by the way, I did rewrite this using Paramiko SSH over the weekend. It's still a bit grody but should work. The biggest problem is that the connections aren't cached, so there are a LOT of AWS API calls; I didn't realize that Paramiko didn't support persistent connections. I'll have to switch it to borrow from the SSH connection plugin and leverage SSH pipelining instead. Hopefully I can detect whether or not a pipeline is open and avoid restarting the connection using that plugin's logic; if not, I'll have to build a disk-based cache of some kind for the key timeout.

(Also, sorry for the extra cruft from trying to cache things; I forgot that Ansible forks a new process for each connection.) Here's what I tried, and I think this finally should work just as well as the ssh plugin, but it will exhaust your AWS API rate limit rather quickly: https://gist.github.com/DanielFallon/dffad373c688da32919e709d6738d715

DanielFallon commented 4 years ago

As another workaround, here's a proxy command that should just work to allow connections via ec2-instance-connect and then proxy via ssm:

https://gist.github.com/DanielFallon/45310cc76f46c1f1b2f7272d19b76312

Technically this is significantly more efficient than the plugin in the gist above, because SSH multiplexing will be used if it is available (and it is by default).

mikeneiderhauser commented 4 years ago

@DanielFallon thanks for this. I'll take a look.

DanielFallon commented 4 years ago

I'll turn this into a pull request later this week, but here's a plugin that uses ec2-instance-connect to push a temporary ssh key to the host and uses this to authenticate.

If paired with a ProxyCommand for SSM, this should be pretty effective: https://gist.github.com/DanielFallon/67572f4439602774d02fbbe3bce8a5b9
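For reference, the push-a-temporary-key half of that approach is a single EC2 Instance Connect API call; AWS only accepts the key for about 60 seconds, so the SSH connection has to follow immediately. A rough sketch (instance ID, user, key path, and region are placeholders):

import boto3

# Sketch: push a short-lived public key to the instance via EC2 Instance
# Connect, then authenticate a normal SSH connection with the matching
# private key (for example through an SSM ProxyCommand).
ec2_ic = boto3.client("ec2-instance-connect", region_name="us-east-1")

with open("/tmp/ansible_temp_key.pub") as f:
    public_key = f.read()

ec2_ic.send_ssh_public_key(
    InstanceId="i-0123456789abcdef0",
    InstanceOSUser="ubuntu",
    SSHPublicKey=public_key,
    AvailabilityZone="us-east-1a",
)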

thomas-anderson-bsl commented 3 years ago

does python ansible-tmp-1601664494.6311371-10330-41749016017816/AnsiballZ_setup.py open stdin? if yes, then it might be eating the printf command (meaning that the shell would never receive it.)

Yes, I'm finding that the python command eats the printf. I can get around it by adding echo | {cmd}; to the wrapped command, and the end-mark printf then appears.

Here's the full patch, making use of your previous patch, @DanielFallon.

I made some minor changes so we don't use the regexes, as they were being fickle.

diff --git a/playbooks/connection_plugins/aws_ssm2.py b/playbooks/connection_plugins/aws_ssm2.py
index 1c01dba..baa15f3 100644
--- a/playbooks/connection_plugins/aws_ssm2.py
+++ b/playbooks/connection_plugins/aws_ssm2.py
@@ -444,10 +444,94 @@ class Connection(ConnectionBase):
     def _prepare_terminal(self):
         """ perform any one-time terminal settings """

-        if not self.is_windows:
-            cmd = "stty -echo\n" + "PS1=''\n"
-            cmd = to_bytes(cmd, errors="surrogate_or_strict")
-            self._session.stdin.write(cmd)
+        # No windows setup for now
+        if self.is_windows:
+            return
+
+        # *_complete variables are 3 valued:
+        #   - None: not started
+        #   - False: started
+        #   - True: complete
+
+        startup_complete = False
+        disable_echo_complete = None
+        disable_echo_cmd = to_bytes("stty -echo\n", errors="surrogate_or_strict")
+
+        disable_prompt_complete = None
+        end_mark = "".join(
+            [random.choice(string.ascii_letters) for i in xrange(self.MARK_LENGTH)]
+        )
+        disable_prompt_cmd = to_bytes(
+            "PS1='' ; printf '\\n%s\\n' '" + end_mark + "'\n",
+            errors="surrogate_or_strict",
+        )
+        disable_prompt_reply = re.compile(
+            r"\r\r\n" + re.escape(end_mark) + r"\r\r\n", re.MULTILINE
+        )
+
+        stdout = ""
+        # Custom command execution for when we're waiting for startup
+        stop_time = int(round(time.time())) + self.get_option("ssm_timeout")
+        while (not disable_prompt_complete) and (self._session.poll() is None):
+            remaining = stop_time - int(round(time.time()))
+            if remaining < 1:
+                self._timeout = True
+                display.vvvv(
+                    "PRE timeout stdout: {0}".format(to_bytes(stdout)), host=self.host
+                )
+                raise AnsibleConnectionFailure(
+                    "SSM start_session timeout on host: %s" % self.instance_id
+                )
+            if self._poll_stdout.poll(1000):
+                stdout += to_text(self._stdout.read(1024))
+                display.vvvv(
+                    "PRE stdout line: {0}".format(to_bytes(stdout)), host=self.host
+                )
+            else:
+                display.vvvv("PRE remaining: {0}".format(remaining), host=self.host)
+
+            # wait til prompt is ready
+            if startup_complete is False:
+                match = str(stdout).find("Starting session with SessionId")
+                if match != -1:
+                    display.vvvv("PRE startup output received", host=self.host)
+                    startup_complete = True
+
+            # disable echo
+            if startup_complete and (disable_echo_complete is None):
+                display.vvvv(
+                    "PRE Disabling Echo: {0}".format(disable_echo_cmd), host=self.host
+                )
+                self._session.stdin.write(disable_echo_cmd)
+                disable_echo_complete = False
+
+            if disable_echo_complete is False:
+                match = str(stdout).find("stty -echo")
+                if match != -1:
+                    disable_echo_complete = True
+
+            # disable prompt
+            if disable_echo_complete and disable_prompt_complete is None:
+                display.vvvv(
+                    "PRE Disabling Prompt: {0}".format(disable_prompt_cmd),
+                    host=self.host,
+                )
+                self._session.stdin.write(disable_prompt_cmd)
+                disable_prompt_complete = False
+
+            if disable_prompt_complete is False:
+                match = disable_prompt_reply.search(stdout)
+                if match:
+                    stdout = stdout[match.end() :]
+                    disable_prompt_complete = True
+
+        if not disable_prompt_complete:
+            raise AnsibleConnectionFailure(
+                "SSM process closed during _prepare_terminal on host: %s"
+                % self.instance_id
+            )
+        else:
+            display.vvv("PRE Terminal configured", host=self.host)

     def _wrap_command(self, cmd, sudoable, mark_start, mark_end):
         """ wrap command so stdout and status can be extracted """
@@ -460,14 +544,9 @@ class Connection(ConnectionBase):
             if sudoable:
                 cmd = "sudo " + cmd
             cmd = (
-                "echo "
-                + mark_start
-                + "\n"
-                + cmd
-                + "\necho $'\\n'$?\n"
-                + "echo "
-                + mark_end
-                + "\n"
+                f"printf '%s\\n' '{mark_start}';\n"
+                f"echo | {cmd};\n"
+                f"printf '\\n%s\\n%s\\n' \"$?\" '{mark_end}';\n"
             )

         display.vvvv(u"_wrap_command: '{0}'".format(to_text(cmd)), host=self.host)

I put the patched version of aws_ssm2.py into a connection_plugins/ directory inside my playbooks directory, set ansible_connection: aws_ssm2, and was able to successfully use SSM to run playbooks against my Ubuntu EC2 instance.

jonormann commented 3 years ago

Hi, is there an official solution? I am facing the same problem.

wilinger commented 3 years ago

Hi,

is there an official solution?

I am facing the same problem.

Same

lassebenni commented 3 years ago

does python ansible-tmp-1601664494.6311371-10330-41749016017816/AnsiballZ_setup.py open stdin? if yes, then it might be eating the printf command (meaning that the shell would never receive it.)

Yes I'm finding that the python command eats the printf. I can get around it by adding an echo | {cmd}; to the wrapped command, and the mark end printf seems to appear.

Here's the full patch, making use of your previous patch @DanielFallon

I made some minor changes so we don't use the regexes, as they were being fickle.

[...]

I put the patched version of aws_ssm2.py into my playbooks directory in connection_plugins/ dir, set ansible_connection: aws_ssm2, and was able to successfully use ssm to run playbooks against my Ubuntu EC2 instance.

Even with the patched code from https://github.com/Filirom1/community.aws/blob/cb79826540fc58a2d13c1d95dde91c7544578748/plugins/connection/aws_ssm.py#L524 and your instructions, I still experience an issue:

ansible-playbook -i inventory_aws_ssm.yml -c aws_ssm2 playbook.yml

#playbook.yml
---
- name: Test command
  gather_facts: false
  hosts: all
  vars:
    ansible_connection: aws_ssm2
    ansible_aws_ssm_region: eu-central-1
  tasks:
    - name: test
      command:
        cmd: ls -l
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: TypeError: expected string or bytes-like object
fatal: [instance-id]: FAILED! => {"msg": "Unexpected failure during module execution.", "stdout": ""}

Update fixing the new issue: TL;DR: use the patched code with @thomas-anderson-bsl's instructions and ALSO set the var ansible_aws_ssm_bucket_name. The bucket name is required and there is no graceful exit if it's not supplied.

Apparently the plugin requires ansible_aws_ssm_bucket_name to be set; otherwise the run fails. I am still not sure why it tries to run _file_transport_command for a cmd: ls -l command; maybe someone else can clarify that. Anyway, with the patched code and the bucket variable set (make sure your account has access rights to it), I can connect to the instance 👍

ChenTsungYu commented 3 years ago

I have the same issue on Debian-based systems, e.g. Ubuntu.

eliskovets commented 3 years ago

As a workaround you can switch the default shell to bash instead of sh in the Session Manager preferences: https://aws.amazon.com/premiumsupport/knowledge-center/ssm-session-manager-change-shell/
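For anyone who prefers to make that change programmatically rather than through the console, the regional preferences live in the SSM-SessionManagerRunShell document. A hedged sketch with boto3 (the shellProfile value is the only intentional change here; check your existing document with get_document first so you don't drop other preferences such as S3/CloudWatch logging or KMS settings):

import json
import boto3

# Sketch: update the regional Session Manager preferences so Linux sessions
# exec bash instead of the default sh.
ssm = boto3.client("ssm", region_name="us-east-1")

preferences = {
    "schemaVersion": "1.0",
    "description": "Document to hold regional settings for Session Manager",
    "sessionType": "Standard_Stream",
    "inputs": {
        "runAsEnabled": False,
        "runAsDefaultUser": "",
        "shellProfile": {"windows": "", "linux": "exec /bin/bash"},
    },
}

ssm.update_document(
    Name="SSM-SessionManagerRunShell",
    Content=json.dumps(preferences),
    DocumentVersion="$LATEST",
    DocumentFormat="JSON",
)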

hexsel commented 3 years ago

As with @eliskovets, I had to change the default shell to bash, but I used the instructions at https://unix.stackexchange.com/questions/442510/how-to-use-bash-for-sh-in-ubuntu.

jagibson commented 3 years ago

For me, changing the shell using the profile in the Session Manager settings broke logins to Amazon Linux 2 for some reason (they were working fine before). What worked was changing the symlink that /bin/sh points to, via the sudo dpkg-reconfigure dash command on Ubuntu. Thanks @eliskovets!

bedge commented 3 years ago

Ran into the identical thing: https://stackoverflow.com/questions/68734815/ansible-aws-ssm-connectivity-plugin-ciphertext-refers-to-a-customer-master. Disabling KMS encryption in the SSM config fixed it (screenshot of the Session Manager preferences omitted).

Does this warrant a separate bug? Or am I missing a config value to enable "KMS encryption" in the Session Manager preferences?

(Just confirmed while replicating that the default shell needed to NOT be dash as well)

bedge commented 3 years ago

(Update, moving away from dash has no immediate downside)

Re: the Ubuntu default-shell change, here is an Ansible version of the fix:

# See "/var/cache/debconf/config.dat" for name of config item after changing manually
- name: aws-ssm ansible plugin fails if dash is the default shell
  ansible.builtin.debconf:
    name: dash/sh
    question: dash/sh
    value: false
    vtype: boolean
jagibson commented 3 years ago

@bedge your vtype should be boolean, and I don't think you should be quoting 'false'.

I haven't tested with cloud-init yet, but hopefully that's all you need to do.

bedge commented 3 years ago

@jagibson Confirmed your suggestion is correct for the Ansible (no-dash) fix:

# See "/var/cache/debconf/config.dat" for name of config item after changing manually
- name: aws-ssm ansible plugin fails if dash is the default shell
  ansible.builtin.debconf:
    name: dash/sh
    question: dash/sh
    value: false
    vtype: boolean

thanks

Now the remaining issue is the S3 permissions. It seems to work only with:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": "*"
        }
    ]
}

Any attempt to rein that in and it starts failing with:

2021-08-13 20:29:58 INFO [ssm-agent-worker] [MessageGatewayService] Sending reply {
  "SchemaVersion": 1,
  "TaskId": "bruce.edge@xxx.com-0bc9fc7bbc0da379b",
  "Topic": "agent_task_complete",
  "FinalTaskStatus": "Failed",
  "IsRoutingFailure": false,
  "AwsAccountId": "",
  "InstanceId": "i-01c257e02de698f9b",
  "Output": "Couldn't start the session because we are unable to validate encryption on Amazon S3 bucket. Error: AccessDenied: Access Denied\n\tstatus code: 403, request id: F20PG8SWSH8V4J6Z, host id: 9JY/CcOq6C6Bswaw7AJfbGLcTlzD8scLt/nEBncsI8ac9GPTEeVMDTU7B2yWcgDxn0W+fsZINW4=",
  "S3Bucket": "",
  "S3UrlSuffix": "",
  "CwlGroup": "",
  "CwlStream": ""
}

Is there any way to restrict the S3 access to a specific bucket:///... ?

bedge commented 3 years ago

After some unreliable success with dash removed as the default for /bin/sh on the targets, I switched to setting the shell profile in the AWS Session Manager prefs, and that seems to be more reliable; i.e. it hasn't failed with that set, and it consistently fails with it unset. I'm not sure how this differs from removing the /bin/sh -> dash association, but the end result is definitely better.

shinebayar-g commented 3 years ago

Does this plugin work?

1st attempt:

An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ValueError: invalid literal for int() with base 10: "echo $'\\n'$?"
ubuntu2004 | FAILED! => {
    "msg": "Unexpected failure during module execution.",
    "stdout": ""
}

2nd attempt: Changed default shell to /bin/bash in Session Manager's Preferences as suggested above.

An exception occurred during task execution. To see the full traceback, use -vvv. The error was: TypeError: expected string or bytes-like object
ubuntu2004 | FAILED! => {
    "msg": "Unexpected failure during module execution.",
    "stdout": ""
}
wilinger commented 3 years ago

Why isn't anyone using the fix in #583? It seems to work flawlessly for me.

Hokwang commented 2 years ago

I found this problem occurs on Windows as well.

ulyr commented 1 year ago

Recent Amazon Linux AMIs add some Unicode characters to inputs/outputs that can cause this same error message. If you recently launched a new instance or updated your AMIs, try the following fix: https://github.com/ansible-collections/community.aws/issues/1756#issuecomment-1572730792