ansible-collections / community.aws

Ansible Collection for Community AWS

aws_ssm should re-use session across tasks for speed improvement #1853

Open mdavis-xyz opened 1 year ago

mdavis-xyz commented 1 year ago

Summary

I am finding the aws_ssm connection plugin slow. For example, it takes 10 seconds to run a file task that checks, but does not change, an existing directory. My gut feeling is that it's at least 4 times slower than SSH.

Whilst the SSM protocol itself is slower than SSH, I think we can drastically improve performance by re-using the SSM session between tasks.

When running a playbook with aws_ssm as the connection plugin, and with -vvvv as an argument, I can see ESTABLISH SSM CONNECTION TO ... printed at the start of each task. So I think the plugin calls client.start_session() once per task and terminates the session at the end of each task. The termination is called from __del__, which is invoked by Ansible core. I'm hoping there's some easy way to tell Ansible core to re-use the session.
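For illustration, this is roughly the pattern I believe the plugin follows today. It's a simplified sketch of the boto3 calls involved, not the plugin's actual code (run_task and its body are made up); re-use would mean hoisting start_session() out of the per-task path.

    # Simplified sketch of the per-task cost. Illustrative names, not the
    # plugin's real structure; start_session/terminate_session are real
    # boto3 SSM client calls.
    import boto3

    client = boto3.client("ssm")

    def run_task(instance_id, command):
        # paying this setup cost on every single task is what hurts
        session = client.start_session(Target=instance_id)
        try:
            ...  # hand off to session-manager-plugin and run the command
        finally:
            client.terminate_session(SessionId=session["SessionId"])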

At first I thought this is what pipelining is for. I've read the docs and don't quite understand pipelining, but I think it's unrelated. (Although if we sent small files through SSM instead of via S3, that would also speed up the plugin.)

I think what we want is "persistent" sessions. In ConnectionBase in the Ansible core repo, I see

    # the following control whether or not the connection supports the
    # persistent connection framework or not
    supports_persistence = False
    force_persistence = False

These values are not overridden for the aws_ssm.py module. So my hypothesis is that this is why the session is not re-used.

What I've tried

To try to implement this improvement, I subclassed the Connection class from aws_ssm into a connection plugin called my_ssm (since I find the normal development installation instructions confusing).
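The subclass was roughly this (a minimal sketch; my_ssm is just my local plugin name, and the two flags are the ones quoted from ConnectionBase above):

    # plugins/connection/my_ssm.py -- minimal sketch of the experiment
    from ansible_collections.community.aws.plugins.connection.aws_ssm import (
        Connection as SsmConnection,
    )

    class Connection(SsmConnection):
        transport = "my_ssm"
        # opt in to Ansible core's persistent connection framework
        supports_persistence = True
        force_persistence = True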

When I run with that, I get an error on the first task for my SSM host.

An exception occurred during task execution. To see the full traceback, use -vvv. The error was: KeyError: 'Requested entry (plugin_type: connection plugin: my_ssm setting: persistent_command_timeout ) was not defined in configuration.'
fatal: [pdr_batcher_matt]: FAILED! => 
  msg: 'Unexpected failure during module execution: ''Requested entry (plugin_type: connection plugin: my_ssm setting: persistent_command_timeout ) was not defined in configuration.'''
  stdout: ''

I don't know what configuration it's referring to. I tried adding persistent_command_timeout = 60 as an attribute on the Connection class; that didn't fix it. The Ansible docs say that PERSISTENT_COMMAND_TIMEOUT is already defined by default, with a value of 30. Setting the environment variable ANSIBLE_PERSISTENT_COMMAND_TIMEOUT=60 did not fix the issue either. Nor did adding the following to ansible.cfg:

[persistent_connection]
command_timeout = 300
connect_timeout = 300

I also defined the following in my host vars, next to the other SSM config:

ansible_my_ssm_persistent_command_timeout: 60

Still, I get the same error.

I modified the docstring in the connection plugin, adding:

  persistent_command_timeout:
    description: add expl here
    default: 60
    type: integer
    vars:
    - name: ansible_my_ssm_persistent_command_timeout

Now I get a different error. So apparently the docstring gets consumed by Ansible's code? That's really surprising.

I don't understand how that works, though. For example, the docstring for aws_ssm includes ssm_timeout, but that option is not referenced anywhere else in the file. Is there something I'm missing, or is this a bug, in that the timeout option is never actually passed to boto?
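As far as I can tell, Ansible parses the plugin's DOCUMENTATION string at load time and exposes the declared options through get_option(), so an option never has to appear as a plain Python attribute. A sketch of the pattern, with made-up names:

    # Sketch of Ansible's declarative plugin options (names are made up).
    # The YAML in DOCUMENTATION is parsed by Ansible's config loader; the
    # plugin then reads resolved values with get_option().
    from ansible.plugins.connection import ConnectionBase

    DOCUMENTATION = """
        name: demo
        short_description: demo connection plugin
        options:
          demo_timeout:
            description: declared only here, never as a Python attribute
            default: 30
            type: integer
            vars:
              - name: ansible_demo_timeout
    """

    class Connection(ConnectionBase):
        transport = "demo"
        # put_file/fetch_file/close etc. omitted from this sketch

        def exec_command(self, cmd, in_data=None, sudoable=True):
            # resolves the host var, environment, or the default above
            timeout = self.get_option("demo_timeout")
            ...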

Anyway, after adding a new option to the docstring, the new error I get is:

    Traceback (most recent call last):
      File "/Users/matthew/.pyenv/versions/3.10.0/lib/python3.10/site-packages/ansible/module_utils/connection.py", line 164, in _exec_jsonrpc
        response = json.loads(out)
      File "/Users/matthew/.pyenv/versions/3.10.0/lib/python3.10/json/__init__.py", line 346, in loads
        return _default_decoder.decode(s)
      File "/Users/matthew/.pyenv/versions/3.10.0/lib/python3.10/json/decoder.py", line 337, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "/Users/matthew/.pyenv/versions/3.10.0/lib/python3.10/json/decoder.py", line 355, in raw_decode
        raise JSONDecodeError("Expecting value", s, err.value) from None
    json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/Users/matthew/.pyenv/versions/3.10.0/bin/ansible-connection", line 8, in <module>
        sys.exit(main())
      File "/Users/matthew/.pyenv/versions/3.10.0/lib/python3.10/site-packages/ansible/cli/scripts/ansible_connection_cli_stub.py", line 335, in main
        messages.extend(Connection(socket_path).pop_messages())
      File "/Users/matthew/.pyenv/versions/3.10.0/lib/python3.10/site-packages/ansible/module_utils/connection.py", line 194, in __rpc__
        response = self._exec_jsonrpc(name, *args, **kwargs)
      File "/Users/matthew/.pyenv/versions/3.10.0/lib/python3.10/site-packages/ansible/module_utils/connection.py", line 173, in _exec_jsonrpc
        raise ConnectionError(
    ansible.module_utils.connection.ConnectionError: Unable to decode JSON from response to pop_messages(). Received 'None'.

That seems like a very strange low-level error. It comes from here, but I don't know why. My guess is that the background ansible-connection process (which the persistent framework uses, speaking JSON-RPC over a local socket) is dying before it can send a reply. Beyond that, I'm lost.

Issue Type

Feature Idea

Component Name

aws_ssm

Additional Information

See above


mdavis-xyz commented 1 year ago

@psharkey @hanumantharaomvl @gau1991 Can you help me with this?

mdavis-xyz commented 1 year ago

The ansible.netcommon.persistent connection plugin sounds relevant. Its documentation says:

This is a helper plugin to allow making other connections persistent.

But the documentation doesn't say how to use it.

The general connection plugin docs say:

only one can be used per host at a time

I don't understand how ansible.netcommon.persistent can be a helper for other connection plugins if only one plugin can be used at a time.

kaeraali-flutterint commented 1 year ago

Is this not exactly what pipelining is for? I see has_pipelining = False in the Connection class, and I wonder if this means the aws_ssm connection plugin needs to implement pipelining in order to speed it up?

mdavis-xyz commented 1 year ago

No, that's apparently not what pipelining is. (I assumed so too.) See this email.

Normally Ansible sends a .py file to the target, then sends a shell command to execute that file. Pipelining instead feeds the Python script straight into the remote interpreter's stdin (something like python3 - < module.py). In the context of aws_ssm I think that would mean the .py files don't go via S3 and are pasted into the terminal directly. But with pipelining=True the plugin would still initiate a new session for each task, and then paste the python script into the shell within that task. See the toy example below.
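A toy illustration of the idea (my own example, nothing to do with Ansible's actual implementation): the "module" runs without ever being written to disk, because its source is piped into the interpreter's stdin.

    # Toy illustration of pipelining: execute a "module" without writing
    # a file first, by piping its source into the interpreter's stdin.
    import subprocess

    module_source = 'print("hello from a pipelined module")'

    # the non-pipelined flow would be: write module_source to a temp file
    # on the target, then run it. The pipelined flow is a single step:
    result = subprocess.run(
        ["python3", "-"], input=module_source, capture_output=True, text=True
    )
    print(result.stdout)  # -> hello from a pipelined module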

kaeraali-flutterint commented 1 year ago

I have spent a bit of time looking at why this runs so slowly, and it's not just that the sessions aren't being reused; the file transfers via S3 are slow too.

For Linux servers I tried SSH over SSM with https://github.com/elpy1/ssh-over-ssm, and one of my playbooks went from 4 minutes with the aws_ssm connection plugin down to 40 seconds with SSH over SSM. That's a huge time saving, and it just requires some local configuration (sketched below) and switching back to the ssh connection plugin.
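For reference, the underlying trick (which ssh-over-ssm automates) is the ProxyCommand pattern from AWS's Session Manager docs, something along these lines in ~/.ssh/config:

    # ~/.ssh/config -- route ssh to instance IDs through Session Manager
    Host i-* mi-*
        ProxyCommand sh -c "aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"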

However, that solution doesn't work for Windows EC2 instances. And that's where the real performance problem is for us anyway, since every task involves copying PowerShell scripts to the host that never get reused. That feels like a more fundamental Ansible issue rather than something specific to this plugin.