aws / amazon-ssm-agent

An agent to enable remote management of your EC2 instances, on-premises servers, or virtual machines (VMs).
https://aws.amazon.com/systems-manager/
Apache License 2.0

ECS execute_command (in SDK) Session hangs when log-output is above certain length #443

Open jsboak opened 2 years ago

jsboak commented 2 years ago

I am trying to execute commands on ECS containers and have the logged output sent to CloudWatch (or S3). I have successfully configured everything (confirmed with CLI and SDK).

However, when using the SDK with logged output over a certain size (e.g. 10 full lines of text), the Session Manager session hangs and has to be terminated. Terminating it sends the output to CloudWatch, but the logged output is typically incomplete: only ~80% of it arrives. This is after waiting more than 10 minutes for the session to complete on its own.

This same behavior is not experienced when using the CLI: with the CLI, the output is consistently returned in full (regardless of output size). The behavior is also not experienced when the logged-output is not configured to send to CloudWatch or S3.

Both the CLI and SDK are using the same configurations and are testing against the same containers in the same cluster with the same API Credentials.

Here is sample Python that does not work (when ECS is configured to send session output to CloudWatch/S3):

import boto3

# boto3.set_stream_logger('')

client = boto3.client('ecs')

response = client.execute_command(
    cluster='default',
    container='nginx',
    interactive=True,
    task='7e10ee8306fd44b5b54a420b1e977af3',
    command='tail -n 15 /var/log/amazon/ssm/amazon-ssm-agent.log')

print("Execute Command Response: \n" + str(response))

The session starts successfully, and it's evident that the commands are sent to the container, but then the session hangs. However, when we simply change -n 15 down to (for example) -n 5, everything works as expected (session closes on its own and full log-output is sent to CW or S3).

Here is the CLI equivalent that consistently works, regardless of logged-output size. Again, using the same credentials and same ECS Tasks:

aws ecs execute-command --cluster default --task 7e10ee8306fd44b5b54a420b1e977af3 --interactive --container nginx --command "tail -n 100 /var/log/amazon/ssm/amazon-ssm-agent.log"

Here is part of the SSM Agent log (attached as ssm-agent-log.txt). The ecs-execute-command-088c9ed1039b8ab95 session had to be terminated manually.

andymac4182 commented 2 years ago

The CLI uses the session-manager-plugin (https://github.com/aws/session-manager-plugin) to run the session; see https://github.com/aws/aws-cli/blob/develop/awscli/customizations/ecs/executecommand.py for how it is invoked. It might be worth looking at using that from the SDK to get the output.
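As a rough sketch of what that could look like from Python: the helper below builds the argument list for the session-manager-plugin binary from an execute_command response. The argument order and the "ecs:cluster_task_runtimeId" target format are modeled on my reading of the linked executecommand.py and should be checked against that file. The function name build_plugin_argv is hypothetical, and the container runtime ID is assumed to be looked up separately (e.g. from the runtimeId field in describe_tasks).

```python
import json


def build_plugin_argv(response, region, container_runtime_id,
                      profile="", endpoint_url=""):
    """Build argv for session-manager-plugin from an ECS execute_command response.

    Assumption: argument order and target format mirror the aws-cli
    customization linked above; verify against that source before relying on it.
    """
    # The ARNs end in ".../<cluster>" and ".../<task-id>".
    cluster_name = response["clusterArn"].split("/")[-1]
    task_id = response["taskArn"].split("/")[-1]
    target = f"ecs:{cluster_name}_{task_id}_{container_runtime_id}"
    return [
        "session-manager-plugin",
        json.dumps(response["session"]),  # sessionId, streamUrl, tokenValue
        region,
        "StartSession",
        profile,
        json.dumps({"Target": target}),
        endpoint_url,
    ]


# Possible usage (needs the plugin installed and a real response):
#   import subprocess
#   subprocess.check_call(build_plugin_argv(response, "us-east-1", runtime_id))
```

The SDK call on its own only returns the session details; nothing in the sample script above connects to the returned stream, which may be why the CLI (which hands the session to the plugin) behaves differently.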

Camoen commented 2 years ago

I believe we hit this issue as well. Our workaround for now is to send our commands to a session (tmux, screen, etc.), rather than allowing the ECS execute-command SDK to run the commands directly. This allows the ECS execute-command to terminate rapidly while leaving the running session alive inside the ECS container.

vitaly-eureka-security commented 2 years ago

@Camoen - in the workaround above, are you still able to get the command output in S3/CW?

We occasionally hit the issue even for commands with small (less than 1K characters) outputs. The execution hangs for ~20 minutes after which the output gets uploaded to S3 (when the output is small it is fully stored there).

This is how it looks in the container instance logs: ssm-agent-log.log

Camoen commented 2 years ago

Unfortunately, we don't have our clusters configured to emit ECS exec logs to CloudWatch/S3. I only found out about that option yesterday and haven't had a chance to enable it yet. Our command does indirectly trigger an operation in our application that uploads GBs of data to S3, but that upload happens inside our tmux session and is not handled by the ECS exec command directly. You may be better off using ECS exec to send your command to a session and handling the required upload yourself (pipe your command output to a file and tack on an aws s3 cp at the end of your command).
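A minimal sketch of that workaround, assuming tmux and the AWS CLI are both installed in the container and the task role can write to the bucket. The helper name build_detached_command, the session name, the log path, and the bucket URI are all hypothetical. It only builds the command string; how the exec session tokenizes quotes is something to verify in your own setup.

```python
import shlex


def build_detached_command(command, s3_uri,
                           session_name="exec-job",
                           log_path="/tmp/exec-output.log"):
    """Wrap `command` so it runs in a detached tmux session, writes its
    output to a file, and then copies that file to S3."""
    # Redirect stdout and stderr to the log file, then upload it.
    inner = (
        f"{command} > {shlex.quote(log_path)} 2>&1; "
        f"aws s3 cp {shlex.quote(log_path)} {shlex.quote(s3_uri)}"
    )
    # `tmux new-session -d` returns immediately, so the exec session can exit
    # while the wrapped command keeps running inside the container.
    return f"tmux new-session -d -s {shlex.quote(session_name)} {shlex.quote(inner)}"


# Possible usage with the SDK call from the original post:
#   client.execute_command(..., command=build_detached_command(
#       "tail -n 100 /var/log/amazon/ssm/amazon-ssm-agent.log",
#       "s3://example-bucket/exec-output.log"))
```

Because the upload is done by the wrapped command rather than by the ECS exec logging configuration, the output no longer depends on the session closing cleanly.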
