Agent fails to start when installed from user data script

joeltg commented 5 years ago

Summary

The ecs agent fails to start when enabled in a user data script.

Description

Per the documentation here, I'm trying to install the ECS Container Agent on an Amazon Linux 2 EC2 instance. I'm launching Amazon Linux 2 on a t2.micro with all defaults except IAM Role set to ecsInstanceRole and user data set to:

#!/bin/bash

mkdir -p /etc/ecs
echo "ECS_CLUSTER=default" > /etc/ecs/ecs.config

amazon-linux-extras disable docker
amazon-linux-extras install -y ecs
systemctl enable --now ecs

Expected Behavior

The ecs agent starts and the instance appears in the default cluster

Observed Behavior

The instance does not appear in the default cluster. SSHing into the instance:

[ec2-user@ip-*** ~]$ systemctl status ecs
● ecs.service - ECS Agent
   Loaded: loaded (/usr/lib/systemd/system/ecs.service; enabled; vendor preset: disabled)
   Active: inactive (dead)

... and journalctl doesn't have any log entries for ecs either.

Now if I try to start the ecs agent with sudo systemctl start ecs, the command will hang indefinitely, but if I stop it with sudo systemctl stop ecs first and then start again, it will succeed and show up as registered in the default cluster.

Environment Details

Amazon Linux 2 AMI t2.micro

[ec2-user@ip-*** ~]$ curl http://localhost:51678/v1/metadata
curl: (7) Failed to connect to localhost port 51678: Connection refused

Supporting Log Snippets

(relevant error at bottom)

[ec2-user@ip-*** ~]$ cat /var/log/cloud-init-output.log 
Cloud-init v. 18.2-72.amzn2.0.6 running 'init-local' at Wed, 28 Nov 2018 20:50:16 +0000. Up 5.01 seconds.
Cloud-init v. 18.2-72.amzn2.0.6 running 'init' at Wed, 28 Nov 2018 20:50:18 +0000. Up 7.37 seconds.
.
.
.
No packages needed for security; 0 packages available
No packages marked for update
Cloud-init v. 18.2-72.amzn2.0.6 running 'modules:final' at Wed, 28 Nov 2018 20:50:25 +0000. Up 14.74 seconds.
Beware that disabling topics is not supported after they are installed.
u'docker' was not enabled. Ignoring.
  0  ansible2                 available    [ =2.4.2  =2.4.6 ]
  2  httpd_modules            available    [ =1.0 ]
  3  memcached1.5             available    [ =1.5.1 ]
  4  nginx1.12                available    [ =1.12.2 ]
  5  postgresql9.6            available    [ =9.6.6  =9.6.8 ]
  6  postgresql10             available    [ =10 ]
  8  redis4.0                 available    [ =4.0.5  =4.0.10 ]
  9  R3.4                     available    [ =3.4.3 ]
 10  rust1                    available    \
        [ =1.22.1  =1.26.0  =1.26.1  =1.27.2 ]
 11  vim                      available    [ =8.0 ]
 12  golang1.9                available    [ =1.9.2 ]
 13  ruby2.4                  available    [ =2.4.2  =2.4.4 ]
 15  php7.2                   available    \
        [ =7.2.0  =7.2.4  =7.2.5  =7.2.8  =7.2.11 ]
 16  php7.1                   available    [ =7.1.22 ]
 17  lamp-mariadb10.2-php7.2  available    \
        [ =10.2.10_7.2.0  =10.2.10_7.2.4  =10.2.10_7.2.5
          =10.2.10_7.2.8  =10.2.10_7.2.11 ]
 18  libreoffice              available    [ =5.0.6.2_15  =5.3.6.1 ]
 19  gimp                     available    [ =2.8.22 ]
 20  docker                   available    \
        [ =17.12.1  =18.03.1  =18.06.1 ]
 21  mate-desktop1.x          available    [ =1.19.0  =1.20.0 ]
 22  GraphicsMagick1.3        available    [ =1.3.29 ]
 23  tomcat8.5                available    [ =8.5.31  =8.5.32 ]
 24  epel                     available    [ =7.11 ]
 25  testing                  available    [ =1.0 ]
 26  ecs                      available    [ =stable ]
 27  corretto8                available    [ =1.8.0_192 ]
 28  firecracker              available    [ =0.11 ]
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
Cleaning repos: amzn2-core amzn2extra-ecs
6 metadata files removed
2 sqlite files removed
0 metadata files removed
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
Resolving Dependencies
--> Running transaction check
---> Package ecs-init.x86_64 0:1.22.0-4.amzn2 will be installed
--> Processing Dependency: docker >= 17.06.2ce for package: ecs-init-1.22.0-4.amzn2.x86_64
--> Running transaction check
---> Package docker.x86_64 0:18.06.1ce-5.amzn2 will be installed
--> Processing Dependency: pigz for package: docker-18.06.1ce-5.amzn2.x86_64
--> Processing Dependency: libcgroup for package: docker-18.06.1ce-5.amzn2.x86_64
--> Processing Dependency: libltdl.so.7()(64bit) for package: docker-18.06.1ce-5.amzn2.x86_64
--> Running transaction check
---> Package libcgroup.x86_64 0:0.41-15.amzn2 will be installed
---> Package libtool-ltdl.x86_64 0:2.4.2-22.2.amzn2.0.2 will be installed
---> Package pigz.x86_64 0:2.3.4-1.amzn2.0.1 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
 Package          Arch       Version                   Repository          Size
================================================================================
Installing:
 ecs-init         x86_64     1.22.0-4.amzn2            amzn2extra-ecs      12 M
Installing for dependencies:
 docker           x86_64     18.06.1ce-5.amzn2         amzn2extra-ecs      37 M
 libcgroup        x86_64     0.41-15.amzn2             amzn2-core          65 k
 libtool-ltdl     x86_64     2.4.2-22.2.amzn2.0.2      amzn2-core          49 k
 pigz             x86_64     2.3.4-1.amzn2.0.1         amzn2-core          81 k

Transaction Summary
================================================================================
Install  1 Package (+4 Dependent packages)

Total download size: 49 M
Installed size: 194 M
Downloading packages:
--------------------------------------------------------------------------------
Total                                               53 MB/s |  49 MB  00:00     
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : libtool-ltdl-2.4.2-22.2.amzn2.0.2.x86_64                     1/5 
  Installing : libcgroup-0.41-15.amzn2.x86_64                               2/5 
  Installing : pigz-2.3.4-1.amzn2.0.1.x86_64                                3/5 
  Installing : docker-18.06.1ce-5.amzn2.x86_64                              4/5 
  Installing : ecs-init-1.22.0-4.amzn2.x86_64                               5/5 
  Verifying  : pigz-2.3.4-1.amzn2.0.1.x86_64                                1/5 
  Verifying  : docker-18.06.1ce-5.amzn2.x86_64                              2/5 
  Verifying  : libcgroup-0.41-15.amzn2.x86_64                               3/5 
  Verifying  : libtool-ltdl-2.4.2-22.2.amzn2.0.2.x86_64                     4/5 
  Verifying  : ecs-init-1.22.0-4.amzn2.x86_64                               5/5 

Installed:
  ecs-init.x86_64 0:1.22.0-4.amzn2                                              

Dependency Installed:
  docker.x86_64 0:18.06.1ce-5.amzn2           libcgroup.x86_64 0:0.41-15.amzn2 
  libtool-ltdl.x86_64 0:2.4.2-22.2.amzn2.0.2  pigz.x86_64 0:2.3.4-1.amzn2.0.1  

Complete!
Installing ecs-init
  0  ansible2                 available    [ =2.4.2  =2.4.6 ]
  2  httpd_modules            available    [ =1.0 ]
  3  memcached1.5             available    [ =1.5.1 ]
  4  nginx1.12                available    [ =1.12.2 ]
  5  postgresql9.6            available    [ =9.6.6  =9.6.8 ]
  6  postgresql10             available    [ =10 ]
  8  redis4.0                 available    [ =4.0.5  =4.0.10 ]
  9  R3.4                     available    [ =3.4.3 ]
 10  rust1                    available    \
        [ =1.22.1  =1.26.0  =1.26.1  =1.27.2 ]
 11  vim                      available    [ =8.0 ]
 12  golang1.9                available    [ =1.9.2 ]
 13  ruby2.4                  available    [ =2.4.2  =2.4.4 ]
 15  php7.2                   available    \
        [ =7.2.0  =7.2.4  =7.2.5  =7.2.8  =7.2.11 ]
 16  php7.1                   available    [ =7.1.22 ]
 17  lamp-mariadb10.2-php7.2  available    \
        [ =10.2.10_7.2.0  =10.2.10_7.2.4  =10.2.10_7.2.5
          =10.2.10_7.2.8  =10.2.10_7.2.11 ]
 18  libreoffice              available    [ =5.0.6.2_15  =5.3.6.1 ]
 19  gimp                     available    [ =2.8.22 ]
 20  docker                   available    \
        [ =17.12.1  =18.03.1  =18.06.1 ]
 21  mate-desktop1.x          available    [ =1.19.0  =1.20.0 ]
 22  GraphicsMagick1.3        available    [ =1.3.29 ]
 23  tomcat8.5                available    [ =8.5.31  =8.5.32 ]
 24  epel                     available    [ =7.11 ]
 25  testing                  available    [ =1.0 ]
 26  ecs=latest               enabled      [ =stable ]
 27  corretto8                available    [ =1.8.0_192 ]
 28  firecracker              available    [ =0.11 ]
Created symlink from /etc/systemd/system/multi-user.target.wants/ecs.service to /usr/lib/systemd/system/ecs.service.
Job for ecs.service canceled.
Nov 28 20:56:28 cloud-init[3285]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [1]
Nov 28 20:56:28 cloud-init[3285]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Nov 28 20:56:28 cloud-init[3285]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Cloud-init v. 18.2-72.amzn2.0.6 finished at Wed, 28 Nov 2018 20:56:28 +0000. Datasource DataSourceEc2.  Up 377.68 seconds

petderek commented 5 years ago

Hi,

Starting ecs this way via userdata will cause a deadlock in systemd's startup scripts for docker and ecs.

The systemd units for both ecs and docker have a directive to wait for cloud-init to finish before starting. The cloud-init process isn't considered finished until your userdata has finished running. So, requesting ecs (or docker) to start within userdata will cause this condition.

You should be able to fix this by adding a '--no-block' flag: systemctl enable --now --no-block ecs.service
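
A minimal sketch of how that flag might be wrapped for use at the end of a user data script. The function name and the `SYSTEMCTL` override are assumptions added here so the call can be dry-run off-instance; neither is part of systemd or the agent.

```shell
# Sketch only: start the ECS agent from user data without waiting for the
# start job to finish. SYSTEMCTL is a hypothetical override for dry runs.
start_ecs_nonblocking() {
    # --no-block queues the start job and returns immediately, so the user
    # data script can exit and cloud-init can finish.
    "${SYSTEMCTL:-systemctl}" enable --now --no-block ecs.service
}
```

In user data this would be called as the last step, after `amazon-linux-extras install -y ecs`. Going by the explanation above, the agent should still not be running while the script executes; the queued job only proceeds once cloud-init has finished.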

Please let me know if you have any additional questions.

joeltg commented 5 years ago

That's exactly what I was looking for! Thank you!

dpavlov-smartling commented 5 years ago

Hi guys, don't you think that the systemd unit file for the ECS agent should be corrected and the line After=cloud-final.service removed from it?

leadelngalame1611 commented 5 years ago

hi @petderek thanks.

mixja commented 5 years ago

I found that the systemctl enable --now flag doesn't work in the current systemd version (219) on the Amazon Linux 2 ECS AMI - see https://unix.stackexchange.com/questions/374280/the-now-switch-of-systemctl.

Easiest way to fix this is as follows:

$ sudo systemctl edit --full ecs
...
...
[Unit]
Description=Amazon Elastic Container Service - container agent
Documentation=https://aws.amazon.com/documentation/ecs/
Requires=docker.service
After=docker.service
After=cloud-final.service # REMOVE THIS LINE
...
...
$ sudo systemctl daemon-reload

Or more directly (e.g. in a Packer script):

sudo cp /usr/lib/systemd/system/ecs.service /etc/systemd/system/ecs.service
sudo sed -i '/After=cloud-final.service/d' /etc/systemd/system/ecs.service
sudo systemctl daemon-reload

levigroker commented 5 years ago

This is also an issue if services are installed, and started by the user data scripts, which have the ecs.service as an "After" dependency.

The fact that the cloud-final.service blocks until the user data completes causes a circular dependency which prevents the system from loading.

i.e. user.data > my.service > ecs.service > cloud-final.service > user.data

See the recent comments in #1740

This should be reopened and the ecs.service config should be updated to remove the cloud-final.service as a dependency.
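
To see which units take part in a chain like the one above, a rough sketch such as the following could list the unit files that directly order themselves after cloud-final.service. The function name and the default directory are illustrative assumptions.

```shell
# Sketch only: list .service files in a directory that declare a direct
# ordering dependency on cloud-final.service. Name and default path are
# illustrative.
units_after_cloud_final() {
    unit_dir="${1:-/usr/lib/systemd/system}"
    # -l prints only the names of files with a matching After= line.
    grep -lE '^After=.*cloud-final\.service' "$unit_dir"/*.service 2>/dev/null
}
```

This only finds direct declarations. A custom unit that is merely ordered after ecs.service will not be listed, so the transitive chain still has to be followed by hand, or with `systemctl list-dependencies --after <unit>` on the instance.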

petderek commented 5 years ago

This should be reopened and the ecs.service config should be updated to remove the cloud-final.service as a dependency.

The only problem with this is that it potentially breaks another use case. The reason we have the cloud-final.service is so that a user can modify ecs configuration as part of the userdata script. Our documentation has examples like this:

echo ECS_CLUSTER=my_cluster >> /etc/ecs/ecs.config

Since we are guaranteed that userdata is completely processed before agent starts, no further systemd configuration is required in order to ensure that agent receives the intended values.
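
A small sketch of that configuration-only pattern: user data appends settings and exits without starting anything. The helper name and the `ECS_CONFIG_FILE` override are assumptions for illustration; only the `/etc/ecs/ecs.config` path and the `KEY=value` format come from the thread.

```shell
# Sketch only: append ECS agent settings from user data and then exit,
# leaving the agent to start on its own once cloud-init has finished.
# ECS_CONFIG_FILE is a hypothetical override used for dry runs.
set_ecs_option() {
    config_file="${ECS_CONFIG_FILE:-/etc/ecs/ecs.config}"
    mkdir -p "$(dirname "$config_file")"
    # One KEY=value pair per line, as in the documentation example above.
    echo "$1=$2" >> "$config_file"
}
```

For example, `set_ecs_option ECS_CLUSTER my_cluster` followed by `set_ecs_option ECS_LOGLEVEL debug` would produce the two lines the agent reads at startup.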

I'm thinking that the best path for the optimized AMI is to leave the current configuration as is for general use. The workarounds include:

Use systemctl enable --now --no-block ecs.service if you need ecs to be available as part of your userdata script.
Use approaches similar to what @mixja and @levigroker have suggested if you are customizing your own AMI and know that you'll be starting the agent within your userdata script.

dpavlov-smartling commented 5 years ago

Hi @petderek In that case, one of the mentioned workarounds, such as systemctl enable --now --no-block ecs.service or updating the unit file, should be reflected in the AWS ECS agent documentation page, in the part related to manual ECS agent installation. Systemd doesn't show any errors or other notifications about the dependency deadlock.

mixja commented 5 years ago

I would agree to leave the behaviour as is to accommodate the simple use cases of modifying ECS configuration during user data.

In my use case, I actually do a "health check" of sorts in the user data section (as part of cfn-init) and wait until the local ECS metadata endpoint is reporting back the agent has joined the desired cluster, before reporting back to CloudFormation that the instance has successfully initialised. This is a more advanced use case that can easily accommodate the workarounds required.
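
A health check of this kind might look like the sketch below, which polls the introspection endpoint until it reports the expected cluster and gives up after a timeout. The function name, argument order and timeout handling are assumptions, not the commenter's actual script.

```shell
# Sketch only: poll the agent introspection endpoint until it reports the
# expected cluster, giving up after a timeout. Function name, argument
# order and the timeout handling are illustrative assumptions.
wait_for_ecs_cluster() {
    cluster="$1"
    timeout_s="${2:-300}"
    url="${3:-http://localhost:51678/v1/metadata}"
    waited=0
    # Match "Cluster":"<name>" while tolerating optional whitespace.
    pattern="\"Cluster\"[[:space:]]*:[[:space:]]*\"$cluster\""
    until curl -s "$url" | grep -Eq "$pattern"; do
        if [ "$waited" -ge "$timeout_s" ]; then
            echo "agent did not join $cluster within ${timeout_s}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

The timeout matters because, with the stock unit file, the agent does not start until user data has finished, so an unbounded loop inside user data would never return. A call such as `wait_for_ecs_cluster default 120` is only expected to succeed when the ordering on cloud-final.service has been removed as described earlier.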

levigroker commented 5 years ago

@petderek Do you know what the timeline of the boot process looks like? Could the ecs configuration be set up in the cloud-boothook instead of a script phase? Would this be early enough?

I understand your hesitation to break the "simple" use case where ecs config is modified in the user data script, but as it is this causes a blocking issue which is far more severe than a config not being loaded as expected. I've spent nearly a week trying to understand why the ecs.service wasn't starting, until a colleague found this issue... I have to imagine there's a better way to achieve what's needed here without blocking the boot process by design.

Additionally, now understanding that the boot process is blocked by the user data, I've experimented with multiple approaches to making the user data exit quickly, and this indeed solves the issue without the need to remove the cloud-final.service from the ecs.service as a dependency. i.e. systemctl enable --now --no-block <service> (as indicated) is a viable workaround, as is using & after the shell command to prevent the command from blocking (something like systemctl start <service> &). In both cases, <service> could be the ecs.service, or a custom service which has the ecs.service as a dependency.

ericchaves commented 5 years ago

Hi @petderek and @all I'm facing a similar problem when I follow the steps described in this AWS blog post to install the rex-ray plugin on an ECS Optimized AMI (i.e. I'm not manually installing the ecs agent/service). When I run a curl loop against http://localhost:51678/v1/metadata in the user-data script to wait for the ecs service, it never stabilizes (never starts), but without this loop it starts ok.

Could this be the same issue, and if so how should I circumvent it? should I manually start ecs using systemctl enable --now --no-block ecs in my user-data script?

petderek commented 5 years ago

@ericchaves yes, this would be the same issue. The metadata endpoint won't be alive until agent starts, which by default won't happen until after the userdata script is finished.

ericchaves commented 5 years ago

@petderek , so is the user-data script below expected to work?

#!/bin/bash
systemctl enable --now --no-block ecs
docker plugin install rexray/ebs REXRAY_PREEMPT=true EBS_REGION=<AWS-REGION> --grant-all-permissions
systemctl restart docker
systemctl restart ecs

update: I tried some variations of the code above and it is still blocking the user-data script, so in the end the question is how (or where) should I restart the docker and ecs services when provisioning new ECS optimized instances?

philippefuentes commented 5 years ago

Hi all, Finally found this issue discussion as I am running into this problem too, trying to migrate our userdata script from "Amazon Linux AMI" to "Amazon Linux 2 AMI". Find it quite incredible that no documentation can be found about this problem and the correct steps to follow to simply start the ecs agent from userdata script...

Anyway, I'd just like to replace my old command:

start ecs

with the one(s) compatible with the Linux 2 AMI. Can someone tell me what I have to do, please? Is systemctl enable --now --no-block ecs enough, and can I fully rely on it? (We're migrating to the Linux 2 AMI as a first step of migrating to ARM-based a1 instances in production.)

Here is our full user data script we have to migrate:

#!/bin/bash

yum update -y

echo ECS_CLUSTER="${ECS_CLUSTER_NAME}" >> /etc/ecs/ecs.config
echo ECS_LOGLEVEL=debug >> /etc/ecs/ecs.config

start ecs

yum install -y awslogs jq

# Inject the CloudWatch Logs configuration file contents
cat > /etc/awslogs/awslogs.conf <<- EOF
[general]
state_file = /var/lib/awslogs/agent-state

[/var/log/dmesg]
file = /var/log/dmesg
log_group_name = /var/log/dmesg
log_stream_name = {cluster}/{container_instance_id}

[/var/log/messages]
file = /var/log/messages
log_group_name = /var/log/messages
log_stream_name = {cluster}/{container_instance_id}
datetime_format = %b %d %H:%M:%S

[/var/log/docker]
file = /var/log/docker
log_group_name = /var/log/docker
log_stream_name = {cluster}/{container_instance_id}
datetime_format = %Y-%m-%dT%H:%M:%S.%f

[/var/log/ecs/ecs-init.log]
file = /var/log/ecs/ecs-init.log
log_group_name = /var/log/ecs/ecs-init.log
log_stream_name = {cluster}/{container_instance_id}
datetime_format = %Y-%m-%dT%H:%M:%SZ

[/var/log/ecs/ecs-agent.log]
file = /var/log/ecs/ecs-agent.log.*
log_group_name = /var/log/ecs/ecs-agent.log
log_stream_name = {cluster}/{container_instance_id}
datetime_format = %Y-%m-%dT%H:%M:%SZ

EOF

#######################
# cloudwatch setup
#######################

# Set the region to send CloudWatch Logs data to (the region where the container instance is located)
region=$(curl -s 169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)
sed -i -e "s/region = us-east-1/region = $region/g" /etc/awslogs/awscli.conf

script
    exec 2>>/var/log/ecs/cloudwatch-logs-start.log
    set -x

    until curl -s http://localhost:51678/v1/metadata
    do
        sleep 1
    done

    # Grab the cluster and container instance ARN from instance metadata (ECS Agent Introspection)
    # Ref: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-introspection.html
    cluster=$(curl -s http://localhost:51678/v1/metadata | jq -r '. | .Cluster')
    container_instance_id=$(curl -s http://localhost:51678/v1/metadata | jq -r '. | .ContainerInstanceArn' | awk -F/ '{print $2}' )

    # Replace the cluster name and container instance ID placeholders with the actual values
    sed -i -e "s/{cluster}/$cluster/g" /etc/awslogs/awslogs.conf
    sed -i -e "s/{container_instance_id}/$container_instance_id/g" /etc/awslogs/awslogs.conf

    service awslogs start
    chkconfig awslogs on

end script

Thank you

dpavlov-smartling commented 5 years ago

@philippefuentes it looks like you are using ECS optimized AMI, so yes, in such case you should use systemctl enable --now --no-block ecs to start ecs agent

philippefuentes commented 5 years ago

@dpavlov-smartling thanks for your quick answer and sorry for not being specific enough, I do indeed use the ECS Optimized AMI

Unfortunately, it does not work for me: after SSHing into a fresh new instance, the agent has not been started by systemctl enable --now --no-block ecs

[ec2-user@ip-10-1-1-146 ~]$ systemctl status ecs
● ecs.service - Amazon Elastic Container Service - container agent
   Loaded: loaded (/usr/lib/systemd/system/ecs.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: https://aws.amazon.com/documentation/ecs/

philippefuentes commented 5 years ago

Even in the context of a userdata script, none of the commands suggested in this discussion worked for me, the only way I can seem to start the agent is by editing the conf file as suggested by @mixja , so by finally replacing:

start ecs

with:

sed -i '/After=cloud-final.service/d' /usr/lib/systemd/system/ecs.service
systemctl daemon-reload

In my userdata script, the agent is started correctly. But it is not really appealing to me...should I use this in production ?

Additional info, I'm using this AMI: ami-09cd8db92c6bf3a84

11:47 $ aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended --region eu-west-1
{
    "Parameters": [
        {
            "Name": "/aws/service/ecs/optimized-ami/amazon-linux-2/recommended",
            "Type": "String",
            "Value": "{\"schema_version\":1,\"image_name\":\"amzn2-ami-ecs-hvm-2.0.20190402-x86_64-ebs\",\"image_id\":\"ami-09cd8db92c6bf3a84\",\"os\":\"Amazon Linux 2\",\"ecs_runtime_version\":\"Docker version 18.06.1-ce\",\"ecs_agent_version\":\"1.27.0\"}",
            "Version": 12
        }
    ],
    "InvalidParameters": []
}

dpavlov-smartling commented 5 years ago

We don't use the ECS optimized AMI, so we need to install the ECS agent manually, and the following command does work properly: amazon-linux-extras disable docker && amazon-linux-extras install -y ecs && systemctl enable --now --no-block ecs

philippefuentes commented 5 years ago

It looks ok in this context indeed, I wish straightforward starting steps could be done when using ECS Optimized AMI. thx

ericchaves commented 5 years ago

Hi all, following @philippefuentes's suggestion I was able to adjust my user-data script (I'm also using the Amazon Linux 2 ECS Optimized AMI).

I'm sharing my final user-data script for others who face similar trouble and happen to find this issue (until the aws docs & post get updated, I hope =) ).

#!/bin/bash
yum install -y aws-cfn-bootstrap
yum update -y
/opt/aws/bin/cfn-init -v --stack ${AWS::StackName} --resource ECSInstanceConfiguration --region ${AWS::Region}
sed -i '/After=cloud-final.service/d' /usr/lib/systemd/system/ecs.service
systemctl daemon-reload
exec 2>>/var/log/ecs-agent-reload.log
set -x
until curl -s http://localhost:51678/v1/metadata; do sleep 1; done
docker plugin install rexray/ebs REXRAY_PREEMPT=true EBS_REGION=${AWS::Region} --grant-all-permissions
systemctl restart docker
systemctl restart ecs
/opt/aws/bin/cfn-signal -e $? --stack ${AWS::StackName} --resource ECSScalingGroup --region ${AWS::Region}

Cheers!

cixelsyd commented 4 years ago

The AWS docs really, really should mention --no-block for Amazon Linux2 AMI, because this is an absolutely ridic. issue to troubleshoot when following the instructions on the Amazon ECS doc pages. The Amazon ECS doc pages for 'how to install and run the ecs cluster agent' make the process seem trivial... but then you hit a race condition that only magically resolves itself if you land here and add the 'no-block' or you kill the systemctl process and notice that all of a sudden the ecs-init process that follows it seems to make things work.

markuman commented 3 years ago

Nothing that is mentioned here worked for me. I guess there is something fucked up with the systemd asynchronous startup process (dependencies).
The safest workaround is to disable the ecs service and add an ecs.timer for the ecs service, which starts the ecs service with a little delay.

[Timer]
OnBootSec=15s

[Install]
WantedBy=basic.target
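
For anyone trying this, a sketch of how the fragment above might be written out as a complete ecs.timer unit. The `[Unit]` description, the function name and the `TIMER_DIR` override are assumptions; the `[Timer]` and `[Install]` values are the ones given above.

```shell
# Sketch only: write a complete ecs.timer unit around the fragment above.
# The [Unit] description and the TIMER_DIR override are assumptions.
write_ecs_timer() {
    timer_dir="${TIMER_DIR:-/etc/systemd/system}"
    cat > "$timer_dir/ecs.timer" <<'EOF'
[Unit]
Description=Delayed start of the ECS agent

[Timer]
# Start ecs.service 15 seconds after boot; a timer activates the
# service with the same base name unless Unit= says otherwise.
OnBootSec=15s

[Install]
WantedBy=basic.target
EOF
}
```

After writing the file, the remaining steps would presumably be `systemctl daemon-reload`, `systemctl disable ecs.service` and `systemctl enable ecs.timer`. This has not been verified here on Amazon Linux 2.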

Ichimikichiki commented 3 years ago

lol do you just make your customers go around in circles following instructions that don't work?

This is pretty embarrassing...

sparrc commented 3 years ago

Hello everyone, I am not currently aware of any ECS docs page that recommends running systemctl start ecs in userdata on the AL2 platform. We have this note in the docs page about installing the ecs container agent that I think clarifies this behavior, so I'm not sure exactly what more we can do to help with this issue (from https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-install.html):

Are there any ECS docs that currently direct users to start ecs in AL2 userdata? If so please provide URLs and we can fix them ASAP.

jk2l commented 3 years ago

Hello everyone, I am not currently aware of any ECS docs page that recommends running systemctl start ecs in userdata on the AL2 platform. We have this note in the docs page about installing the ecs container agent that I think clarifies this behavior, so I'm not sure exactly what more we can do to help with this issue (from https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-install.html):

Are there any ECS docs that currently direct users to start ecs in AL2 userdata? If so please provide URLs and we can fix them ASAP.

I think there is one advanced use case for using ECS with Rexray to mount EBS.

https://aws.amazon.com/blogs/compute/amazon-ecs-and-docker-volume-drivers-amazon-ebs/

This blog post, written by AWS, is based on Amazon Linux (AL), not Amazon Linux 2 (AL2), but I would like to use AL2 over the old AL.

The steps involved require restarting the ecs service after the plugin is installed, but at the moment stop/start and restart are blocked by the deadlock described in this issue.

saranglakare commented 2 years ago

I was facing the same issue till I realized that there was no need to re-start the ecs service for Rexray device. I think the document is old (mentions Amazon Linux 1). As per comments above, the ECS service is not even running at this point and is going to be started at the end of user data execution anyway, so why do we need to start it within user data at all?

My docker plugin ls shows Rexray as enabled. So I am guessing removing the ECS stop / start script is the right way to do this.

saranglakare commented 2 years ago

Hello everyone, I am not currently aware of any ECS docs page that recommends running systemctl start ecs in userdata on the AL2 platform.

I think the main problem is that there is no mention of how to start the ecs service when running AL2 from User data. In the linked page, in step 4 you mention sudo amazon-linux-extras install -y ecs; sudo systemctl enable --now ecs. However, you are asking to perform these steps by logging into the machine (step 2). This is ok for manual configuration. But when using something like ASG, devs depend on user data to install ecs and start it.

One clarification I need: if we install ecs using sudo amazon-linux-extras install -y ecs, will it auto-start the service after user data init is complete? If that's the case, then all problems are resolved.

In any case, I strongly feel you should have a section in the above link on How to enable ECS from User data. I just wasted 2 days simply trying to solve the ecs agent getting locked issue! So I feel many other people probably go through this horrid experience. Thanks!

sagungargs15 commented 2 years ago

@petderek I am also trying to create persistent storage using an EBS volume in ECS via container instances on EC2 (Amazon Linux 2 (AL2) AMI). My goal is to attach an existing EBS volume by using the name of that EBS volume. My concerns are around the EC2 user_data script and the task_definition used with autoprovision = false

I found these interesting blogs around the subject:

@saranglakare

Please can you share your user_data script here (exact copy paste)
What precautions did you take to make sure the EC2 instance got attached to the ECS cluster named in user_data when using the Amazon Linux 2 (AL2) AMI? (I am experiencing issues where attachment is failing and I suspect 3 things are mandatory: a) an ECS-compatible AMI, b) an appropriate IAM role with permissions for ECS when configuring the EC2 instance, c) a public IP for the instance.)
Finally did removing the extra ecs script from the user_data i.e removing the 3rd line "So I am guessing removing the ECS stop / start script is the right way to do this." did it work for you ? - Is the assumption here that rex-ray will auto attach an extra volume apart from the root volume already pre-configured during the EC2 AMI2 launch ( I am trying to use t3xlarge AMI-id ami-0a2bfca3c9d16280d)

#!/bin/bash
echo "ECS_CLUSTER=portcast-eta-test" >> /etc/ecs/ecs.config

# install the REX-Ray Docker volume plugin
docker plugin install rexray/ebs REXRAY_PREEMPT=true EBS_REGION=ap-southeast-1 --grant-all-permissions

# restart the ECS agent. This ensures the plugin is active and recognized once the agent starts.
sudo systemctl restart ecs

In the task definition when specifying the autoprovision = false does it imply now I still need to mention the volume driver_options i.e the volumetype gp2 and size 5 ?

To use existing volume it says here needs to be used which then gets referenced in task definition based on the documentation https://aws.amazon.com/blogs/compute/amazon-ecs-and-docker-volume-drivers-amazon-ebs/

--tag-specifications 'ResourceType=volume,Tags=[{Key=Name,Value=rexray-vol}]'

saranglakare commented 2 years ago

I think this is a bit off topic for this thread. Your user data script is correct; just remove the last line that restarts ecs, which is simply not required! The plugin has nothing to do with ECS. It's a Docker plugin. You can ssh to the machine and run docker plugin ls, and you should see that the plugin is installed and active.

Just follow Step 3 onwards from this guide: https://aws.amazon.com/blogs/compute/amazon-ecs-and-docker-volume-drivers-amazon-ebs/

If you are using existing EBS volume, just use the name of the EBS volume in your Task definition (under volumes -> name), set autoprovision=false, set shared=true and you do not need to give any driverOpts.

The only restriction is that the volume has to be in the same availability zone as the instance running. So ensure you only allow same az instances - from the same az where your volume exists. Hope this helps!

sparrc commented 2 years ago

One clarification I need: if we install ecs using sudo amazon-linux-extras install -y ecs, will it auto-start the service after user data init is complete? If that's the case, then all problems are resolved.

In any case, I strongly feel you should have a section in the above link on How to enable ECS from User data. I just wasted 2 days simply trying to solve the ecs agent getting locked issue! So I feel many other people probably go through this horrid experience. Thanks!

Yes the service will be auto-started. The problem is I'm not exactly sure what we would write in a section titled "How to enable ECS from User data". ECS agent doesn't need to be enabled in userdata. If it's been installed via the rpm or deb package (or repos) then it's already been enabled. It will auto-start after userdata ends.

bemillenium commented 2 years ago

--no-block ecs.service

This works perfectly

aws / amazon-ecs-agent