aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

Cannot launch tasks on Ubuntu 22.04 #3227

Closed sunds closed 1 year ago

sunds commented 2 years ago

Summary

OS: Ubuntu 22.04 (LTS)
ECS agent version="1.61.1" commit="8dc9fdeb"

Containers will not start.

Description

err=cgroupv2 create: unable to create v2 manager: dial unix /run/systemd/private: connect: no such file or directory

The problem is that the ECS agent runs in Docker and /run/systemd/private is not mounted into the container. Editing the container config to add that bind mount worked around the problem.

Expected Behavior

Container runs

Observed Behavior

Launch fails due to missing bind mount

Environment Details

curl http://localhost:51678/v1/metadata
{"Cluster":"dsunds-test-1","ContainerInstanceArn":"arn:aws:ecs:us-east-1:585275055393:container-instance/dsunds-test-1/17da2f096e234930a8ea495d5cb6b575","Version":"Amazon ECS Agent - v1.61.1 (8dc9fdeb)"}

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04 LTS
Release: 22.04
Codename: jammy

Deployed onto a bare-metal server.

Supporting Log Snippets

Error from ECS agent log:

cgroup: unable to create cgroup taskARN=arn:aws:ecs:us-east-1:585275055393:task/dsunds-test-1/383621ce97f643749b2c06061d345884 cgroupPath=ecstasks-383621ce97f643749b2c06061d345884.slice cgroupV2=true err=cgroupv2 create: unable to create v2 manager: dial unix /run/systemd/private: connect: no such file or directory"

The relevant part is the last error. Digging into the source, the agent is trying to connect to the private D-Bus socket.

sunds commented 2 years ago

It is worth noting that direct access to /run/systemd/private happens only if the dbus daemon cannot be contacted:

// NewWithContext establishes a connection to any available bus and authenticates.
// Callers should call Close() when done with the connection.
func NewWithContext(ctx context.Context) (*Conn, error) {
	conn, err := NewSystemConnectionContext(ctx)
	if err != nil && os.Geteuid() == 0 {
		return NewSystemdConnectionContext(ctx)
	}
	return conn, err
}

https://github.com/aws/amazon-ecs-agent/blob/8dc9fdeb7b876dad609b06001448d0d04e4825fe/agent/vendor/github.com/coreos/go-systemd/v22/dbus/dbus.go#L121
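For anyone debugging the fallback described above, here is a minimal sketch that reports whether each of the two sockets exists on the host. The /run/systemd/private path comes from the error message; /var/run/dbus/system_bus_socket is assumed to be the system bus path and may differ on your system (for example if DBUS_SYSTEM_BUS_ADDRESS is set). The function name socket_status is hypothetical. Whether a socket is also visible inside the agent container depends on the container's bind mounts, which this check does not inspect.

```shell
# socket_status PATH: report whether PATH exists and is a Unix domain socket.
socket_status() {
  if [ -S "$1" ]; then
    echo "present: $1"
  else
    echo "missing: $1"
  fi
}

# Usage on the host (the first path is an assumption, see above):
#   socket_status /var/run/dbus/system_bus_socket
#   socket_status /run/systemd/private
```

If the system bus socket is present but the agent still falls back to the private socket, the connection is probably being refused for some other reason rather than the socket being absent.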

sunds commented 2 years ago

The problem was AppArmor on this system blocking the call to D-Bus.

apparmor_status
apparmor module is loaded.
38 profiles are loaded.
37 profiles are in enforce mode.
...
docker-default

Log:

May 27 03:13:12 garage kernel: [15540.770327] audit: type=1107 audit(1653621192.007:94): pid=759 uid=103 auid=4294967295 ses=4294967295 subj=? msg='apparmor="DENIED" operation="dbus_method_call" bus="system" path="/org/freedesktop/DBus" interface="org.freedesktop.DBus" member="Hello" mask="send" name="org.freedesktop.DBus" pid=5440 label="docker-default" peer_label="unconfined"
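To find lines like the one above on your own host, a small filter along these lines may help. It is only a sketch: it matches on the apparmor="DENIED" and operation="dbus_ fields seen in this log entry, and the function name dbus_denials and the log sources in the usage comments are assumptions.

```shell
# dbus_denials: read log lines on stdin and print only AppArmor denials
# whose operation is a D-Bus operation (for example dbus_method_call).
dbus_denials() {
  grep 'apparmor="DENIED"' | grep 'operation="dbus_'
}

# Usage (pick whichever log source your host provides):
#   journalctl -k | dbus_denials
#   dbus_denials < /var/log/syslog
```

A match with label="docker-default" points at the container's profile, as in the entry above.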

Adding --security-opt apparmor:unconfined to the docker run command resolved this issue. However, this is not the default when the agent is installed from https://amazon-ecs-agent.s3.amazonaws.com/ecs-anywhere-install-latest.sh

Perhaps this issue should be moved to https://github.com/aws/amazon-ecs-init ?

Working command:

docker run \
  --name "/ecs-agent" \
  --runtime "runc" \
  --volume "/var/run:/var/run" \
  --volume "/var/log/ecs:/log" \
  --volume "/var/lib/ecs/data:/data" \
  --volume "/etc/ecs:/etc/ecs" \
  --volume "/var/cache/ecs:/var/cache/ecs" \
  --volume "/sys/fs/cgroup:/sys/fs/cgroup" \
  --volume "/var/lib/ecs:/var/lib/ecs" \
  --volume "/var/log/ecs/exec:/log/exec" \
  --volume "/etc/ssl:/etc/ssl:ro" \
  --volume "/root/.aws:/rotatingcreds:ro" \
  --volume "/run/docker/plugins:/run/docker/plugins:ro" \
  --volume "/etc/docker/plugins:/etc/docker/plugins:ro" \
  --volume "/usr/lib/docker/plugins:/usr/lib/docker/plugins:ro" \
  --volume "/var/lib/ecs/deps/execute-command/bin:/managed-agents/execute-command/bin:ro" \
  --volume "/var/lib/ecs/deps/execute-command/config:/managed-agents/execute-command/config" \
  --volume "/var/lib/ecs/deps/execute-command/certs:/managed-agents/execute-command/certs:ro" \
  --volume "/proc:/host/proc:ro" \
  --volume "/usr/lib:/usr/lib:ro" \
  --volume "/lib:/lib:ro" \
  --volume "/usr/lib64:/usr/lib64:ro" \
  --volume "/lib64:/lib64:ro" \
  --volume "/sbin:/host/sbin:ro" \
  --volume "/etc/alternatives:/etc/alternatives:ro" \
  --volume "/usr/sbin:/usr/sbin:ro" \
  --log-driver "json-file" \
  --log-opt max-file="4" \
  --log-opt max-size="16m" \
  --restart "" \
  --network "host" \
  --hostname "garage" \
  --expose "51678/tcp" \
  --expose "51679/tcp" \
  --env "ECS_DATADIR=/data" \
  --env "ECS_ENABLE_TASK_IAM_ROLE=true" \
  --env "ECS_UPDATE_DOWNLOAD_DIR=/var/cache/ecs" \
  --env "ECS_EXTERNAL=true" \
  --env "ECS_CLUSTER=dsunds-test-1" \
  --env "ECS_LOGFILE=/log/ecs-agent.log" \
  --env "ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST=true" \
  --env "ECS_VOLUME_PLUGIN_CAPABILITIES=[\"efsAuth\"]" \
  --env "ECS_UPDATES_ENABLED=true" \
  --env "ECS_AVAILABLE_LOGGING_DRIVERS=[\"json-file\",\"syslog\",\"awslogs\",\"fluentd\",\"none\"]" \
  --env "ECS_AGENT_LABELS=" \
  --env "ECS_AGENT_CONFIG_FILE_PATH=/etc/ecs/ecs.config.json" \
  --env "SSL_CERT_DIR=/etc/ssl/certs" \
  --env "ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE=true" \
  --env "AWS_DEFAULT_REGION=us-east-1" \
  --env "ECS_ENABLE_TASK_ENI=false" \
  --env "PATH=/host/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" \
  --detach \
  --entrypoint "/agent" \
  --security-opt apparmor:unconfined \
  "amazon/amazon-ecs-agent:latest"

Realmonia commented 2 years ago

Thanks for reporting! Ubuntu 22 is currently not an officially supported platform (ref). This is tracked internally, and we will post about any updates.

shanet commented 2 years ago

I ran into the same issue and fixed it by adding a custom AppArmor profile that allows access to D-Bus, as follows:

#include <tunables/global>

profile docker-ecs-agent flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>
  network,
  capability,
  file,
  umount,

  # Host (privileged) processes may send signals to container processes.
  signal (receive) peer=unconfined,
  # dockerd may send signals to container processes (for "docker kill").
  signal (receive) peer=unconfined,
  # Container processes may send signals amongst themselves.
  signal (send,receive) peer=docker-ecs-agent,

  deny @{PROC}/* w,   # deny write for all files directly in /proc (not in a subdir)
  # deny write to files not in /proc/<number>/** or /proc/sys/**
  deny @{PROC}/{[^1-9],[^1-9][^0-9],[^1-9s][^0-9y][^0-9s],[^1-9][^0-9][^0-9][^0-9]*}/** w,
  deny @{PROC}/sys/[^k]** w,  # deny /proc/sys except /proc/sys/k* (effectively /proc/sys/kernel)
  deny @{PROC}/sys/kernel/{?,??,[^s][^h][^m]**} w,  # deny everything except shm* in /proc/sys/kernel/
  deny @{PROC}/sysrq-trigger rwklx,
  deny @{PROC}/kcore rwklx,
  deny mount,
  deny /sys/[^f]*/** wklx,
  deny /sys/f[^s]*/** wklx,
  deny /sys/fs/[^c]*/** wklx,
  deny /sys/fs/c[^g]*/** wklx,
  deny /sys/fs/cg[^r]*/** wklx,
  deny /sys/firmware/** rwklx,
  deny /sys/kernel/security/** rwklx,

  # suppress ptrace denials when using 'docker ps' or using 'ps' inside a container
  ptrace (trace,read,tracedby,readby) peer=docker-ecs-agent,

  # suppress ptrace denials when agent and process-agent are accessing /proc
  ptrace (read),

  # The ECS Agent needs access to dbus in order to launch tasks
  dbus (send, receive, bind),
}

Then I ran systemctl reload apparmor to pick up the new profile, and finally ran the ECS agent container with --security-opt apparmor=docker-ecs-agent to use it.
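Before starting the agent, it can be worth confirming that the reload actually picked up the profile. A minimal sketch, assuming the kernel exposes the loaded-profile list at /sys/kernel/security/apparmor/profiles with one "name (mode)" entry per line (reading it usually requires root). The function name profile_loaded is hypothetical.

```shell
# profile_loaded NAME [FILE]: succeed if a profile called NAME is listed in FILE.
# FILE defaults to the AppArmor securityfs profile list.
profile_loaded() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit !found }' \
    "${2:-/sys/kernel/security/apparmor/profiles}"
}

# Usage (as root):
#   profile_loaded docker-ecs-agent && echo "profile is loaded"
```

If the profile is not listed, Docker will refuse to start a container that names it in --security-opt.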

stuart-warren commented 2 years ago

After talking to Canonical support about this, and just to get everything straight in my head, I believe the issue is:

Ubuntu 22 now uses cgroupv2, which is a change, so

https://github.com/aws/amazon-ecs-agent/blob/5fb8e801e9d11630c3470abb778291913ebb9d7f/agent/taskresource/cgroup/control/cgroupv2_controller_linux.go#L52

calls

https://github.com/aws/amazon-ecs-agent/blob/5fb8e801e9d11630c3470abb778291913ebb9d7f/agent/vendor/github.com/containerd/cgroups/v2/manager.go#L737

a function that attempts to call org.freedesktop.DBus.Hello as part of the connection process.

If that fails, it will try to use the /run/systemd/private socket directly, as mentioned above.

Ubuntu 22 allows the docker-default AppArmor profile to contact D-Bus, but only for peer-to-peer connections; it does not allow calling org.freedesktop.DBus.Hello.

ecs-init doesn't currently mount the /run/systemd/private socket into the ecs-agent container.

If you have the ability to tweak the AppArmor profile, then the above post may work for now. We are on Ubuntu Core 22 without that ability and have already had to patch ecs-init to make it start, so we will probably have to add the extra container mount point to our local patch.
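Since the whole chain above only applies on cgroup v2 hosts, a quick way to tell which mode a host is in may be useful. This is a sketch that relies on the cgroup.controllers file, which sits at the top of a unified cgroup v2 hierarchy; the function name is_cgroup_v2 is hypothetical.

```shell
# is_cgroup_v2 [ROOT]: succeed if ROOT (default /sys/fs/cgroup) looks like a
# unified cgroup v2 mount, identified by a top-level cgroup.controllers file.
is_cgroup_v2() {
  [ -f "${1:-/sys/fs/cgroup}/cgroup.controllers" ]
}

# Usage:
#   if is_cgroup_v2; then echo "cgroup v2"; else echo "cgroup v1 or hybrid"; fi
```

On a host that reports cgroup v1 or hybrid, the agent should not be taking the cgroupv2 code path linked above, so this particular D-Bus failure would not be expected there.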

sunds commented 2 years ago

Thanks for the additional detail.

I recommend you either run the agent with --security-opt apparmor:unconfined or load a new apparmor profile for Docker that allows the dbus call. Running the agent with unconfined should not increase risk as it already has broad permissions and host networking.

If you want to use a modified profile, the one posted by @shanet is good. If you want to double-check, start with the Docker default profile (https://github.com/moby/moby/tree/master/profiles/apparmor) and add the extra dbus directive. You can scope it a bit more tightly:

# ECS agent requires DBUS send
dbus (send)
  bus=system,

Here is my complete profile as of several weeks ago:

#include <tunables/global>

profile docker-default flags=(attach_disconnected, mediate_deleted) {

#include <abstractions/base>

network,
capability,
file,
umount,

# Host (privileged) processes may send signals to container processes.
signal (receive) peer=unconfined,
# dockerd may send signals to container processes (for "docker kill").
signal (receive) peer=unconfined,
# Container processes may send signals amongst themselves.
signal (send,receive) peer=docker-default,

# ECS agent requires DBUS send
dbus (send)
  bus=system,

deny @{PROC}/* w,   # deny write for all files directly in /proc (not in a subdir)
# deny write to files not in /proc/<number>/** or /proc/sys/**
deny @{PROC}/{[^1-9],[^1-9][^0-9],[^1-9s][^0-9y][^0-9s],[^1-9][^0-9][^0-9][^0-9/]*}/** w,
deny @{PROC}/sys/[^k]** w,  # deny /proc/sys except /proc/sys/k* (effectively /proc/sys/kernel)
deny @{PROC}/sys/kernel/{?,??,[^s][^h][^m]**} w,  # deny everything except shm* in /proc/sys/kernel/
deny @{PROC}/sysrq-trigger rwklx,
deny @{PROC}/kcore rwklx,

deny mount,

deny /sys/[^f]*/** wklx,
deny /sys/f[^s]*/** wklx,
deny /sys/fs/[^c]*/** wklx,
deny /sys/fs/c[^g]*/** wklx,
deny /sys/fs/cg[^r]*/** wklx,
deny /sys/firmware/** rwklx,
deny /sys/kernel/security/** rwklx,

# suppress ptrace denials when using 'docker ps' or using 'ps' inside a container
ptrace (trace,read,tracedby,readby) peer=docker-default,
}

Write this file into /etc/apparmor.d/docker-default

You can install docker and then overwrite the default profile with this command:

apparmor_parser -r /etc/apparmor.d/docker-default

If this works for your case, then a modified ecs-init should not be necessary.

Alternatively if you are modifying ecs-init you can run just the agent with the modified profile or unconfined.

--security-opt apparmor=your_agent_profile or --security-opt apparmor:unconfined

yoelvd commented 1 year ago

Thanks to @sunds and @shanet, today I was able to run some tasks in our ECS cluster on an external on-prem Docker instance running Ubuntu 22.04. Thanks again bros, keep it going!

chienhanlin commented 1 year ago

Thanks very much, @sunds and @shanet, for bringing up this issue and sharing the workaround with us. I was able to reproduce the issue and use the custom AppArmor profile as a workaround.


As Ubuntu 22.04 is not officially supported by ECS Anywhere, and workarounds are available, this issue will be closed. Please feel free to open new issues, and track the latest supported operating systems and system architectures via the public documentation.

Thanks.

sparrc commented 10 months ago

Hi everyone, this is now supported in agent/init version 1.80.0: https://github.com/aws/amazon-ecs-agent/releases.

Support was added via this PR: https://github.com/aws/amazon-ecs-agent/pull/4062

Working on updating the docs now.