Hi @dm03514, thanks for filing the issue. I'd like a little more information before I can help though. Could you add the output of docker info
as well as your task definition? If you're not comfortable putting it on github, feel free to email me at jushay at amazon dot com.
Thanks, Justin
$ docker info
Containers: 3
Running: 3
Paused: 0
Stopped: 0
Images: 2
Server Version: 1.12.6
Storage Driver: devicemapper
Pool Name: docker-docker--pool
Pool Blocksize: 524.3 kB
Base Device Size: 10.74 GB
Backing Filesystem: ext4
Data file:
Metadata file:
Data Space Used: 1.752 GB
Data Space Total: 26.54 GB
Data Space Available: 24.79 GB
Metadata Space Used: 647.2 kB
Metadata Space Total: 29.36 MB
Metadata Space Available: 28.71 MB
Thin Pool Minimum Free Space: 2.654 GB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: true
Deferred Deleted Device Count: 0
Library Version: 1.02.93-RHEL7 (2015-01-28)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: host null bridge overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options:
Kernel Version: 4.4.51-40.58.amzn1.x86_64
Operating System: Amazon Linux AMI 2016.09
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.307 GiB
Name: ip-10-0-116-202
ID: XYQI:QZAZ:VXWH:MJPR:SMKN:ZYO4:EQUK:ZRTQ:27BO:6W25:6E2H:ONRX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
127.0.0.0/8
Emailing you the task definition. Thank you.
Thanks for the additional info @dm03514. Can you confirm a few suspicions I have?
Assuming those are both true, what you're seeing is related to https://github.com/aws/amazon-ecs-agent/issues/124#issuecomment-152307508. Docker's --memory flag, and the associated API, default to configuring swap memory equal to the amount requested in the flag.
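For illustration, a hedged example of that Docker behavior (image and values are placeholders, not from this issue): with only --memory set, the container can use an equal amount of swap on top of its limit; passing --memory-swap with the same value removes that headroom.
# With only --memory, the memory+swap limit defaults to twice the memory limit.
docker run -d --memory 512m busybox sleep 3600
# Setting --memory-swap equal to --memory leaves no room for swap.
docker run -d --memory 512m --memory-swap 512m busybox sleep 3600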
Thank you, I'll check first thing tomorrow morning (EST).
It looks like we do not have swap enabled, based on the output of free: there's no swap memory on the machines. The person helping us with AWS support also observed that IO burst balance was dropping.
Thank you for the info. I haven't been able to reproduce this on my end, but if you have repro steps, please let me know.
Could you send me the container instance arn, docker logs, and ecs-agent logs on an instance where this is happening? You can use the ECS Logs Collector to grab the logs as well as some helpful system logs.
We're debugging the exact same issue here, but I believe the issue lies with the kernel and not the ECS agent or even Docker (the OOM killer lives in the kernel).
Very basic containers (a few different nodejs-based apps, one collectd container) reach their memory limit, sit between 99.9 and 100% of the limit, and start chewing through IO reads on the docker volume, which eventually exhausts our burst balance, at which point the host (and other workloads) become pretty unhappy. The container may or may not eventually be OOM-killed, but not as soon as one would expect.
In one case I directly observed, docker stats
reported the container in question flapping between 99 and 100% usage, but it was only OOM-killed after almost an hour in that state. Syslogs confirm the kernel didn't consider killing it until then.
A few things that seem relevant to note:
- /cgroup/memory/docker/<container>/memory.memsw.limit_in_bytes is set to match /cgroup/memory/docker/<container>/memory.limit_in_bytes, to attempt to disable swap usage for all containers (where by default the memory+swap limit is 2X the memory limit); a sketch of this follows below
- vm.swappiness=0
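A minimal sketch of applying that memsw adjustment to every running container, assuming cgroup v1 is mounted at /cgroup as in the paths above, swap accounting is enabled, and it runs as root:
#!/bin/sh
# Mirror each container's memory limit into its memory+swap limit so the
# container cannot use swap on top of its memory limit.
for dir in /cgroup/memory/docker/*/; do
  limit=$(cat "$dir/memory.limit_in_bytes")
  echo "$limit" > "$dir/memory.memsw.limit_in_bytes" || echo "failed for $dir" >&2
done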
@dm03514 if we figure it out I'll make sure you hear about it, would appreciate the same!
@bobzoller absolutely, i was afraid the problem was going to be in the OS :( debbuging those sorts of issues is pretty over my experience level. Have you happened to have any success with any different kernel versions :p looking for the easy way out :)
@bobzoller absolutely, I was afraid the problem was going to be in the OS :( Debugging those sorts of issues is pretty far beyond my experience level. Have you had any success with different kernel versions? :p Looking for the easy way out :)
we're personally planning to investigate:
@bobzoller and/or @dm03514 I'd like to add one more avenue of investigation. Try increasing your task's memory limit a bit.
The page faults you're seeing are likely due to the page cache being partially flushed to free up more process memory. This, in turn, causes your application to need to re-read portions of itself (or its dependencies) from disk. Depending on your application's structure, this can cause a fairly tight feedback loop where for any unit of work to proceed, lots of disk IO will occur.
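As a quick check, one can watch the major page fault counter in the affected container's memory cgroup (a hedged sketch; the container ID is a placeholder, and cgroup v1 is assumed to be mounted at /cgroup as elsewhere in this thread):
# total_pgmajfault climbing rapidly while memory sits at the limit is the
# page-cache thrashing described above.
watch -n 5 'grep total_pgmajfault /cgroup/memory/docker/<container-id>/memory.stat'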
correct @jhaynes, as I said major page faults are absolutely the problem and increasing the memory limit will resolve it until you bump up against the limit again.
as we're striving for container isolation and protecting the health of the host, we chose to write a simple reaper that runs on every ECS instance and stops containers that have crossed a major page fault threshold we chose based on our environment (happy containers might cause 300/day, and sad containers can rack up hundreds of thousands within a few minutes). running it every minute using cron has been effective: these containers are now killed off within 60 seconds of them starting to thrash the disk, and the host recovers without intervention. ECS reschedules the container if necessary, and we notify the responsible engineer so they can investigate later. :ok_hand:
Our script looks something like this:
#!/bin/sh
# don't kill containers using these images even if they're misbehaving
EXCLUDES_PATTERN=$(cat <<'EOF' | xargs | sed 's/ /|/g'
amazon/amazon-ecs-agent
EOF
)
# list all the candidate containers
targets=$(docker ps --no-trunc --format '{{.ID}} {{.Image}}' | grep -Ev "$EXCLUDES_PATTERN" | awk '{ print $1; }' | xargs)
for target in $targets; do
  cd "/cgroup/memory/docker/$target" || exit
  info="id=$target $(docker inspect --format 'image={{.Config.Image}} StartedAt="{{.State.StartedAt}}"' "$target") pgmajfault=$(grep total_pgmajfault memory.stat | awk '{print $2;}')"
  value=$(echo "$info" | awk '{ print $4;}' | sed 's/pgmajfault=//g')
  if [ "$value" -gt 10000 ]; then
    echo "Executing docker stop on container due to $value major page faults ($info)"
    docker stop "$target" &
  fi
  cd - || exit
done
wait
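To run it every minute as described above, a cron entry along these lines should work (a sketch; the install path is illustrative):
# /etc/cron.d/container-reaper
* * * * * root /usr/local/bin/container-reaper.sh 2>&1 | logger -t container-reaper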
HTH!
@dm03514 I'm inclined to close this since it isn't directly related to an ECS issue. However, if you or @bobzoller wind up with other questions or issues, feel free to open bugs here or engage directly with AWS Support.
@jhaynes @bobzoller Just ran into this issue myself and am wondering whether an "out-of-agent" cron job is still the recommended course of action?
we still run our cron job "reaper" just in case, but since moving off Amazon Linux onto Ubuntu we haven't seen a single occurrence. I'd assume this is more to do with kernel version and less to do with distro, but I can't tell you for sure. FWIW we're currently running kernel 4.13.0-31-generic
on Ubuntu Xenial 16.04.
@bobzoller thanks. I am seeing this on an Amazon ECS-Optimized Amazon Linux AMI 2017.09.f
$ uname -r
4.9.75-25.55.amzn1.x86_64
Thanks @bobzoller for the wonderful script... It seems like the above script needs some updates for newly set up ECS hosts.
#!/bin/bash -e
##
# Use this annotated script as a base for killing containers that misbehave on reaching their memory limit
#
# Requirements:
# - `jq` must be installed on ecs machine
##
# don't kill containers using these images even if they're misbehaving
EXCLUDES_PATTERN=$(cat <<'EOF' | xargs | sed 's/ /|/g'
amazon/amazon-ecs-agent
EOF
)
# list all the candidate containers
targets=$(docker ps --no-trunc --format '{{.ID}} {{.Image}}' | grep -Ev "$EXCLUDES_PATTERN" | awk '{ print $1; }' | xargs)
for target in $targets; do
  # get the task and container IDs from the ECS agent introspection API
  task=$(curl -s "http://localhost:51678/v1/tasks?dockerid=$target")
  taskId=$(echo "$task" | jq -r ".Arn" | cut -d "/" -f 2)
  dockerId=$(echo "$task" | jq -r ".Containers[0].DockerId")
  memoryStatsFile="/cgroup/memory/ecs/$taskId/$dockerId/memory.stat"
  # skip the current target if its memory stats file cannot be found; it might not be managed by ECS
  if ! [ -f "$memoryStatsFile" ]; then
    echo "Memory stats not found for taskId=$taskId dockerId=$dockerId" && continue
  fi
  info="id=$target $(docker inspect --format 'image={{.Config.Image}} StartedAt="{{.State.StartedAt}}"' "$target") pgmajfault=$(grep total_pgmajfault "$memoryStatsFile" | awk '{print $2;}')"
  majorPageFaults=$(echo "$info" | awk '{ print $4;}' | sed 's/pgmajfault=//g')
  if [ "$majorPageFaults" -gt 5000 ]; then
    echo "Stopping container due to major page faults exceeding threshold ($info)"
    docker stop "$target"
  fi
done
We are also having the same problem on Amazon Linux AMI 2017.09. A container uses up all its available memory and starts thrashing reads. The container is pretty much unavailable until it's eventually killed off.
Besides the reaper cron, has anyone found a reasonable solution?
amzn-ami-2017.09.i-amazon-ecs-optimized is still affected by the issue. Is there any plan to provide a kernel compatible with Docker for the "ecs-optimized" AMI? The Ubuntu "solution" and the reaper-cron "solution" do not feel really sound.
We hit this issue ourselves when someone configured too little memory for a task.
I think one part of the problem is that the container never reaches its memory limit. I tested this by giving a container that requires 128MB RAM just to start only 8MB.
The container (according to quay.io/vektorlab/ctop, docker run --rm -ti --name=ctop -v /var/run/docker.sock:/var/run/docker.sock quay.io/vektorlab/ctop:latest) never reaches more than 6MB before thrashing the system with disk I/O. My expectations are probably wrong, but the hard limit is in my view the trigger point at which ecs-agent/docker/kernel should kill the process, since it is way outside the expected threshold of operation.
My biggest annoyance with this is that it is really hard to detect. I could use the script provided by vikalpj, log the output to a log group in CloudWatch, and trigger an alarm on new events. But that is not my expectation of the ECS product; I expect it to kill the container and inform me why. Now it just thrashes the disk.
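For what it's worth, a hedged sketch of the alerting idea above, using a custom CloudWatch metric rather than a log group so an alarm can fire on it (namespace, metric, and dimension names are made up):
# In the reaper, right after the docker stop call:
aws cloudwatch put-metric-data --namespace ContainerReaper --metric-name ContainersStopped --dimensions Image=example/app --value 1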
@jhaynes, are you open to re-opening this issue, or looking at alternatives to log this with ecs-agent?
Yes, all the fs cache has disappeared, but the application is not "out of memory". The application allocated only 6MB, but when the kernel needs to access the code of the application, it is not available in memory, so it has to read it from disk. It is as if it was running 100% on swap, except for the heap memory segment.
The workaround I have is to configure ECS tasks with a memory reservation (aka "soft" limit) big enough to fit the process image and all the files needed by the application. Then you hope that your application will never break the limit, or that if it does, it will be with a big allocation that breaks the limit at once, before any disk thrashing occurs, allowing the OOM killer to destroy your process.
Obviously you have to spend some time reading docker stats for your workload. And if your application leaks slowly you will hit the problem again and again.
Maybe some fine tuning of the sys/vm settings could fix or seriously alleviate the issue, but I would like to have the official ECS AMI configured with a correct setting.
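For reference, a hedged sketch of that workaround in task definition terms (family, image, and numbers are illustrative): memoryReservation is the soft limit sized to the working set, and memory is the hard limit above it.
aws ecs register-task-definition \
  --family example-app \
  --container-definitions '[{
    "name": "app",
    "image": "example/app:latest",
    "essential": true,
    "memoryReservation": 512,
    "memory": 768
  }]'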
This issue, or one very similar to it, appears to still be present (hello from the tail end of 2019). Is there any official documentation on how to approach it, as it appears to have been closed intentionally as not-fixed?
Memory limit enforcement is carried out by the host operating system's cgroups and OOM killer. Just as we don't expect the whole OS to shut down just because one of the processes it runs has eaten up the memory (*), we shouldn't expect that from containers. In fact, they are not much more than processes running on a system. What we usually observe is the OOM killer ending the processes that cause the exhaustion.
In the case of containers that, following good practice, contain only one process, a kill by the OOM killer has the effect of terminating the container, as that particular PID 1 process and the container are the same thing.
The problem begins when an additional manager is introduced inside the container, either by forking additional processes or by using tools like supervisor, systemd, etc. Here's an example with plain Docker:
docker run -it -m 1024M --memory-swap=1024M --entrypoint /bin/sh debian -c "apt-get update && apt-get install stress-ng -y; stress-ng --brk 4 --stack 4 --bigheap 4"
The container allocates almost all of the available memory. CPU usage and disk reads are high (the lack of memory means the application and library files can't be kept in the page cache and are constantly reread from disk for execution).
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
1d2ba95930f2 awesome_mendeleev 669.94% 1022MiB / 1GiB 99.76% 11.3MB / 44.6kB 18.6GB / 25.9MB 26
At the same time, the OOM killer tries to end the processes that caused the exhaustion:
Mar 04 18:08:52 thalap kernel: Memory cgroup out of memory: Killed process 5080 (stress-ng-brk) total-vm:191364kB, anon-rss:160880kB, file-rss:376kB, shmem-rss:4kB, UID:100000000 pgtables:384kB oom_score_adj:1000
Mar 04 18:08:52 thalap kernel: Memory cgroup out of memory: Killed process 5094 (stress-ng-bighe) total-vm:177008kB, anon-rss:146520kB, file-rss:504kB, shmem-rss:4kB, UID:100000000 pgtables:360kB oom_score_adj:1000
Mar 04 18:08:52 thalap kernel: oom_reaper: reaped process 5080 (stress-ng-brk), now anon-rss:0kB, file-rss:0kB, shmem-rss:4kB
Mar 04 18:08:52 thalap kernel: oom_reaper: reaped process 5094 (stress-ng-bighe), now anon-rss:0kB, file-rss:0kB, shmem-rss:4kB
...
which are then spawned again and again, as described in the stress-ng docs: "If the out of memory killer (OOM) on Linux kills the worker or the allocation fails then the allocating process starts all over again." Many apps behave similarly. What else could they do to keep working if some of their workers have been stopped or killed?
Theoretically, even if the enforcement of limits were the ECS agent's responsibility, memory usage stays below the given threshold as a result of the OS intervention, so the agent wouldn't be able to take any action.
How to approach that?
(*) For non-containerized systems, this is actually possible with the kernel setting vm.panic_on_oom = 1.
I am running a container, and when the hard task memory limit is reached it is not killed. In addition to not dying, it begins to do a large amount of docker.io.read_bytes (observed from the ECS Datadog integration).
Agent version: 1.14.1
Stats show that the container frequently reaches 100% memory usage and that BLOCK I/O perpetually increases (the application should only be using BLOCK I/O to read a configuration file during startup).
The container remains up:
Sometimes the agent IS able to kill the container after 10-20 minutes:
Also, once the container is in a 100% state, if I try to docker exec -it <container_id> /bin/bash, it will hang for a while and then register the SIGKILL, almost like it finally recognizes the SIGKILL only after I exec.
The daemonization feature and auto-restart are critical to keeping resource depletion failures from taking down other services, and I would really appreciate any insight.
Thank you