hpakniamina opened this issue 2 years ago
I have the same issue on Arch, also not consistently reproducible. Docker version 20.10.23, build 715524332f
I did not need the networking features of the container, so passing "--network none" on the docker run command line circumvented the problem:
docker run ... --network none ...
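For illustration, this is roughly the shape of our calls (the image name and file names are hypothetical); since the container only needs stdin/stdout, dropping networking entirely sidesteps the failing veth setup:

# minimal sketch: no network namespace wiring is created, so the failing veth pair setup never runs
docker run --rm --network none -i example/converter-image < input.dat > output.dat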
It's happening to me when I am building my images. Sadly, it too cannot be reproduced consistently.
docker build ...
I have the same behavior with the docker build command (cannot allocate memory)
# docker version
Client: Docker Engine - Community
Version: 23.0.0
API version: 1.42
Go version: go1.19.5
Git commit: e92dd87
Built: Wed Feb 1 17:47:51 2023
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 23.0.0
API version: 1.42 (minimum version 1.12)
Go version: go1.19.5
Git commit: d7573ab
Built: Wed Feb 1 17:47:51 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.16
GitCommit: 31aa4358a36870b21a992d3ad2bef29e1d693bec
runc:
Version: 1.1.4
GitCommit: v1.1.4-0-g5fd4c4d
docker-init:
Version: 0.19.0
GitCommit: de40ad0
# apt list --installed | grep docker
docker-buildx-plugin/jammy,now 0.10.2-1~ubuntu.22.04~jammy amd64 [installed,automatic]
docker-ce-cli/jammy,now 5:23.0.0-1~ubuntu.22.04~jammy amd64 [installed]
docker-ce-rootless-extras/jammy,now 5:23.0.0-1~ubuntu.22.04~jammy amd64 [installed,automatic]
docker-ce/jammy,now 5:23.0.0-1~ubuntu.22.04~jammy amd64 [installed]
docker-compose-plugin/jammy,now 2.15.1-1~ubuntu.22.04~jammy amd64 [installed,automatic]
docker-scan-plugin/jammy,now 0.23.0~ubuntu-jammy amd64 [installed,automatic]
Exactly the same issue here during docker build, on Rocky Linux 8.7 (RHEL 8.7 clone), Docker 20.10.22-3.el8
I fixed the problem by running the docker builder prune command and then running the build again: https://docs.docker.com/engine/reference/commandline/builder_prune
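For reference, a sketch of the prune step (the until filter value is just an example for keeping recent cache):

# drop the whole BuildKit build cache without prompting
docker builder prune --force
# or only drop cache entries older than 24 hours, keeping recent layers warm
docker builder prune --force --filter until=24h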
If one is dealing with an intermittent problem, then there is no guarantee the issue is resolved.
Same problem here: every so often a build fails with failed to add the host ( ) <=> sandbox ( ) pair interfaces: cannot allocate memory. System info:
$ dnf list --installed docker\* containerd\* | cat
Installed Packages
containerd.io.x86_64 1.6.20-3.1.el8 @docker-ce-stable
docker-buildx-plugin.x86_64 0.10.4-1.el8 @docker-ce-stable
docker-ce.x86_64 3:23.0.2-1.el8 @docker-ce-stable
docker-ce-cli.x86_64 1:23.0.2-1.el8 @docker-ce-stable
docker-ce-rootless-extras.x86_64 23.0.2-1.el8 @docker-ce-stable
docker-compose-plugin.x86_64 2.17.2-1.el8 @docker-ce-stable
docker-scan-plugin.x86_64 0.23.0-3.el8 @docker-ce-stable
$ sudo docker info
Client:
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.10.4
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.17.2
Path: /usr/libexec/docker/cli-plugins/docker-compose
scan: Docker Scan (Docker Inc.)
Version: v0.23.0
Path: /usr/libexec/docker/cli-plugins/docker-scan
Server:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 55
Server Version: 23.0.2
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 2806fc1057397dbaeefbea0e4e17bddfbd388f38
runc version: v1.1.5-0-gf19387a
init version: de40ad0
Security Options:
seccomp
Profile: builtin
Kernel Version: 4.18.0-425.13.1.el8_7.x86_64
Operating System: Rocky Linux 8.7 (Green Obsidian)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.4GiB
Name: x
ID: NUAJ:VDZR:RMDC:ASCP:5SEG:D4EF:OEIW:RY57:VXYI:5EZV:6F4F:D5RO
Docker Root Dir: /opt/docker_data
Debug Mode: false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
127.0.0.0/8
Registry Mirrors:
https://x/
Live Restore Enabled: false
Default Address Pools:
Base: 172.17.0.0/16, Size: 24
Base: 172.20.0.0/16, Size: 24
Base: 172.30.0.0/16, Size: 24
If I understand correctly, this is the same as https://bbs.archlinux.org/viewtopic.php?id=282429 which is fixed by this patch queued here.
I don't know if this helps but it's happening to me on Rocky Linux 8.7 as well, just like @hostalp.
We have been seeing the same issue on Ubuntu 20.04 for a few weeks.
/cc @akerouanton FYI (I see a potential kernel issue mentioned above)
We have the problem with an older kernel (5.15), so I do not think that there is a connection with the mentioned kernel bug.
I have the same problem with Debian 12 (6.1.0-9-amd64), but no problem with Debian 11 (5.10.0-21-amd64).
Same Problem on Ubuntu 22.04
Linux 5.15.0-76-generic #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
This error is really annoying since all of our CI/CD pipelines fail randomly.
Same problem on Ubuntu 22.04 on a server.
Same here. Moved a server to Debian 12 which is starting a new container once per day to create backups of docker volumes. After some days or weeks, starting the container fails until I restart the docker daemon.
I compared three of my servers with the failing one, and the only difference is that vm.swappiness is set to 0 and the server has no swap activated at all, in case that helps.
We disabled swap on our failing servers, but it did not help.
I thought of it the other way around and wanted to check whether enabling swap helps.
That was EXACTLY the case on our servers (vm.swappiness set to 0 and no swap activated).
Changing the value in /etc/sysctl.conf from vm.swappiness = 0 to vm.swappiness = 60 and applying it with sysctl -p solved it for me. You saved my life! ;) I had forgotten that I had set this value.
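In case it helps anyone else, this is roughly what the change looks like (it assumes vm.swappiness is already set to 0 in /etc/sysctl.conf, as it was in my case):

# show the currently active value
sysctl vm.swappiness
# change the persisted setting from 0 back to the kernel default of 60
sudo sed -i 's/^vm.swappiness = 0$/vm.swappiness = 60/' /etc/sysctl.conf
# reload /etc/sysctl.conf so the new value takes effect immediately
sudo sysctl -p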
Did anyone fix this error without enabling swap? On that specific server it is not possible to enable swap...
Configured swap as suggested here (disabled image pruning before each run) and the error appeared again after a few days.
I am still trying to fix this on Ubuntu 22.04 without swap. My next guess is that I misconfigured something in my compose files, which leads to a high number of network connections being left open. I am not sure whether that is the cause or whether it really is due to a kernel error. I will report my findings here next week. If anyone has figured it out, please feel free to comment.
As mentioned before, we did not need the networking, so "--network none" helped us work around it. We don't use docker compose. We simply call docker a couple of thousand times; the container reads the input, writes the output, and is removed by "--rm". Our issue does not have anything to do with weird configurations or docker compose.
Have the same problem. vm.swappiness = 60 helped for a while, but now the problem is back again.
Same problem here. We have 89 services in our compose file. Running docker system prune before the build usually solves the problem temporarily.
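Roughly what our pre-build step looks like, in case it is useful (the image name is just a placeholder):

# remove stopped containers, unused networks, dangling images and build cache
docker system prune --force
# then run the build as usual
docker build -t example/app .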
The problem being seemingly random, I don't think changing vm.swappiness actually fixed it; it went away like it generally does and came back later. Some weeks we see this error every day, some weeks it doesn't happen at all.
This can happen if your system is running low on memory, or if the Docker daemon is configured to use a large amount of memory.
Increase the amount of memory that the Docker daemon is allowed to use. You can do this by editing the dockerd configuration file. The default setting is 2 gigabytes.
Clean up Docker resources: Docker may have leftover resources that are not properly cleaned up. You can remove unused containers, images, and volumes to free up resources.
I really don't think that this is the issue here. Our systems have more than enough resources and there is no limit set for dockerd.
I'm having the same issue
dpkg-deb (subprocess): decompressing archive '/var/cache/apt/archives/strace_6.1-0.1_amd64.deb' (size=1313880) member 'control.tar': lzma error: Cannot allocate memory
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors
dpkg-deb: error: tar subprocess returned error exit status 2
dpkg: error processing archive /var/cache/apt/archives/strace_6.1-0.1_amd64.deb (--unpack):
dpkg-deb --control subprocess returned error exit status 2
Errors were encountered while processing:
/var/cache/apt/archives/libunwind8_1.6.2-3_amd64.deb
/var/cache/apt/archives/strace_6.1-0.1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
E: Problem executing scripts DPkg::Post-Invoke 'rm -f /var/cache/apt/archives/*.deb /var/cache/apt/archives/partial/*.deb /var/cache/apt/*.bin || true'
E: Sub-process returned an error code
However, this happens with the latest Debian Bookworm image on an older version of Docker that does not support it. It was solved by modifying the configuration and adding --security-opt seccomp=unconfined to the run command.
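For completeness, this is roughly what the run command looks like with that option (the image and command are only an example); note that it disables seccomp filtering for that container:

# run without the default seccomp profile so newer syscalls are not blocked
docker run --rm --security-opt seccomp=unconfined debian:bookworm apt-get update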
Your error is from a program inside the container (apparently due to seccomp, or a corrupted download, packaging, or memory issue); this issue is about an error in Docker itself.
Is anybody on the Docker team looking at this thread, or is this just an echo chamber?
@henryborchers I'm not sure. For better visibility I also opened the issue here. But there is also no activity aside from some tags assigned by @thaJeztah like in this issue.
For reference: it has been a few weeks and I have been monitoring my "ill" server that has had this problem. It has swarm enabled, and I have been tweaking the network sections of my docker compose files. Each compose file has a standard network named "internal" to which only the services of that specific compose file are connected. These are deployed in different stacks, but they all had the same network name. Docker never reported any error when deploying the stacks, and through my own fault I did not notice that there was one "internal" network shared by all compose files instead of many. I think Docker had some trouble creating that network in the background, which led to it being recreated every time I deployed one of those stacks, even though a network named "internal" already existed. I noticed that a few weeks ago and gave all internal networks a new name. I also did not update Docker for a few days, just to be sure the problem came from the wrong network settings. Since then I haven't had the error.
I hope this helps anyone.
Interesting finding. We do not specify any networks, but we also see the error during the build, so that shouldn't make a difference. Seems like there are multiple sources for this error.
I am trying to debug this issue as well after we started seeing this on servers upgraded from Ubuntu 20 (5.4.0-105-generic) to Ubuntu 22 (5.15.0-86-generic). We have not noticed this error on the old machines, but suddenly we see it in multiple places where the common denominator is the newer OS.
The error is basically what has been mentioned before:
Error response from daemon: failed to create endpoint container.name on network bridge: failed to add the host (veth6bff3fa) <=> sandbox (vethed97567) pair interfaces: cannot allocate memory
and looking in the journal we see messages from the kernel reporting a page allocation failure (order:5).
From just a quick glance at what is going on here, it appears the kernel is unable to allocate 128 KiB of contiguous memory as requested. We have >8 GB of RAM free, but apparently this does not mean much, since it can be chopped up into many smaller pieces, which makes this call unable to complete.
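If it helps anyone else debug this, this is roughly how we looked at fragmentation; an order:5 allocation is 2^5 pages, i.e. 128 KiB of contiguous memory on a 4 KiB page system:

# free blocks per allocation order (columns 0..10); low counts at order >= 5
# mean a 128 KiB contiguous allocation can fail despite plenty of free RAM
cat /proc/buddyinfo
# total free memory alone says nothing about fragmentation
grep MemFree /proc/meminfo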
Doing some quick reading, I tested executing the following two commands (as the root user):
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
and this appears to help with the issue. However, on our high-workload server the error returned after only a couple of minutes, while the low-workload server has gone >24 hours without reporting it.
Throwing this into the conversation in hope that we manage to get further in understanding what goes wrong.
@JonasAlfredsson If this is the case then there is a clear connection to the already linked Haskell issue, which refers to this kernel fix:
mas_empty_area() was incorrectly returning an error when there was room.
The maple tree was only introduced recently, so it makes sense that the problem did not occur on an older kernel.
We for example use 6.2.0-azure, which does not have the fix, but the most recent 6.1.57 has it. So you have to be very careful which version you use, because some distros are of the opinion that you only need the security fixes and that other bugs are not important...
I can also report that this doesn't appear to happen on an Arch Linux machine that starts a lot of containers for CI jobs (which I keep in my production environment for things like this), while the Debian Bookworm machine next to it experiences this all the time. That would also support that patch as the underlying issue.
Bookworm is at 6.1.55 right now, so I guess I'll wait for the next kernel update and see if the problem disappears?
We updated our Ubuntu 22.04 servers to kernel 6.5 (new from the Ubuntu 23.10 release). This kernel has the above-mentioned patch, and so far the error has not appeared. I will report back if we see the error again.
Got a comment from a person more familiar with the kernel than me that the page allocation failure: order:5 seemed very large for the system we have, and they suggested explicitly setting the nr_cpus kernel parameter at boot, since this apparently has an effect on how much memory the system allocates for this call.
We have a VM with 12 cores exposed to it (while the physical host has many more), so I did the following:
echo 'GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT nr_cpus=12"' > /etc/default/grub.d/cpu-limit.cfg
update-grub
reboot
and while the reboot definitely helped reset the system to a non-fragmented state, we have now seen 36 hours without errors. I will also update this post in case we see the error on this host again.
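To check that the parameter actually took effect after the reboot, something like this should do:

# the kernel command line should now contain nr_cpus=12
cat /proc/cmdline
# and the number of online CPUs should match that limit
nproc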
We haven't had any errors since updating our kernel to 6.5, so I think this resolves the issue.
We just did the same a few weeks ago, but updating the kernel did not fix our problem. Was there anything else you did after the kernel update?
Nothing I'm aware of. Before that we also tried different things which were suggested here (mostly playing around with swap), but this did not have a direct effect. Now swap is enabled and vm.swappiness = 60
Kernel in use:
Linux ****** 6.5.0-1004-oem #4-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 15 19:52:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
@CoyoteWAN Today, one month later, we had the same error again.
We applied the nr_cpus workaround suggested by @JonasAlfredsson two weeks ago and have not seen the error since. We were already running kernel 6.5 before, but that did not help in our case.
Just checked the collected logs from the two servers I experimented on: with the nr_cpus patch, 28 days without the error. Will apply the patch to the other servers now and see what happens.
We have had similar results on RedHat 8.9 systems with this total kludge in cron:
*/5 * * * * sh -c 'echo 3 > /proc/sys/vm/drop_caches; echo 1 > /proc/sys/vm/compact_memory'
Update 2024-01-08: we still see sporadic failures but only during peak activity when we have a bunch of GitLab CI scheduled builds launching at the same time which keeps every CI runner fully loaded. It's still infrequent then so the workaround is clearly having some impact but is not a solution.
I'm wondering: has anyone in this thread tried to report this issue to the kernel devs or to their distro? This issue isn't actionable on our side until we're able to reproduce it, and AFAICT we're not, so there's little we can do at this point.
I've been seeing this issue for some time too. I've tried tweaking swappiness and dropping caches, but it's never helped for long, or only improves the chances of things working. The only thing I've found that resolves this is a full reboot, and then it's just a matter of time until it happens again. I think I've tried restarting the docker daemon (and all containers), but I don't remember, so I will give that a go this time.
My last boot was 2023-12-20, and it started occurring again on 2024-01-12... the probability appears to be zero for a while, then starts to increase over time, until docker is virtually unusable, and I'm forced to reboot.
As for reproducing it: I don't think an idle system will show this, but rather a system that creates / destroys many containers (or quite possibly veth devices specifically) will probably work its way to this point over time.
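A rough sketch of the kind of churn I mean, in case someone wants to try reproducing it (the loop count and image are arbitrary):

# start and remove many short-lived containers to exercise veth creation/teardown
for i in $(seq 1 1000); do
  docker run --rm alpine true || echo "failed at iteration $i"
done
# count the veth interfaces currently left on the host
ip -o link show type veth | wc -l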
From a quick look at metrics recorded by Telegraf, nothing in particular stands out, though I did notice a large number of defunct / zombie processes, so we'll see if dealing with them gives my system a new lease of life (I'm not hopeful).
The dmesg output I see is similar to @JonasAlfredsson's (here).
OS: Red Hat Enterprise Linux release 8.7 (Ootpa)
Version:
Out of hundreds of docker calls made over days, a few of them fail. This is the schema of the command line:
The failure:
It is not easily reproducible. The failure rate is less than one percent. At the time this error happens, the system has lots of free memory. Around the time the failure happens, the application is making around 5 docker calls per second, and each call takes about 5 to 10 seconds to complete.