Closed olljanat closed 1 year ago
Should be solved on v1.9.4-rc1 but needs more testing.
I'm not seeing this dhcpcd issue, so I don't think I can verify a fix.
As a sidebar - what are you using for host/process monitoring with Burmilla?
> I'm not seeing this dhcpcd issue, so I don't think I can verify a fix.
Yea, that is the tricky part: we see it on multiple servers, but not on all of them, so we need to run the new RC version for a couple of weeks on some of those problematic ones to be sure.
> As a sidebar - what are you using for host/process monitoring with Burmilla?
That picture is from Dynatrace, deployed as a container as described at https://www.dynatrace.com/support/help/setup-and-configuration/setup-on-container-platforms/docker/set-up-dynatrace-oneagent-as-docker-container#run-oneagent-as-a-docker-container
BurmillaOS is not supported by Dynatrace, but it looks to be working fine.
Thanks - that looks similar to how the Elastic Beats and Telegraf agent containers work. I wasn't sure whether something like that should run as a system service, or if there was a better way to manage those super-privileged containers.
In theory the optimal solution would be to run those as system-docker containers, but because system-docker runs inside of the initrd, most monitoring tools would not work there without heavy modifications.
Also, since we use the Debian console now, it is possible to install services inside of it if needed. For example, iscsid actually needs to run inside the console for those of us who use iSCSI.
Btw, I just found this issue which might affect our new RC version: https://github.com/moby/moby/issues/43262
Cool. Both the new Docker v20.10.13 (which, based on the release notes, fixes at least some OOM issues) and the new Buildroot LTS version 2022.02 look to have been released today, so I will prepare the 1.9.4 version based on those.
We see more servers appearing where this issue exists. Most probably it has something to do with the dhcpcd log size, etc.
I found out that the issue happens on servers where a lot of containers are coming and going. I used this Docker stack on both v1.9.3 and v1.9.5-rc1:
```yaml
version: "3.4"
services:
  alpine:
    image: alpine
    command: sleep 30s
    deploy:
      mode: replicated
      replicas: 10
```
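For reference, the reproduction can be scripted; the file name `churn-stack.yml` and stack name `churn` below are my own choices, and the deploy command assumes a swarm-mode host:

```shell
# Write the churn stack shown above to a file (file name is arbitrary).
cat > churn-stack.yml <<'EOF'
version: "3.4"
services:
  alpine:
    image: alpine
    command: sleep 30s
    deploy:
      mode: replicated
      replicas: 10
EOF

# On a swarm-mode BurmillaOS host, deploy it and watch the network
# system container's memory (shown commented, for illustration only):
# docker stack deploy -c churn-stack.yml churn
# sudo system-docker stats --no-stream network
```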
Unfortunately it looks like the issue still happens on 1.9.5-rc1 as well (maybe the situation is a little less bad, but it is still there). However, a new thing which I noticed is that if I use more aggressive settings, like a 1s sleep and 100 replicas, then dhcpcd also starts using a lot of CPU, so it is definitely also listening to DHCP requests from the containers, which it shouldn't do.
So I will try the configuration proposed at https://unix.stackexchange.com/a/634852 next.
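For context, that kind of restriction is expressed in dhcpcd.conf roughly like this (a sketch based on dhcpcd's `allowinterfaces`/`denyinterfaces` options, not copied from the linked answer):

```
# /etc/dhcpcd.conf (sketch): only manage the physical uplink(s)
allowinterfaces eth*
# ignore the veth pairs Docker creates for containers
denyinterfaces veth*
```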
Extending the cloud-init config with this one (`sudo ros config merge -i memlimit.yml`) looks to be a working workaround which can be deployed to all existing servers:
```yaml
rancher:
  services:
    network:
      restart: always
      mem_limit: 20971520
```
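As a sanity check on the number, `mem_limit` is given in bytes and 20971520 bytes is exactly 20 MiB. A small sketch of preparing the workaround file follows; the merge/restart commands assume a BurmillaOS host and are shown commented out:

```shell
# Write the workaround from above to memlimit.yml (file name is arbitrary).
cat > memlimit.yml <<'EOF'
rancher:
  services:
    network:
      restart: always
      mem_limit: 20971520
EOF

# mem_limit is in bytes: 20 MiB = 20 * 1024 * 1024 bytes.
echo $((20 * 1024 * 1024))

# Apply on a BurmillaOS host (illustration only):
# sudo ros config merge -i memlimit.yml
# sudo system-docker restart network
```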
We have a hardware host with v1.9.5 where the network container permanently runs out of memory. When the host is idle, the network container has a memory usage of 18MB. I had to raise the memory limit from 20MB to 30MB to stop the network container from permanently restarting.
I have already set `denyinterfaces veth* eth1 eth2 eth3` in the network container's /etc/dhcpcd.conf to exclude the Docker interfaces and the unconnected hardware interfaces (we use only eth0), but after a network container restart the memory usage is still 18MB.
Anything I can do to debug this? The container logs don't show any helpful messages.
@netsandbox how long does the network container stay running when the memory limit is 30 MB? 20 MB was just a randomly selected number, so it might be that the limit is too tight.
> Anything I can do to debug this?
Not easily. However, I see that there are quite a few commits in dhcpcd after the 9.4.1 release (https://github.com/NetworkConfiguration/dhcpcd/compare/dhcpcd-9.4.1...master) and at least two of those refer to memory leaks.
We get dhcpcd from Buildroot: https://github.com/buildroot/buildroot/blob/e644e5df39c4d63ce7ae28ce2d02bfbf2a230cff/package/dhcpcd/dhcpcd.mk#L7
So we should probably try building dhcpcd from the latest version in their repo, and if that looks like it fixes the issue, request them to release a new version and get it updated in Buildroot.
When I had a look at the host this morning, I saw that there were still network container restarts in the middle of the night. So I have now increased the memory limit from 30MB to 50MB.
We have planned to upgrade the host from v1.9.5 to v1.9.6 tomorrow. Both versions still use the same dhcpcd version, but maybe the memory problem is related to a kernel library which is used for our network interfaces. I will keep an eye on the memory usage after the upgrade and then report back here.
I think that this is actually the same bug as https://github.com/NetworkConfiguration/dhcpcd/issues/157, which is already fixed, and the plan looks to be that a new dhcpcd version will be released after https://github.com/NetworkConfiguration/dhcpcd/issues/149 is fixed.
However, the os-base build tooling made by Rancher looks to support patches, so I managed to build a new version of dhcpcd with that single patch included: https://github.com/burmilla/os-base/blob/c810a8a2c1818ed36bfe4e8b625c3ad7d497026d/patches/dhcpcd-9.4.1-with-405507a.patch
That is now included in the just-released v2.0.0-beta6.
Additionally, you can update the network container on an existing v1.9.6 installation by running these commands:
```shell
sudo system-docker pull burmilla/os-base:v1.9.6-dhcpcd-patched1
sudo ros config set rancher.services.network.image burmilla/os-base:v1.9.6-dhcpcd-patched1
```
and rebooting. But take a backup/snapshot of the server first and make sure that the image was pulled successfully before running the second command. Otherwise the console will not appear on the next boot at all.
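Since a missing image here means losing the console, the pull can be double-checked before touching the config. A small sketch (the helper name is mine; the engine is passed as an argument only so the same check can be exercised outside BurmillaOS):

```shell
# check_image ENGINE IMAGE: succeed only if ENGINE reports the image as
# present locally ("image inspect" exits non-zero for missing images).
check_image() {
  engine="$1"
  image="$2"
  # $engine is intentionally unquoted so it may be e.g. "sudo system-docker"
  $engine image inspect "$image" > /dev/null 2>&1
}

# On a BurmillaOS host (illustration only):
# check_image "sudo system-docker" burmilla/os-base:v1.9.6-dhcpcd-patched1 \
#   && echo "image present, safe to switch" \
#   || echo "image missing, do NOT change the config"
```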
After setting the network container memory limit to 50MB, we have seen no container restarts in the last 2 weeks.
I saw that you increased the limit for v1.9.7-rc1 to 100MB, which looks reasonable. Thanks!
Regarding the network container memory usage increase: in the last 2 weeks the usage increased on one day from 27.24MiB to 27.31MiB, and then stayed stable at that value. So from here I don't see anything that looks like a memory leak.
But I have to admit that I don't know how many container starts and stops happened during this time, because we currently have no monitoring for this in place.
We are seeing very high dhcpcd memory usage in our environment with multiple Burmilla nodes:
Burmilla v1.9.3 uses dhcpcd v9.4.0, and there is a later version, 9.4.1, available. The difference can be seen at https://github.com/NetworkConfiguration/dhcpcd/compare/dhcpcd-9.4.0...dhcpcd-9.4.1; at a quick look, it sounds like the issue may already be fixed by https://github.com/NetworkConfiguration/dhcpcd/commit/ba9f3823ae825c341ea30f45b46d942b4ce5b8d9