Open Rid opened 8 months ago
Perhaps if IO is overloaded you could just discard the output, that would be better than a hard exit, for example:
def read_log_files(self, units):
BUFSIZE = 8192
for unit in units:
if unit in self._log_file:
new_text = b""
while True:
buf = os.read(self._log_file[unit], BUFSIZE)
if not buf:
break
new_text += buf
text = self._log_hold[unit] + new_text
if not text:
continue
lines = text.split(b"\n")
if not text.endswith(b"\n"):
self._log_hold[unit] = lines[-1]
lines = lines[:-1]
for line in lines:
prefix = unit.encode("utf-8")
content = prefix + b": " + line + b"\n"
try:
os.write(1, content)
try:
os.fsync(1)
except Exception:
pass
except BlockingIOError:
pass
Sorry, looking at this a bit further it seems that systemctl might be contributing to the high IO. I applied the above patch so that it's no longer exiting, but where there are failing servicehelpers it seems to be producing excessive log outputs such as the below.
Unfortunately we do not have full control over the services inside the container as it's used much like a VPS (with strict security policies).
Should there be a back-off and then eventual failure for these servicehelpers? or can we rate-limit them?
strace: Process 3231129 attached
read(3, " [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15262' (uid=0 pid=386450 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15262' (uid=0 pid=386450 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15263' (uid=0 pid=386699 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15263' (u"..., 8192) = 8192
mmap(NULL, 75100160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7a5106800000
munmap(0x7a510b000000, 75091968) = 0
read(3, "us-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15275' (uid=0 pid=389750 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15276' (uid=0 pid=389982 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15276' (uid=0 pid=389982 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested "..., 8192) = 8192
mmap(NULL, 75108352, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7a510f800000
munmap(0x7a5106800000, 75100160) = 0
read(3, "return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15289' (uid=0 pid=393387 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15289' (uid=0 pid=393387 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15290' (uid=0 pid=393652 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timed"..., 8192) = 8192
mmap(NULL, 75116544, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7a510b000000
munmap(0x7a510f800000, 75108352) = 0
read(3, "ed with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15302' (uid=0 pid=396348 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15303' (uid=0 pid=396611 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15303' (uid=0 pid=396611 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.f"..., 8192) = 8192
mmap(NULL, 75124736, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7a5106800000
munmap(0x7a510b000000, 75116544) = 0
read(3, "unch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15316' (uid=0 pid=399622 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15316' (uid=0 pid=399622 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15317' (uid=0 pid=399841 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating ser"..., 8192) = 8192
mmap(NULL, 75132928, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7a510f800000
munmap(0x7a5106800000, 75124736) = 0
read(3, "ate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15329' (uid=0 pid=402640 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15330' (uid=0 pid=402851 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15330' (uid=0 pid=402851 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system"..., 8192) = 8192
mmap(NULL, 75141120, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7a510b000000
munmap(0x7a510f800000, 75132928) = 0
read(3, "reedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15343' (uid=0 pid=405936 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15343' (uid=0 pid=405936 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15344' (uid=0 pid=406154 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemo"..., 8192) = 8192
mmap(NULL, 75149312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7a5106800000
munmap(0x7a510b000000, 75141120) = 0
read(3, "d service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15356' (uid=0 pid=408942 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15357' (uid=0 pid=409205 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return code 1\ndbus-daemon[1068]: [system] Activating service name='org.freedesktop.timedate1' requested by ':1.15357' (uid=0 pid=409205 comm=\"/usr/lib/snapd/snapd \") (using servicehelper)\ndbus-daemon[1068]: [system] Activated service 'org.freedesktop.timedate1' failed: Launch helper exited with unknown return c"..., 8192) = 8192
The proposed patch is correct anyway.
But what I see is a number of services that can not start in a container anyway. There are some "igno_" lists in the python-file that excludes some enabled units from being included in the init-list of modules. Do we need to extend that list?
Additionally, try to run the container as a command-shell and check what services are included in the init-list
systemctl3.py default-services
If you see anything that looks like coming from the original systemd then it should be kicked out.
Thanks, here's the output:
root@ubtvnc:/# systemctl default-services
appbox_dbus.service
cron.service
nginx.service
panel.service
php8.1-fpm.service
snapd.aa-prompt-listener.service
snapd.apparmor.service
snapd.autoimport.service
snapd.core-fixup.service
snapd.recovery-chooser-trigger.service
snapd.seeded.service
snapd.service
ssh.service
vnstat.service
cups.service
docker.service
fail2ban.service
plymouth.service
pulseaudio-enable-autospawn.service
rsync.service
unattended-upgrades.service
uuidd.service
Is snapd already included the in ignore list?
I never had containerized a snapd app. Probably kill the appboxdbus service as well as I am suspicious that it runs addtional services. May be you can test a bit with adding patterns to the igno-lists?
Snapd doesn't work in docker as it requires real systemd, so we can add
igno_always = ["network*", "dbus*", "systemd-*", "kdump*", "kmod*", "snapd*"]
However some really big journal logs have been created for example:
-rw-r--r--+ 1 root systemd-journal 92M Mar 21 09:37 snapd.service.log
It seems now like it's using 100% CPU reading these log files as I can see it reading errors which have been resolved. Sorry for my ignorance, I'm not a systemd expert, but what's the reason we're reading the whole log file?
EDIT: Testing this on a few other containers with the issue, deleting the log files reduces the CPU usage from 100% to 0.7% even when services are failing, so this seems like the real issue regarding the resource usage.
Setting the buffersize to 10MB resolves the issue:
def read_log_files(self, units):
BUFSIZE=10485760
Are logs ever rotated? I know on normal systemd systemd-journald rotates the journal files
EDIT: Looks like it's not implemented (https://github.com/gdraheim/docker-systemctl-replacement/issues/41) are there any work arounds for this other than running a custom service to just rotate the log files (we would need to add jitter as we run many containers per node)?
Logs are not rotated, you are right, and the read-log-loop is about duplicating the log lines from each application service into the container log (docker log), so that messages from the applications become visible outside the container.
The current implementation runs the service processes with the log files attached, so it is not really possible to just rotate them as systemd does.
There has been just an idea floating around to punch holes in the log files aka sparse files. It may be possible to analyse systemd-journald taking over some idea. But I did not try so far. In reality it was good enough to fix the problems on the applications themselves that led to too many output lines.
Thanks @gdraheim, so I could use fallocate
to in essence delete the logs from the offset (0) to the length? I could setup a service to do this once per day and it would suffice.
Well, that sounds like a reasonable workaround. I hope that one day, the systemctl initloop could throw away parts beyond a certain limit.
I am thinking to just extend the code in a fork to add a function to fallocate the log files if they're over 5MB (and truncate them to 5MB), I would add a last run variable to check if it has run during the current day, and if not schedule it for a random time.
it's probably too opinionated to be merged, but I can make a PR if you're interested.
We run this on multiple webtop containers (https://github.com/linuxserver/docker-webtop).
The containers can sometimes become IO constrained due to heavy IO. This causes systemctl3.py to exit with the following error:
Once it exits, it's child processes are moved to PID 1 (we use s6 as the PID 1 and run systemctl as an s6 service to handle systemd services for convenience).
systemctl3.py then enters an infinite loop of trying to start the services which are already running. They exit immediately due to PID file lock contention, and then systemctl3.py tries to start them again.
This infinite loop uses 100% CPU:
Is there any way we can remove the IO timeout? Having systemctl hang is much more preferable to an exit, as the contention will eventually subside and resources will become available again.