Closed plkokanov closed 11 months ago
I noticed that there also appears to be a problem with the syslog.socket
when this issue occurs:
root@machine-shoot--local--local-local-7d9f7-6t74v:/# systemctl status syslog
● rsyslog.service - System Logging Service
Loaded: loaded (/lib/systemd/system/rsyslog.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Tue 2023-11-07 18:33:34 UTC; 4min 1s ago
TriggeredBy: ● syslog.socket
Docs: man:rsyslogd(8)
man:rsyslog.conf(5)
https://www.rsyslog.com/doc/
Main PID: 62884
Tasks: 5 (limit: 3585)
Memory: 1.9M
CPU: 46ms
CGroup: /system.slice/rsyslog.service
└─62884 /usr/sbin/rsyslogd -n -iNONE
Nov 07 18:33:34 machine-shoot--local--local-local-7d9f7-6t74v systemd[1]: Starting System Logging Service...
Nov 07 18:33:35 machine-shoot--local--local-local-7d9f7-6t74v systemd[1]: Started System Logging Service.
root@machine-shoot--local--local-local-7d9f7-6t74v:/# journalctl -xe -u syslog.socket
-- Journal begins at Tue 2023-11-07 17:57:10 UTC, ends at Tue 2023-11-07 18:38:54 UTC. --
Nov 07 18:33:32 machine-shoot--local--local-local-7d9f7-6t74v systemd[1]: syslog.socket: Socket service syslog.service not loaded, refusing.
Nov 07 18:33:32 machine-shoot--local--local-local-7d9f7-6t74v systemd[1]: Failed to listen on Syslog Socket.
I have identified the reason for the error. Syslog logging on linux is comprised of a couple of components. The ones which are relevant for this issue are the following systemd units:
syslog.service
- this is a symlink in /etc/systemd/system/syslog.service
which points to the current implementation of the syslog daemon. In our case that is /lib/systemd/system/rsyslog.service
ls -la /etc/systemd/system/syslog.service
lrwxrwxrwx 1 root root 35 Nov 8 13:20 /etc/systemd/system/syslog.service -> /lib/systemd/system/rsyslog.service
syslog.socket
- this unit takes care of activating the syslog.service
whenever it is required. By activating syslog.service
rsyslog.service
- this is the actual implementation of thee syslog daemon and will get activated by syslog.socket
when it activates syslog.service
On the ubuntu OSes used for the shoot nodes in the E2E
tests syslog.socket
is not active by default and rsyslog.service
and syslog.service
do not exist. These get installed by the test code here: https://github.com/gardener/gardener-extension-shoot-rsyslog-relp/blob/36e7cd20441974b56a85dd89804fe42fdd772942/test/e2e/create_enabled_hibernate_reconcile_delete_test.go#L35-L36
The rsyslog.service
unit is installed in a deactivated state before the syslog.service
symlink is created. This means that there is a chance for the following script to run before syslog.service
exists: https://github.com/gardener/gardener-extension-shoot-rsyslog-relp/blob/36e7cd20441974b56a85dd89804fe42fdd772942/charts/internal/rsyslog-relp-configurator/templates/configmap.yaml#L48
This will start rsyslog.service
, however it will also start the syslog.socket
unit, which errors when syslog.service
is missing. It then reaches the maximum number of restarts and remains in na failed state. An additional restart either of rsyslog.service
or syslog.socket
after the syslog.service
symlink is created is required to fix the problems.
One way to solve this issue would be to wait for the syslog.service
symlink to exist before configuring and restart rsyslog.service
, or to continuously attempt to start rsyslog.service
if it is not currently running.
How to categorize this issue?
/area testing /kind flake
Which test(s)/suite(s) are flaking: E2E test job
pull-gardener-extension-shoot-rsyslog-relp-e2e-kind
is failing with the following error:More concretely the flake is in the
Create Shoot with shoot-rsyslog-relp extension enabled, hibernate Shoot, reconcile Shoot, wake up Shoot, delete Shoot
test.CI link: https://prow.gardener.cloud/view/gs/gardener-prow/pr-logs/pull/gardener_gardener-extension-shoot-rsyslog-relp/24/pull-gardener-extension-shoot-rsyslog-relp-e2e-kind/1714922825560625152
Reason for failure: It looks like
rsyslog
fails to send logs to thersyslog-relp-echo-server
when theshoot-rsyslog-relp
extension is already configured on a shoot cluster. However, the exact reason for this is unknown.Anything else we need to know: A previous flake for this test was fixed in #11. Howeverr, this seems unrelated. After reproducing this flake locally and
exec
ing into the node I found the following logs for rsyslog:It seems that
rsyslog
is stuck in a restart loop and after not being able to start for 8 times it gives up and remains stopped.