gardener / gardener-extension-shoot-rsyslog-relp

Other
0 stars 15 forks source link

[Flaky Test] E2E test `pull-gardener-extension-shoot-rsyslog-relp-e2e-kind` is flaky #26

Closed plkokanov closed 11 months ago

plkokanov commented 1 year ago

How to categorize this issue?

/area testing /kind flake

Which test(s)/suite(s) are flaking: E2E test job pull-gardener-extension-shoot-rsyslog-relp-e2e-kind is failing with the following error:

  The function passed to Eventually failed at /home/prow/go/src/github.com/gardener/gardener-extension-shoot-rsyslog-relp/test/e2e/test_common.go:87 with:
  Expected
      <[]string | len:0, cap:0>: nil
  to contain element matching
      <*matchers.MatchRegexpMatcher | 0xc00196c090>: {
          Regexp: " local e2e-rslog-hib 3cd953d3-d513-4ed0-90f7-2c4cce4eaf67 .* test-program\\[\\d+\\]: .* this should get sent to echo server",
          Args: nil,
      }

More concretely the flake is in the Create Shoot with shoot-rsyslog-relp extension enabled, hibernate Shoot, reconcile Shoot, wake up Shoot, delete Shoot test.

CI link: https://prow.gardener.cloud/view/gs/gardener-prow/pr-logs/pull/gardener_gardener-extension-shoot-rsyslog-relp/24/pull-gardener-extension-shoot-rsyslog-relp-e2e-kind/1714922825560625152

Reason for failure: It looks like rsyslog fails to send logs to the rsyslog-relp-echo-server when the shoot-rsyslog-relp extension is already configured on a shoot cluster. However, the exact reason for this is unknown.

Anything else we need to know: A previous flake for this test was fixed in #11. Howeverr, this seems unrelated. After reproducing this flake locally and execing into the node I found the following logs for rsyslog:

Nov 01 13:47:51 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Main process exited, code=exited, status=1/FAILURE
Nov 01 13:47:51 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Failed with result 'exit-code'.
Nov 01 13:47:51 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Failed to start System Logging Service.
Nov 01 13:47:51 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Scheduled restart job, restart counter is at 1.
Nov 01 13:47:51 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Stopped System Logging Service.
Nov 01 13:47:51 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Starting System Logging Service...
Nov 01 13:47:51 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Main process exited, code=exited, status=1/FAILURE
Nov 01 13:47:51 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Failed with result 'exit-code'.
Nov 01 13:47:51 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Failed to start System Logging Service.
Nov 01 13:47:51 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Scheduled restart job, restart counter is at 2.
Nov 01 13:47:51 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Stopped System Logging Service.
Nov 01 13:47:51 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Starting System Logging Service...
Nov 01 13:47:52 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Main process exited, code=exited, status=1/FAILURE
Nov 01 13:47:52 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Failed with result 'exit-code'.
Nov 01 13:47:52 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Failed to start System Logging Service.
Nov 01 13:47:52 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Scheduled restart job, restart counter is at 3.
Nov 01 13:47:52 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Stopped System Logging Service.
Nov 01 13:47:52 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Starting System Logging Service...
Nov 01 13:47:52 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Main process exited, code=exited, status=1/FAILURE
Nov 01 13:47:52 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Failed with result 'exit-code'.
Nov 01 13:47:52 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Failed to start System Logging Service.
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Scheduled restart job, restart counter is at 4.
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Stopped System Logging Service.
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Starting System Logging Service...
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Main process exited, code=exited, status=1/FAILURE
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Failed with result 'exit-code'.
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Failed to start System Logging Service.
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Scheduled restart job, restart counter is at 5.
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Stopped System Logging Service.
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Starting System Logging Service...
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Main process exited, code=exited, status=1/FAILURE
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Failed with result 'exit-code'.
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Failed to start System Logging Service.
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Scheduled restart job, restart counter is at 6.
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Stopped System Logging Service.
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Starting System Logging Service...
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Main process exited, code=exited, status=1/FAILURE
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Failed with result 'exit-code'.
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Failed to start System Logging Service.
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Scheduled restart job, restart counter is at 7.
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Stopped System Logging Service.
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Starting System Logging Service...
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Main process exited, code=exited, status=1/FAILURE
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Failed with result 'exit-code'.
Nov 01 13:47:53 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Failed to start System Logging Service.
Nov 01 13:47:54 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Scheduled restart job, restart counter is at 8.
Nov 01 13:47:54 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Stopped System Logging Service.
Nov 01 13:47:54 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Start request repeated too quickly.
Nov 01 13:47:54 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: rsyslog.service: Failed with result 'exit-code'.
Nov 01 13:47:54 machine-shoot--local--e2e-rslog-hib-local-f5fd4-lpz5s systemd[1]: Failed to start System Logging Service.

It seems that rsyslog is stuck in a restart loop and after not being able to start for 8 times it gives up and remains stopped.

plkokanov commented 1 year ago

I noticed that there also appears to be a problem with the syslog.socket when this issue occurs:

root@machine-shoot--local--local-local-7d9f7-6t74v:/# systemctl status syslog
● rsyslog.service - System Logging Service
     Loaded: loaded (/lib/systemd/system/rsyslog.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Tue 2023-11-07 18:33:34 UTC; 4min 1s ago
TriggeredBy: ● syslog.socket
       Docs: man:rsyslogd(8)
             man:rsyslog.conf(5)
             https://www.rsyslog.com/doc/
   Main PID: 62884
      Tasks: 5 (limit: 3585)
     Memory: 1.9M
        CPU: 46ms
     CGroup: /system.slice/rsyslog.service
             └─62884 /usr/sbin/rsyslogd -n -iNONE

Nov 07 18:33:34 machine-shoot--local--local-local-7d9f7-6t74v systemd[1]: Starting System Logging Service...
Nov 07 18:33:35 machine-shoot--local--local-local-7d9f7-6t74v systemd[1]: Started System Logging Service.
root@machine-shoot--local--local-local-7d9f7-6t74v:/# journalctl -xe -u syslog.socket
-- Journal begins at Tue 2023-11-07 17:57:10 UTC, ends at Tue 2023-11-07 18:38:54 UTC. --
Nov 07 18:33:32 machine-shoot--local--local-local-7d9f7-6t74v systemd[1]: syslog.socket: Socket service syslog.service not loaded, refusing.
Nov 07 18:33:32 machine-shoot--local--local-local-7d9f7-6t74v systemd[1]: Failed to listen on Syslog Socket.
plkokanov commented 1 year ago

I have identified the reason for the error. Syslog logging on linux is comprised of a couple of components. The ones which are relevant for this issue are the following systemd units:

On the ubuntu OSes used for the shoot nodes in the E2E tests syslog.socket is not active by default and rsyslog.service and syslog.service do not exist. These get installed by the test code here: https://github.com/gardener/gardener-extension-shoot-rsyslog-relp/blob/36e7cd20441974b56a85dd89804fe42fdd772942/test/e2e/create_enabled_hibernate_reconcile_delete_test.go#L35-L36 The rsyslog.service unit is installed in a deactivated state before the syslog.service symlink is created. This means that there is a chance for the following script to run before syslog.service exists: https://github.com/gardener/gardener-extension-shoot-rsyslog-relp/blob/36e7cd20441974b56a85dd89804fe42fdd772942/charts/internal/rsyslog-relp-configurator/templates/configmap.yaml#L48 This will start rsyslog.service, however it will also start the syslog.socket unit, which errors when syslog.service is missing. It then reaches the maximum number of restarts and remains in na failed state. An additional restart either of rsyslog.service or syslog.socket after the syslog.service symlink is created is required to fix the problems.

One way to solve this issue would be to wait for the syslog.service symlink to exist before configuring and restart rsyslog.service, or to continuously attempt to start rsyslog.service if it is not currently running.