coreos / fedora-coreos-pinger

Telemetry service to be used in Fedora CoreOS (see https://github.com/coreos/fedora-coreos-tracker/issues/86)
Apache License 2.0
15 stars 7 forks source link

systemd: add persistent state and report with different intervals #35

Closed zonggen closed 4 years ago

zonggen commented 4 years ago

Add StateDirecory=fedora-coreos-pinger to the systemd file and store different timestamps in the persistent files under /var/lib/fedora-coreos-pinger, for example (timestamp_daily and timestamp_monthly). Report when intervals between timestamps are more than one day or one month.

Signed-off-by: Allen Bai abai@redhat.com

zonggen commented 4 years ago

Tested by overwriting the systemd file with /overrides/rootfs/lib/systemd/system/fedora-coreos-pinger.service before cosa build && cosa run.

Got denied permission error on startup:

$ systemctl status fedora-coreos-pinger.service

Oct 22 20:08:15 coreos systemd[1]: Starting Telemetry service for Fedora CoreOS...
Oct 22 20:08:15 coreos systemd[1134]: fedora-coreos-pinger.service: Failed to execute command: No such file or directory
Oct 22 20:08:15 coreos systemd[1134]: fedora-coreos-pinger.service: Failed at step EXEC spawning /usr/libexec/fedora-coreos-pinger: No such file or directory
Oct 22 20:08:15 coreos systemd[1]: fedora-coreos-pinger.service: Main process exited, code=exited, status=203/EXEC
Oct 22 20:08:15 coreos systemd[1]: fedora-coreos-pinger.service: Failed with result 'exit-code'.
Oct 22 20:08:15 coreos systemd[1]: Failed to start Telemetry service for Fedora CoreOS.

EDIT: Seems like it is not the systemd causing the problem since overriding with the original systemd file will have the same error.

Investigating ...

SOLVED: Caused by missing /overrides/rootfs/usr/libexec/fedora-coreos-pinger when doing cosa build.

zonggen commented 4 years ago

Currently the program runs without producing any error, however the user@981.service keeps failing for me with the following message:

[core@coreos ~]$ sudo systemctl status user@981.service 
● user@981.service - User Manager for UID 981
   Loaded: loaded (/usr/lib/systemd/system/user@.service; static; vendor preset: disabled)
   Active: failed (Result: protocol) since Fri 2019-10-25 19:11:22 UTC; 702ms ago
     Docs: man:user@.service(5)
  Process: 1367 ExecStart=/usr/lib/systemd/systemd --user (code=exited, status=224/PAM)
 Main PID: 1367 (code=exited, status=224/PAM)

Oct 25 19:11:22 coreos systemd[1]: Starting User Manager for UID 981...
Oct 25 19:11:22 coreos systemd[1367]: pam_unix(systemd-user:account): account fedora-coreos-pinger has expired (account expired)
Oct 25 19:11:22 coreos systemd[1367]: PAM failed: User account has expired
Oct 25 19:11:22 coreos systemd[1367]: user@981.service: Failed to set up PAM session: Operation not permitted
Oct 25 19:11:22 coreos systemd[1367]: user@981.service: Failed at step PAM spawning /usr/lib/systemd/systemd: Operation not permitted
Oct 25 19:11:22 coreos systemd[1]: user@981.service: Failed with result 'protocol'.
Oct 25 19:11:22 coreos systemd[1]: Failed to start User Manager for UID 981.
Oct 25 19:11:22 coreos systemd[1]: Dependency failed for User Manager for UID 981.
Oct 25 19:11:22 coreos systemd[1]: user@981.service: Job user@981.service/start failed with result 'dependency'.

This might because the parent process in the main function spawn a child process and exit immediately in order to make pinger a Daemon process. And accessing the account after the parent process exits might cause the user@981.service to fail. Insights from other people would be helpful..

jlebon commented 4 years ago

There's no need to fork here; we're not starting the pinger from a terminal, but via systemd. The Type=forking is more of a legacy thing for daemons that still do this.

jlebon commented 4 years ago

This has some good information: http://0pointer.de./public/systemd-man/daemon.html

zonggen commented 4 years ago

@jlebon Thanks for the link! I originally thought both old and new are spawned by systemd. Testing without forking..

zonggen commented 4 years ago

user@981.service still failing with

[core@coreos ~]$ sudo systemctl status user@981.service
● user@981.service - User Manager for UID 981
   Loaded: loaded (/usr/lib/systemd/system/user@.service; static; vendor preset: disabled)
   Active: failed (Result: protocol) since Wed 2019-10-30 14:00:26 UTC; 944ms ago
     Docs: man:user@.service(5)
  Process: 1365 ExecStart=/usr/lib/systemd/systemd --user (code=exited, status=224/PAM)
 Main PID: 1365 (code=exited, status=224/PAM)

Oct 30 14:00:26 coreos systemd[1365]: user@981.service: Failed to set up PAM session: Operation not permitted
Oct 30 14:00:26 coreos systemd[1365]: user@981.service: Failed at step PAM spawning /usr/lib/systemd/systemd: Operation not permitted
Oct 30 14:00:26 coreos systemd[1]: user@981.service: Failed with result 'protocol'.
Oct 30 14:00:26 coreos systemd[1]: Failed to start User Manager for UID 981.
Oct 30 14:00:26 coreos systemd[1]: Dependency failed for User Manager for UID 981.
Oct 30 14:00:26 coreos systemd[1]: user@981.service: Job user@981.service/start failed with result 'dependency'.
Oct 30 14:00:26 coreos systemd[1]: Dependency failed for User Manager for UID 981.
Oct 30 14:00:26 coreos systemd[1]: user@981.service: Job user@981.service/start failed with result 'dependency'.
Oct 30 14:00:26 coreos systemd[1]: Dependency failed for User Manager for UID 981.
Oct 30 14:00:26 coreos systemd[1]: user@981.service: Job user@981.service/start failed with result 'dependency'.
[core@coreos ~]$  systemctl list-units --state=failed
  UNIT                         LOAD   ACTIVE SUB    DESCRIPTION                         
● user-runtime-dir@981.service loaded failed failed User Runtime Directory /run/user/981
● user@981.service             loaded failed failed User Manager for UID 981            

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

2 loaded units listed.

Investigating..

EDIT: It turns out that calls to container_runtime.rs module would cause the failure, further investigating ..

EDIT2: Found the issue here, running podman container ls would give error: Error: could not get runtime: cannot make directory /.config/containers: mkdir /.config: read-only file system. I think what happened is that podman was trying to create a temporary directory but failed since the root directory is read-only.

zonggen commented 4 years ago

Container runtime issue resolved by calling pgrep instead of individual container runtimes. Also, keep using Type=oneshot and RemainAfterExit=yes instead of forking.

zonggen commented 4 years ago

Tested in a local FCOS machine, ready for reviews.

EDIT: Rebooting the machine would cause the pinger service fail with:

[core@coreos ~]$ sudo systemctl status fedora-coreos-pinger.service 
● fedora-coreos-pinger.service - Telemetry service for Fedora CoreOS
   Loaded: loaded (/usr/lib/systemd/system/fedora-coreos-pinger.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2019-11-25 17:15:35 UTC; 1min 27s ago
     Docs: https://github.com/coreos/fedora-coreos-pinger
  Process: 818 ExecStart=/usr/libexec/fedora-coreos-pinger (code=exited, status=238/STATE_DIRECTORY)
 Main PID: 818 (code=exited, status=238/STATE_DIRECTORY)

Nov 25 17:15:35 coreos systemd[1]: Starting Telemetry service for Fedora CoreOS...
Nov 25 17:15:35 coreos systemd[818]: fedora-coreos-pinger.service: Failed to set up special execution directory in /var/lib: Permission denied
Nov 25 17:15:35 coreos systemd[818]: fedora-coreos-pinger.service: Failed at step STATE_DIRECTORY spawning /usr/libexec/fedora-coreos-pinger: Permission denied
Nov 25 17:15:35 coreos systemd[1]: fedora-coreos-pinger.service: Main process exited, code=exited, status=238/STATE_DIRECTORY
Nov 25 17:15:35 coreos systemd[1]: fedora-coreos-pinger.service: Failed with result 'exit-code'.
Nov 25 17:15:35 coreos systemd[1]: Failed to start Telemetry service for Fedora CoreOS.

Looks like the service failed to set up the state directory which belongs to the dynamic user. Found this issue when running tests with cosa kola, some of the tests would fail when some of the systemd units fail.

EDIT2: After investigation, it seems like systemd would try to recreate the persistent directory /var/lib/fedora-coreos-pinger every time it boots, and this would generate the permission error reported above. Now, I'm trying to find a way for systemd create the persistent directory only once and only recreate the directory when it is removed/not present.

zonggen commented 4 years ago

It turns out that systemd does not allow sharing persistent directories across different dynamic users, which is odd and might be a bug. However, using LogsDirectory= will work and therefore switching.

Since the data collections PR will dependent on this PR, changes are committed there (link:https://github.com/coreos/fedora-coreos-pinger/pull/31/commits/ecf60b24de55ccdad9ffc96ab89622708b160eb0)

zonggen commented 4 years ago

This PR should not be merged and only stays as a reference, since works are merged with the data collection PR. Can (and probably should) be closed.

zonggen commented 4 years ago

Closing