Make systemd unit file more reliable as a user service

pallaswept commented 10 months ago

TL;DR the service starts too soon, and retries too fast, always errors, and sometimes fails. This version, starts later, and retries slower, and succeeds first time every time.

Details:

I have noticed occasional failures of ydotool to start (looking at my logs, in the region of 1 time out of 10), and consistent (as in, always) generating errors before starting, as per the following journalctl logs:

-- Boot 95f07c2559b948719928e8685ad40332 --
Jan 22 10:06:42 localhost systemd[10668]: Started Starts ydotoold service.
Jan 22 10:06:42 localhost ydotoold[10683]: failed to open uinput device: Permission denied
Jan 22 10:06:42 localhost ydotoold[10683]: You're advised to run this program as root, or YMMV.
Jan 22 10:06:42 localhost systemd[10668]: ydotool.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 22 10:06:42 localhost systemd[10668]: ydotool.service: Failed with result 'exit-code'.
Jan 22 10:06:43 localhost systemd[10668]: ydotool.service: Scheduled restart job, restart counter is at 1.
Jan 22 10:06:43 localhost systemd[10668]: Started Starts ydotoold service.
Jan 22 10:06:43 localhost ydotoold[10687]: failed to open uinput device: Permission denied
Jan 22 10:06:43 localhost ydotoold[10687]: You're advised to run this program as root, or YMMV.
Jan 22 10:06:43 localhost systemd[10668]: ydotool.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 22 10:06:43 localhost systemd[10668]: ydotool.service: Failed with result 'exit-code'.
Jan 22 10:06:43 localhost systemd[10668]: ydotool.service: Scheduled restart job, restart counter is at 2.
Jan 22 10:06:43 localhost systemd[10668]: Started Starts ydotoold service.
Jan 22 10:06:43 localhost ydotoold[10694]: failed to open uinput device: Permission denied
Jan 22 10:06:43 localhost ydotoold[10694]: You're advised to run this program as root, or YMMV.
Jan 22 10:06:43 localhost systemd[10668]: ydotool.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 22 10:06:43 localhost systemd[10668]: ydotool.service: Failed with result 'exit-code'.
Jan 22 10:06:43 localhost systemd[10668]: ydotool.service: Scheduled restart job, restart counter is at 3.
Jan 22 10:06:43 localhost systemd[10668]: Started Starts ydotoold service.

A retry counter of 3 is about the usual, sometimes it would reach 4, sometimes it would pass the limit of 5 and all-out fail.

On the occasions where the service entirely failed, this would be the same, but reach the restart limit and fail:

-- Boot c0d11d526cb64ca288eaf51c51d6639c --
Jan 26 14:00:52 localhost systemd[10949]: Started Starts ydotoold service.
Jan 26 14:00:52 localhost ydotoold[10964]: failed to open uinput device: Permission denied
Jan 26 14:00:52 localhost ydotoold[10964]: You're advised to run this program as root, or YMMV.
Jan 26 14:00:52 localhost systemd[10949]: ydotool.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 26 14:00:52 localhost systemd[10949]: ydotool.service: Failed with result 'exit-code'.
Jan 26 14:00:52 localhost systemd[10949]: ydotool.service: Scheduled restart job, restart counter is at 1.
Jan 26 14:00:52 localhost systemd[10949]: Started Starts ydotoold service.
Jan 26 14:00:52 localhost ydotoold[10967]: failed to open uinput device: Permission denied
Jan 26 14:00:52 localhost ydotoold[10967]: You're advised to run this program as root, or YMMV.
Jan 26 14:00:52 localhost systemd[10949]: ydotool.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 26 14:00:52 localhost systemd[10949]: ydotool.service: Failed with result 'exit-code'.
Jan 26 14:00:52 localhost systemd[10949]: ydotool.service: Scheduled restart job, restart counter is at 2.
Jan 26 14:00:52 localhost systemd[10949]: Started Starts ydotoold service.
Jan 26 14:00:52 localhost ydotoold[10969]: failed to open uinput device: Permission denied
Jan 26 14:00:52 localhost ydotoold[10969]: You're advised to run this program as root, or YMMV.
Jan 26 14:00:52 localhost systemd[10949]: ydotool.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 26 14:00:52 localhost systemd[10949]: ydotool.service: Failed with result 'exit-code'.
Jan 26 14:00:53 localhost systemd[10949]: ydotool.service: Scheduled restart job, restart counter is at 3.
Jan 26 14:00:53 localhost systemd[10949]: Started Starts ydotoold service.
Jan 26 14:00:53 localhost ydotoold[10974]: failed to open uinput device: Permission denied
Jan 26 14:00:53 localhost ydotoold[10974]: You're advised to run this program as root, or YMMV.
Jan 26 14:00:53 localhost systemd[10949]: ydotool.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 26 14:00:53 localhost systemd[10949]: ydotool.service: Failed with result 'exit-code'.
Jan 26 14:00:53 localhost systemd[10949]: ydotool.service: Scheduled restart job, restart counter is at 4.
Jan 26 14:00:53 localhost systemd[10949]: Started Starts ydotoold service.
Jan 26 14:00:53 localhost ydotoold[10976]: failed to open uinput device: Permission denied
Jan 26 14:00:53 localhost ydotoold[10976]: You're advised to run this program as root, or YMMV.
Jan 26 14:00:53 localhost systemd[10949]: ydotool.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 26 14:00:53 localhost systemd[10949]: ydotool.service: Failed with result 'exit-code'.
Jan 26 14:00:53 localhost systemd[10949]: ydotool.service: Scheduled restart job, restart counter is at 5.
Jan 26 14:00:53 localhost systemd[10949]: ydotool.service: Start request repeated too quickly.
Jan 26 14:00:53 localhost systemd[10949]: ydotool.service: Failed with result 'exit-code'.
Jan 26 14:00:53 localhost systemd[10949]: Failed to start Starts ydotoold service.

Simply starting the service manually after boot when I realised it had failed, it would always start successfully, without error, immediately.

It was clear that the service was starting too soon, so my first port of call was to change the WantedBy target to something more appropriate. In order to determine the most appropriate candidate, I manually started the service after a failure, and employed systemd-analyze to find the chain which led to the successful start:

> systemd-analyze critical-chain --user ydotool.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

ydotool.service +2.038s
└─basic.target @152ms
  └─systemd-tmpfiles-setup.service @150ms +2ms
    └─app.slice @147ms
      └─-.slice @80ms

Accordingly, I changed the WantedBy to basic.target. This resulted in 100% successful starts, although it would sometimes have to retry 1 or 2 times, before starting. These are immediate retries (as per the above logs, they happen within the same second of time), so I took two measures to mitigate this:

Firstly, I added a 2 second delay before starting the service. 1 second would have been sufficient (refer to the timeframes in the above logs), but I am erring on the side of caution, for consideration of other people's PC's which may be slower. This was sufficient to result in 100% error-free startups.

To add an extra layer of resilience, I also extended the restart timer to 3 seconds, so that if there were a failure (not that I ever saw one, but other PCs may differ) it would be less likely to fail entirely after the 5 restarts (which would take a total of 15 seconds, an order of magnitude longer than any failure I ever saw). This should be more than enough to ensure a successful start, even in the case of an error, which should be unlikely anyway, given the modified target and delayed start.

Frankly, any one of these three measures would likely have been sufficient to ensure that the service always started, but the combination of the three results in a more reliable result and broader compatibility with differing hardware.

The cost to the user is a 2 second delay that would not be noticeable anyway, since even booting to a TTY this entire process should have completed before the shell starts, and certainly before the user could type a command; and any automated usage or services depending on ydotool should be employing systemd mechanics to wait for the service to start, anyway - so it can't cause any breakage in a properly configured system, and should not even be noticeable even in the most demanding of scenarios.

In the worst-case scenario where there would be a failure to start on the first attempt, and retries causing a longer delay (something I was not able to model, even in a VM with a single core limited to 3.6GHz and only 2 GB of memory), it is plausible that a user might notice that the service has not yet started - but in those cases, the service would have completely failed in the previous incarnation of this unit file, so the worst case scenario is still an improvement.

Of course, a complete failure to start is still possible with this unit file, but if that happens, something must be wrong with the system, and the root cause would be elsewhere.

I hope this patch is useful and the above description explains my analysis and the reasoning for the changes.

YHNdnzj commented 9 months ago

Sorry, but I'm pretty sure this not the right thing to do. First of all, this changes the target from default.target to basic.target with no convincing reason. And combining that with an arbitrarily-selected sleep really makes no sense.

If you want retries with a lower rate, use RestartSteps= and RestartMaxDelaySec=. If things start too early, please add more appropriate ordering deps. But adding a sleep and be done with it really won't make things more "reliable", but the opposite direction.

lilydjwg commented 9 months ago

I manually link ydotoold.service to my xprofile.target.wants/, which is started by my ~/.xprofile, at which time my session is surely active.

Restarting on failure in this case doesn't make any sense to me. Sometimes I do not start a GUI session. Instead, I ssh into it from another computer. ydotoold kept restarting and was very annoying so I moved it to my xprofile.target.

pallaswept commented 8 months ago

I'm pretty sure this not the right thing to do.

You may be right, ultimately - but consider that before remodelling the entire thing, it might make sense to try something which introduces a small changeset, and thus a lower risk of introducing a breaking change. Especially if you're a random unknown person contributing a patch, un-invited, and unknown to the maintainer. That is the intention, here; to keep my footprint small.

I'd prefer to do it the "Right Way", according to upstream documentation, so if you have some suggestions along those lines that might be an improvement to this, might I suggest you file a PR, and you are welcome to @ me in there if you want my input.

this changes the target from default.target to basic.target with no convincing reason

It follows systemd's docs...

arbitrarily-selected sleep

It wasn't arbitrary... but the dataset was stupid small. Feel free to expand it by buying me more computers :grin: or you could time all your PCs (or better yet, suggest a better dependent unit!)

Restarting on failure in this case doesn't make any sense to me

That was not introduced by this change.

Sometimes I do not start a GUI session.

This is why I didn't use graphical.target. It seems an obvious candidate at first, but there's a bunch of edge-cases one can imagine where the presence or lack of a GUI doesn't necessarily determine whether ydotool should try to start yet.

ReimuNotMoe / ydotool

Make systemd unit file more reliable as a user service #221