canonical / lightdm

Display Manager
GNU General Public License v3.0
845 stars 139 forks source link

LightDM failsafe handler and restart directives #340

Closed wallentx closed 5 months ago

wallentx commented 10 months ago

This PR is in reference to https://github.com/canonical/lightdm/issues/337

In a few different scenarios, I've experienced situations where a child process spawned by LightDM (such as x11) fails to start. As per the instructions in lightdm.service, upon failure, LightDM will Restart=always. The problem that can come from this, is when one of those services are failing for a very specific reason that requires the user to make changes. In a scenario where such a service will fail 100% of the time, this will cause LightDM to Restart=always 100% of the time, when will eventually trigger systemd's rate-limiting mechanism, resulting in logs that would appear like this:

systemd[1]: lightdm.service: Service hold-off time over, scheduling restart.
systemd[1]: Stopped LightDM.                        
systemd[1]: lightdm.service: Start request repeated too quickly.
systemd[1]: Failed to start LightDM.                
systemd[1]: lightdm.service: Unit entered failed state.       
systemd[1]: lightdm.service: Failed with result 'exit-code'. 

The problem is that, depending on the failure, the user might only see a black screen, with no indication that anything is wrong, except that the screen is black, and nothing is loading. The user would only be able to discover details in the aforementioned logs if they were to switch to a different TTY, log in, and check their logs. From here, they would be able to discover why the failure happened, and fix the issue.

The changes that I am proposing are essentially a shortcut to what I just described.

The lightdm.service file now includes StartLimit IntervalSec and Burst values, a RestartSec rate, and an OnFailure action. The values I've proposed I think make sense: The service has a restart delay of 5 seconds in between resets, and if the service restarts more than 5 times within a 60s period, then the service will be in a failed state. In the event of it being in a failed state, it triggers the OnFailure action:debian/lightdm-failure-handler.service

This service makes a copy of /etc/issue (the MOTD-like message you see above a TTY login) to /etc/issue.restore, and writes in a new message "hint" to /etc/issue that lets the user know what to do. It then changes the user to TTY4 (I only chose 4 because I've noticed others to be reserved for other things, or unavailable.. someone should check my reasoning here though) so that they can see that message, and log in to troubleshoot.

Upon successful login of LightDM, if an /etc/issue.restore file exists, it will replace /etc/issue (which presumably has the "hint" we wrote earlier). If no .restore file is found, no action is taken.

I couldn't find anything else in the repo specifically that referenced the packaging or manifest of the .service files, but maybe that is handled elsewhere.

github-actions[bot] commented 10 months ago

Everyone contributing to this PR have now signed the CLA. Thanks!

wallentx commented 10 months ago

Hey! wallentx has not signed the Canonical CLA which is required to get this contribution merged on this project. Please head over to ubuntu.com/legal/contributors to read more about it.

I signed this, btw, but maybe my launchpad ID was supposed to be my launchpad account email?

robert-ancell commented 10 months ago

Can you please resolve the above conflicts? Thanks.

robert-ancell commented 8 months ago

Sorry, @wallentx needs a rebase again.

iphands commented 8 months ago

This is great @wallentx I want this too!

Its rather maddening when you get caught in a restart loop and as you are at some tty, the restart occurs and in the middle of typing you get flipped back to tty1 as things fail.

Having some back off / retry limit would be great ootb config!

github-actions[bot] commented 7 months ago
Hey! wallentx has not signed the Canonical CLA which is required to get this contribution merged on this project. Please head over to https://ubuntu.com/legal/contributors to read more about it.
wallentx commented 5 months ago

Hey @robert-ancell I've been held up on attempting to fix my Canonical CLA issue because I found a bug/exploit/bypass with the Canonical cla-check itself, and have a PR that shows it's existence, and also includes the fix https://github.com/canonical/has-signed-canonical-cla/pull/38

If I fix my CLA, then the evidence of the flaw disappears. Do you know anyone who might be able to at least ack the details of my other PR so that I can be freed up to move forward here?

robert-ancell commented 5 months ago

@wallentx thanks for working on this! I've pinged the person who seems to be maintaining the CLA checker and hopefully they'll land that fix soon :)