MycroftAI / mycroft-core

Mycroft Core, the Mycroft Artificial Intelligence platform.
https://mycroft.ai
Apache License 2.0
6.51k stars 1.27k forks source link

Prevent critical failure of audio services #2988

Open krisgesling opened 3 years ago

krisgesling commented 3 years ago

Description

If an audioservice throws an exception when starting playback, the service_lock is never released. Playback cannot be initiated again until the Audio Service is restarted.

I haven't yet found exactly what's failing, and hence how to replicate the issue. Supposedly VLC was not entirely crashing, instead hanging and blocking forever in one of three methods in self.play():

        selected_service.clear_list()
        selected_service.add_list(tracks)
        selected_service.play(repeat)

Thanks to @JarbasAI @HelloChatterbox for catching this.

How to test

Need to find a way to reliably reproduce it first...

Contributor license agreement signed?

pep8speaks commented 3 years ago

Hello @krisgesling! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2021-08-31 20:09:49 UTC
forslund commented 3 years ago

The context manager for the lock handles the exception automatically so the explicit exception handling is not really necessary. The change looks good though, handling broken service objects :+1:

Buuut if there is a hang somewhere neither context manager or explicit exceptions would help...

devops-mycroft commented 3 years ago

Voight Kampff Integration Test Succeeded (Results)

forslund commented 3 years ago

@JarbasAI mentioned the VLC version in chat. It may be useful to update the requirement, it's been a long while.

krisgesling commented 3 years ago

@JarbasAI mentioned the VLC version in chat. It may be useful to update the requirement, it's been a long while.

Yeah, thought I'd do that in a separate PR. Potentially pull the service out to a plugin at the same time.

mcdonc commented 2 years ago

@JarbasAI mentioned the VLC version in chat. It may be useful to update the requirement, it's been a long while.

Yeah, thought I'd do that in a separate PR. Potentially pull the service out to a plugin at the same time.

It might be the place to start. 1.1.2 was released in 2015 (https://pypi.org/project/python-vlc/1.1.2/ ) and the most current release was April 2021.

mcdonc commented 2 years ago

FWIW, I locally updated python-vlc to the latest (python-vlc==3.0.12118), changed the default audio backend to vlc, and it indeed played the news and passed audiobackend unit tests successfully at least using the newer lib. I know this doesn't mean much for testing purposes, as the unit tests use a totally mocked vlc; I'll see if I can get the VK tests running in a config that prefers to use vlc.

Another potential workaround for hangs would be to run the service processes using Supervisor (http://supervisord.org). It has an event system that can be hooked to control processes that run under it. If, say, each service emitted some sort of "I'm-still-alive" event every so often, a supervisor event listener could be written that, were it to not receive such an event within some timeframe, it could kill and restart the service. A example of such an event listener is "superlance", which attempts to get a 200 OK from an http server every so often, and if it doesn't, will restart one or more processes that are run under supervisor. https://github.com/Supervisor/superlance/blob/master/superlance/httpok.py .

Anyway, introduction of supervisor would be a bit invasive, but its introduction might be practical in the face of buggy libs, tough-to-test edge cases and whatnot.

mcdonc commented 2 years ago

I created a branch in a fork of mycroft-core to document attempting to run the vk tests using the newer python-vlc lib version here: https://github.com/MycroftAI/mycroft-core/compare/dev...mcdonc:feature/newpythonvlc?expand=1

These tests passed, FWIW.

mcdonc commented 2 years ago

And here's another branch as a proof of concept to start "all" services under supervisor, just in case it interested you.

https://github.com/MycroftAI/mycroft-core/compare/dev...mcdonc:feature/supervisorservices?expand=1

It does not yet implement an eventlistener to restart hung services, but it would restart services when they crash. It also rotates service log files as necessary.

krisgesling commented 2 years ago

Thanks @mcdonc, I hadn't seen supervisord before, and haven't had a good look through the package yet, but we had been thinking about whether to go down a similar path: https://docs.google.com/document/d/1REXWaUt1HUrbyaunB09cfQ435krdNwhjVbMVVG3iNKs/edit#

We haven't started any work on it yet, and a big question for me is how much we can leverage what systemd already does vs what we need to control in a different way etc.

I'd be interested in any thoughts you have on it.

mcdonc commented 2 years ago

systemd user services would work just as well to restart crashed services, I think. supervisord was written before systemd existed, and has been maintained since then pretty well; I have used it on many projects but haven't really kept up with features added to systemd that it didn't have during that period. In particular, I don't know if systemd has the equivalent of supervisor's event listener system (http://supervisord.org/events.html) and log rotation. If it does, I'd go with systemd, unless it's of some maintenance benefit for the supervisor to be wrirtten in Python. Full disclosure: I am the original author of supervisord.

In any case, having some parent process restart crashed services would be a good start. I'm also not sure if the current configuration rotates log files; that can be a pain to configure in Python logging.

But having read your spec document, I definitely would not write another bespoke process manager.

EDIT: I looked at Jinx's comment to your process spec document and it seems that he has created a set of systemd units for Mycroft at https://github.com/j1nx/mycroft-systemd that looks promising by its description. I don't quite understand the "sd_notify" call nor the "software watchdog" he mentions in the document, but it likely covers most of what supervisor's event listener system might for you.