charlie-foxtrot / RTLSDR-Airband

Multichannel AM/NFM demodulator
GNU General Public License v3.0
758 stars 135 forks source link

systemd service file doesn't restart service on crash (non zero exit) #74

Closed Kami closed 6 years ago

Kami commented 7 years ago

First, thanks for working on this project and great work.

Current systemd service file doesn't restart service on crash, it contains Restart=no - https://github.com/szpajder/RTLSDR-Airband/blob/master/init.d/rtl_airband.service.

I assume this way done to avoid going in a restart loop in case service quite due to an invalid configuration file (this could be prevented by setting StartLimitBurst and StartLimitIntervalSec option.)

I think it would still make sense to restart service on crash, because there scenarios where service crashes and restart brings it back up without any other interaction needed.

For example, from my log file:

Oct 18 15:06:01 raspberrypi-zagorje rtl_airband[3346]: Connecting to <url>...
Oct 18 15:06:01 raspberrypi-zagorje rtl_airband[3346]: Connected to <url>
Oct 18 15:06:02 raspberrypi-zagorje rtl_airband[3346]: Device #0: gain set to 15.70 dB
Oct 18 15:06:02 raspberrypi-zagorje rtl_airband[3346]: Device 0 started.
Oct 19 00:56:23 raspberrypi-zagorje rtl_airband[3346]: Exceeded max backlog for <url>, disconnecting
Oct 19 00:56:23 raspberrypi-zagorje rtl_airband[3346]: Lost connection to <url>
Oct 19 00:56:33 raspberrypi-zagorje rtl_airband[3346]: Trying to reconnect to <url>...
Oct 19 00:56:33 raspberrypi-zagorje rtl_airband[3346]: Connecting to <url>...
Oct 19 00:56:33 raspberrypi-zagorje rtl_airband[3346]: Connected to <url>
Oct 19 01:50:07 raspberrypi-zagorje rtl_airband[3346]: Lost connection to <url>
Oct 19 01:50:14 raspberrypi-zagorje rtl_airband[3346]: Trying to reconnect to <url>...
Oct 19 01:50:14 raspberrypi-zagorje rtl_airband[3346]: Connecting to <url>...
Oct 19 01:50:15 raspberrypi-zagorje rtl_airband[3346]: Connected to <url>
Oct 19 10:11:10 raspberrypi-zagorje rtl_airband[3346]: Device #0: async read failed, disabling
Oct 19 10:11:10 raspberrypi-zagorje rtl_airband[3346]: Disabling mixer 'mixer1' - all inputs died
Oct 19 10:11:10 raspberrypi-zagorje rtl_airband[3346]: Closing connection to <url>
Oct 19 10:11:10 raspberrypi-zagorje rtl_airband[3346]: All receivers failed, exiting
Oct 19 10:11:10 raspberrypi-zagorje rtl_airband[3346]: Freeing GPU memory

This happens quite often and simply restarting the service brings it back up and solves the problem (Restart=on-failure in systemd service file).

szpajder commented 7 years ago

From my experience, rtl_airband aborts are most often caused by USB connectivity problems (eg. due to RPi being underpowered). Once the dongle gets disconnected, there is no point in restarting rtl_airband before the dongle is reconnected manually. That's why I decided to disable automatic restarting. The supplied service file is just an example, there is no "one size fits all" configuration. You are free to modify it as you wish.

barjac commented 6 years ago

I have two dongles one for fr24feed and one for rtl_airband. Feeding pulse output to two servers with one off line I find that the rtl_airband service closes at random after maybe an hour or so. It's a Pi2 and does not seem heavily loaded. Load average in top is around 0.5. I have added the following in the systemd unit file [Service] section and it restarts the service, so far without a problem: Restart=always RestartSec=3

With the above the sequence of events when the service dies is:

rtl_airband.service: Main process exited, code=killed, status=6/ABRT
rtl_airband.service: Unit entered failed state.
rtl_airband.service: Failed with result 'signal'.
rtl_airband.service: Service hold-off time over, scheduling restart.
Stopped SDR AM/NFM demodulator.
Starting SDR AM/NFM demodulator...
RTLSDR-Airband version 2.4.0 starting
Started SDR AM/NFM demodulator.
pulse: 192.168.1.65: connecting...
pulse: 192.168.1.126: connecting...
Device #1: gain set to 25.40 dB
Device 1 started.
pulse: 192.168.1.65: connected
192.168.1.65: stream "Airband" connected
pulse: 192.168.1.126: connection failed: Connection refused

Googling the error I'm wondering if the service stopping is systemd related? Something along these lines: https://www.google.co.uk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=7&cad=rja&uact=8&ved=0ahUKEwjJtKqAqP7YAhUHrRQKHQZZCE4QFghSMAY&url=https%3A%2F%2Fgithub.com%2Fsystemd%2Fsystemd%2Fissues%2F2942&usg=AOvVaw17sFYN2yTjZJeiAh0UVvRA

szpajder commented 6 years ago

It might be a bug in rtl_airband or in Pulseaudio. Too little information here.

Are there any log entries before "Main process exited" line? Messages like "Assertion (...) failed" could shed some light on this.

barjac commented 6 years ago

Out_of_mem.txt I have attached a journal output and system state from top. After 8 hours continuous it was fine - an hour later the above happened which repeats for ever. Looks like the only way out is a re-boot, but memory state looks fine to me. How can we get more verbosity from rtl_airband? Would gdb or strace help or just be too much for the Pi? Certainly no Asserts anywhere.

szpajder commented 6 years ago

If there was a SIGABRT (which is visible in the log), then it was almost certainly triggered by a failed assertion.

Could you run the program by hand, in the foreground (ie. with the -f option) and let it run until it breaks? You should see a failed assertion printed to the screen then.

It might also be valuable to enable coredumps beforehand (ie. run "ulimit -c unlimited" in the terminal just before starting rtl_airband from there).

barjac commented 6 years ago

Here you go:

15/ 10 121.850 13/ 10 121.850 13/ 10 118.575rtl_airband: asyncns.c:1500: asyncns_isdone: Assertion `q' failed. Aborted0 118.575 pi@raspberrypi:~ $

There is a core, which I'm guessing may be useful to get a backtrace if you can advise. I am accessing RPi over ssh only.

barjac commented 6 years ago

I am not familiar with debian package management or the debuginfo package naming they use, so without any extra debuginfo this is what I got: bt.txt Is it any use?

szpajder commented 6 years ago

Thanks for this, assertion message is enough, I've located the place of the crash. It's Pulseaudio calling asyncns_isdone function (from libasyncdns) with a NULL pointer as an argument. Obviously this is incorrect, but before blaming Pulseaudio I'll double-check if it's not due to rtl_airband interacting with Pulseaudio in some wrong way.

Which Raspbian version are you running? (lsb_release -a will tell you if you're not sure).

barjac commented 6 years ago

Raspbian GNU/Linux 9.1 (stretch)

szpajder commented 6 years ago

I managed to get a pulse-related crash on my Pi, but with a completely different assertion failure.

Could you please run the program in foreground once or twice again and observe if it always fails with the same error message or with different ones? I suspect a race condition here, which may cause the program to crash randomly at various places in the code.

If you add some more pulse outputs to bogus servers in your config, you will hopefully make the crash more probable. It might save you some time.

Thanks in advance.

barjac commented 6 years ago

Only one so far same as before - that took 4 hours with 4 streams! 11/ 9 133.125 10/ 9 133.125rtl_airband: asyncns.c:1500: asyncns_isdone: Assertion `q' failed. Aborted9 133.125

It's quiet here at night so may get more tomorrow. I'm assuming it won't happen with nothing to stream.

barjac commented 6 years ago

Ah another different! 10/ 10 121.850 13/ 10 121.850 13/ 10 118.575Assertion '!m->rebuild_pollfds' failed at pulse/mainloop.c:841, function pa_mainloop_poll(). Aborting. Aborted0 121.850

barjac commented 6 years ago

And another, Assertion '!m->rebuild_pollfds' failed at pulse/mainloop.c:841, function pa_mainloop_poll(). Aborting.

szpajder commented 6 years ago

The root cause of all these is probably the same. I can reproduce it now, so I can troubleshoot by myself. I'll get back to you once I have a fix.

In the meantime, just disable pulse outputs to non-working servers. It shall work reliably then.

barjac commented 6 years ago

Great news! Thanks. I notice running manually with rtl_airband -f that there is always a 'PLL not locked!' warning, yet all works fine. Maybe this is being checked too soon before the PLL has chance to lock?

szpajder commented 6 years ago

'PLL not locked' is harmless.

I think I fixed the problem - the fix is currently in unstable branch. You can get it from the web (https://github.com/szpajder/RTLSDR-Airband/tree/unstable) via the green button at the top right or just install Git client, if you don't have it (apt-get install git) and then run git clone https://github.com/szpajder/RTLSDR-Airband.git. It will create a directory RTLSDR-Airband in your current directory, so first move your old source directory out of the way. Then compile, install and run as usual.

I've also made a couple of other changes:

The goal of these changes is to have better cooperation with systemd. When running under systemd with -F -e command line options, rtl_airband will not print its log messages directly to syslog, but to standard error. It's systemd's job to capture and save them to the log file. The benefit is that any assertion failures (if any) will also get written to the logs, so it will not be necessary to run rtl_airband in foreground to see them.

It's not mandatory to use the new options, but if you wish, then edit your unit file /etc/systemd/system/rtl_airband.service and do the following:

Then save and run sudo systemctl daemon-reload to update systemd's running configuration.

The example unit file init.d/rtl_airband.service has been updated to reflect these changes (see https://github.com/szpajder/RTLSDR-Airband/blob/unstable/init.d/rtl_airband.service).

barjac commented 6 years ago

OK it's up and running. I put the Restart=no back in the unit file. Thanks for the nice instructions - you only forgot one step 'git checkout unstable' so it choked on the -F and I realized that was not the right source. We will await the morning to see if it's still up. Finger's crossed :)

barjac commented 6 years ago

Excellent - well done! It has now been rock solid since last night after suspending either or both machines running PA servers multiple times this morning. One thing I have seen (in either version) is that when a machine is suspended and resumed it gains an extra volume slider in the Plasma5 volume control dialogue (under applications tab -> playback streams) in the systray. I had seven on the laptop before re-booting it. Any ideas on whether this may be a PA or plasma5 issue?

szpajder commented 6 years ago

Great, thanks for reporting, testing and feedback.

You may use pacmd list-sink-inputs command on the PA server machine to get the current list of streams from PulseAudio. If these zombie streams are there, it means PulseAudio haven't noticed that they are gone. If they are not there, it's probably a fault on the volume control app side.

barjac commented 6 years ago

Seems then that PA is not seeing that they are gone. So would you suggest a bug report to PA upstream, or is there something about the way the streams are opened, that keeps them for ever?

[baz@leno ~]$ pacmd list-sink-inputs 5 sink input(s) available. index: 0 driver: flags: START_CORKED state: DRAINED sink: 0 volume: mono: 50415 / 77% / -6.84 dB balance 0.00 muted: no current latency: 0.00 ms requested latency: 105.00 ms sample spec: float32le 1ch 8000Hz channel map: mono Mono resample method: ffmpeg module: 12 client: 0 properties:
media.name = "AirbandLeno"
application.name = "rtl_airband"
native-protocol.peer = "TCP/IP client from 192.168.1.96:37514"
native-protocol.version = "32"
application.process.id = "22992"
application.process.user = "root"
application.process.host = "raspberrypi"
application.process.binary = "rtl_airband"
application.language = "C"
application.process.machine_id = "41b2082c495a49eb98535e0841ee519c"
module-stream-restore.id = "sink-input-by-application-name:rtl_airband" index: 7 driver: flags: START_CORKED state: DRAINED sink: 0 volume: mono: 50415 / 77% / -6.84 dB balance 0.00 muted: no current latency: 0.00 ms requested latency: 105.00 ms sample spec: float32le 1ch 8000Hz channel map: mono Mono resample method: ffmpeg module: 12 client: 17 properties: media.name = "AirbandLeno" application.name = "rtl_airband" native-protocol.peer = "TCP/IP client from 192.168.1.96:37724" native-protocol.version = "32" application.process.id = "22992" application.process.user = "root" application.process.host = "raspberrypi" application.process.binary = "rtl_airband" application.language = "C" application.process.machine_id = "41b2082c495a49eb98535e0841ee519c" module-stream-restore.id = "sink-input-by-application-name:rtl_airband" index: 15 driver: flags: START_CORKED state: DRAINED sink: 0 volume: mono: 54395 / 83% / -4.86 dB balance 0.00 muted: no current latency: 0.00 ms requested latency: 105.00 ms sample spec: float32le 1ch 8000Hz channel map: mono Mono resample method: ffmpeg module: 12 client: 30 properties: media.name = "AirbandLeno" application.name = "rtl_airband" native-protocol.peer = "TCP/IP client from 192.168.1.96:39304" native-protocol.version = "32" application.process.id = "22992" application.process.user = "root" application.process.host = "raspberrypi" application.process.binary = "rtl_airband" application.language = "C" application.process.machine_id = "41b2082c495a49eb98535e0841ee519c" module-stream-restore.id = "sink-input-by-application-name:rtl_airband" index: 80 driver: flags: START_CORKED state: DRAINED sink: 0 volume: mono: 54395 / 83% / -4.86 dB balance 0.00 muted: no current latency: 0.00 ms requested latency: 105.00 ms sample spec: float32le 1ch 8000Hz channel map: mono Mono resample method: ffmpeg module: 12 client: 39 properties: media.name = "AirbandLeno" application.name = "rtl_airband" native-protocol.peer = "TCP/IP client from 192.168.1.96:40268" native-protocol.version = "32" application.process.id = "22992" application.process.user = "root" application.process.host = "raspberrypi" application.process.binary = "rtl_airband" application.language = "C" application.process.machine_id = "41b2082c495a49eb98535e0841ee519c" module-stream-restore.id = "sink-input-by-application-name:rtl_airband" index: 133 driver: flags: START_CORKED state: DRAINED sink: 0 volume: mono: 54395 / 83% / -4.86 dB balance 0.00 muted: no current latency: 0.00 ms requested latency: 105.00 ms sample spec: float32le 1ch 8000Hz channel map: mono Mono resample method: ffmpeg module: 12 client: 42 properties: media.name = "AirbandLeno" application.name = "rtl_airband" native-protocol.peer = "TCP/IP client from 192.168.1.96:41086" native-protocol.version = "32" application.process.id = "22992" application.process.user = "root" application.process.host = "raspberrypi" application.process.binary = "rtl_airband" application.language = "C" application.process.machine_id = "41b2082c495a49eb98535e0841ee519c" module-stream-restore.id = "sink-input-by-application-name:rtl_airband"

szpajder commented 6 years ago

Streams are carried over plain TCP connections. When you hibernate the server, the client (rtl_airband) detects stream latency increasing and eventually exceeding the threshold, which causes it to close the connection to PA server and try again. However the FIN packets closing the connection will not make it to the server, because it's hibernated, so after you wake up the server, it will still think the connection is open and PA server will attempt to continue reading from it. But it's not possible to detect TCP peer death only by reading from the socket (it looks like the client is not sending anything). TCP keepalives would have to be enabled on the connection to make this possible, however PA does not seem to utilize keepalives on TCP connections with clients. So the socket on the server will probably hang indefinitely.

I don't know if there is a good solution to this case, nor I'm even sure if the above is correct int the PA context. You may want to ask a question on PA mailing list or file them an issue - at least you will know for sure.

szpajder commented 6 years ago

Fix for the PA assertion issue released in version 3.0.0.

As far as the original issue is concerned - the example service file does not force a restart of rtl_airband on crash, however it's just a config file, so anybody is free to customize it to suit his/her needs.

Closing.

barjac commented 6 years ago

For your reference I have reported the suspend/resume issue with pulseaudio upstream here: https://bugs.freedesktop.org/show_bug.cgi?id=104994