EliasOenal / multimon-ng

GNU General Public License v2.0
923 stars 206 forks source link

Bug - Multimon silently killed on Debian Bullseye 64 bits after 10-15 hours #194

Closed moving-digital closed 2 years ago

moving-digital commented 2 years ago

Hi @EliasOenal

I'm facing a critical issue with multimonNG, regardless of the version/commit used.

The process will get killed silently after 10-15 hours.

At least, I'm able to confirm it's happening on Raspberry Pi model 3B with ARM chipset, on Debian Bullseye 64 bit, which I'm using.

I'm using multimon with the DTMF demod.

I would like to help as much as possible but since there is no log nor trace, I'm unable to do so, especially because I don't know what to look for/where to look at.

I hope somebody could help me troubleshoot and/or fix this issue...

weinholt commented 2 years ago

I would like to help as much as possible but since there is no log nor trace, I'm unable to do so, especially because I don't know what to look for/where to look at.

You can use gdb to debug this issue. You need gdb and debug symbols for multimon-ng. Prepare your system like this:

echo deb https://deb.debian.org/debian-debug/ bullseye-debug main >> /etc/apt/sources.list
apt update
apt install gdb multimon-ng-dbgsym

The next step is to start multimon-ng the same way you normally do.

Once multimon-ng is running you can then start gdb and attach it to the running multimon-ng process: gdb /usr/bin/multimon-ng $(pidof multimon-ng)

Immediately type c and enter, to continue running multimon-ng.

When multimon-ng crashes or gets killed there will be some output from gdb explaining the current state of the process. If there is still a process then run bt to get a backtrace. Please paste the output in this issue and we'll try to figure out where to go from there.

(I haven't verified that these instructions work, there might be some typos or other reasons it doesn't work as written).

moving-digital commented 2 years ago

Thanks, @weinholt

I've done the above steps, we now need to wait 10-15 hours.

However, I found out that as soon as I launch gdb, all processes involved before reaching multimon are getting "frozen".

As soon as I quit gdb, it will load all the data that didn't get displayed from the moment gdb was launched.

Is that expected ?

weinholt commented 2 years ago

However, I found out that as soon as I launch gdb, all processes involved before reaching multimon are getting "frozen".

I would expect it to unfreeze when you run the c (or cont) command in gdb.

moving-digital commented 2 years ago

Done, it's unfrozen now !

So in the end, I had to type "c" twice:

Wait and see... Thank you so much!

moving-digital commented 2 years ago

Hi @weinholt ,

Alright, it just broke after exactly 12 hours, but I may need your help to analyse further. Let me explain myself.

I'm using FFMPEG to extract audio in wav format, and pass it to Multimon using a named pipe (FIFO). The flow is as follows: Live audio feed (AAC/M1L2) --> FFMPEG --> Multimon DTMF

The error messages I got are as follows:

On gdb:

(gdb) c
Continuing.
(gdb) [Inferior 1 (process 4495) exited normally]
(gdb)

On Multimon: exited "normally", i.e. no error message The command line I'm using is: multimon-ng -q -a DTMF -t wav dtmf_fifo_wav.wav

On FFMPEG:

udp://127.0.0.1:5001?overrun_nonfatal=1&fifo_size=2500000: corrupt input packet in stream 3
[mp2 @ 0x3dbc3310] Header missing
Error while decoding stream #0:3: Invalid data found when processing input
av_interleaved_write_frame(): Broken pipe 768.0kbits/s speed=   1x
Error writing trailer of dtmf_fifo_wav.wav: Broken pipe
size= 4194315kB time=12:25:39.43 bitrate= 768.0kbits/s speed=   1x
video:0kB audio:4194317kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
Error closing file dtmf_fifo_wav.wav: Broken pipe
Conversion failed!

So now, I'm basically unsure were the issue comes from... Is it FFMPEG or Multimon?

Reading this link on stackexchange says that "a broken pipe happens when the reading process exits before the writing process", but are we talking about the reading process of FFMPEG or Multimon? https://unix.stackexchange.com/questions/84813/what-makes-a-unix-process-die-with-broken-pipe

weinholt commented 2 years ago

but are we talking about the reading process of FFMPEG or Multimon?

When multimon-ng reads a WAV file it also starts SoX, so SoX could also be involved. The ffmpeg output certainly points to ffmpeg doing something to trigger the problem. Multimon exited normally, probably because its input from SoX was closed, but then ffmpeg still wanted to use the file descriptor for the pipe to SoX, assuming it was open.

You can use strace to debug this further:

  1. Install strace with apt install strace.
  2. Start your decoding pipeline as usual.
  3. In one terminal write strace -f -p $(pidof ffmpeg).
  4. In another terminal write strace -f -p $(pidof multimon-ng).
  5. Run ps aux to find the pid of the sox process that multimon-ng spawned. It will likely be on the line following the multimon-ng process.
  6. In yet another terminal, run strace -f -p PID, but replace PID with the pid for multimon-ng's sox from the previous step.

Then please provide us with the last ~100 lines from each of the outputs, or whatever seems relevant. You could also capture the whole outputs by adding 2>&1 | tee trace-ffpeg.txt, etc, on the end of each commands, but the files created will probably be quite large.

Strace will give us a look at all the syscalls performed by ffmpeg, multimon-ng and SoX. There is no other way that these processes communicate, so we should be able to see why the pipe gets closed.

moving-digital commented 2 years ago

Thank you so much, @weinholt !

Started everything again as requested here above.

I will wait for the next crash occurrence, and provide you with all relevant info from the 3 different processes.

moving-digital commented 2 years ago

Hello @weinholt , @EliasOenal ,

As expected, it crashed again after 12 hours, and strace has spoken!

let me start by sharing the PIDs: ~ $ pidof multimon-ng: 6002 ~ $ pidof ffmpeg: 6000 ~ $ pidof sox: 6003

multimon-ng logs are:

sox WARN rate: rate clipped 10 samples; decrease volume?
sox WARN dither: dither clipped 9 samples; decrease volume?

ffmpeg's strace logs are:

[pid  6000] close(5)                    = 0
[pid  6000] write(2, "\33[1;31mError closing file dtmf_f"..., 61) = 61
[pid  6000] futex(0x943f334, FUTEX_WAKE_PRIVATE, 1) = 1
[pid  6107] <... futex resumed>)        = 0
[pid  6000] futex(0x943f3ac, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  6107] futex(0x943f2d8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  6108] <... futex resumed>)        = 0
[pid  6000] <... futex resumed>)        = 1
[pid  6108] futex(0x943f350, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  6107] <... futex resumed>)        = 0
[pid  6000] futex(0x943f350, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  6108] <... futex resumed>)        = -1 EAGAIN (Resource temporarily unavailable)
[pid  6107] madvise(0x7f990ee000, 8253440, MADV_DONTNEED <unfinished ...>
[pid  6000] <... futex resumed>)        = 0
[pid  6108] futex(0x943f350, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  6107] <... madvise resumed>)      = 0
[pid  6000] futex(0x943f424, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  6109] <... futex resumed>)        = 0
[pid  6108] <... futex resumed>)        = 0
[pid  6107] exit(0 <unfinished ...>
[pid  6000] <... futex resumed>)        = 1
[pid  6109] futex(0x943f3c8, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  6108] madvise(0x7f988ed000, 8253440, MADV_DONTNEED <unfinished ...>
[pid  6107] <... exit resumed>)         = ?
[pid  6000] futex(0x943f3c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  6109] <... futex resumed>)        = -1 EAGAIN (Resource temporarily unavailable)
[pid  6108] <... madvise resumed>)      = 0
[pid  6000] <... futex resumed>)        = 0
[pid  6109] futex(0x943f3c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  6108] exit(0 <unfinished ...>
[pid  6107] +++ exited with 0 +++
[pid  6000] futex(0x943f49c, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  6109] <... futex resumed>)        = 0
[pid  6110] <... futex resumed>)        = 0
[pid  6108] <... exit resumed>)         = ?
[pid  6000] <... futex resumed>)        = 1
[pid  6110] futex(0x943f440, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  6109] madvise(0x7f980ec000, 8253440, MADV_DONTNEED <unfinished ...>
[pid  6108] +++ exited with 0 +++
[pid  6000] futex(0x943f440, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  6110] <... futex resumed>)        = -1 EAGAIN (Resource temporarily unavailable)
[pid  6109] <... madvise resumed>)      = 0
[pid  6001] <... recvfrom resumed>"G\2\6\21\23\311A\203\232!\223J\3)\330`J\311!>&\223\n_\27\6#\355\332\212!,"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(45404), sin_addr=inet            _addr("127.0.0.1")}, [128->16]) = 1316
[pid  6000] <... futex resumed>)        = 0
[pid  6110] futex(0x943f440, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  6109] exit(0 <unfinished ...>
[pid  6001] recvfrom(4,  <unfinished ...>
[pid  6000] futex(0x7f988ec050, FUTEX_WAIT, 6109, NULL <unfinished ...>
[pid  6110] <... futex resumed>)        = 0
[pid  6109] <... exit resumed>)         = ?
[pid  6001] <... recvfrom resumed>"G\2\6\30\350\247\221\"2\342\215\fA\323\3252\21\245\215\313\273\270\226!0\351:K<.\342\177"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(45404)            , sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 1316
[pid  6110] madvise(0x7f978eb000, 8253440, MADV_DONTNEED <unfinished ...>
[pid  6109] +++ exited with 0 +++
[pid  6001] recvfrom(4,  <unfinished ...>
[pid  6000] <... futex resumed>)        = 0
[pid  6110] <... madvise resumed>)      = 0
[pid  6001] <... recvfrom resumed>"G\2\6\37\314\f\240\3!'\0\270\201WOf\304\356\256h\0\356)?@\36\200\376\f,\17\342"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(45404), sin_addr            =inet_addr("127.0.0.1")}, [128->16]) = 1316
[pid  6000] futex(0x7f980eb050, FUTEX_WAIT, 6110, NULL <unfinished ...>
[pid  6110] exit(0 <unfinished ...>
[pid  6001] recvfrom(4,  <unfinished ...>
[pid  6110] <... exit resumed>)         = ?
[pid  6001] <... recvfrom resumed>"G\2\6\25\201\347\302\321\305\17\255\2B\215#L\2\342x\7V\1\333\362$\236E0\367\304\345\233"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(45404),             sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 1316
[pid  6110] +++ exited with 0 +++
[pid  6001] recvfrom(4,  <unfinished ...>
[pid  6000] <... futex resumed>)        = 0
[pid  6001] <... recvfrom resumed>"G\2\6\34\376\v\24ku\231\20\30<,:#\16\1\361\243\35z\244\10S\34\2028\305\260\21\207"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(45404), sin_a            ddr=inet_addr("127.0.0.1")}, [128->16]) = 1316
[pid  6000] munmap(0x7f998ef000, 208896 <unfinished ...>
[pid  6001] recvfrom(4,  <unfinished ...>
[pid  6000] <... munmap resumed>)       = 0
[pid  6001] <... recvfrom resumed>"G\0f\35\r\234;6\366mx\207\210kZ\331Y\225\225\251\253S_w\336\325\245\255\316\315\300\205"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(45404),             sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 940
[pid  6000] munmap(0x7f99922000, 208896 <unfinished ...>
[pid  6001] recvfrom(4,  <unfinished ...>
[pid  6000] <... munmap resumed>)       = 0
[pid  6000] getpid()                    = 6000
[pid  6000] tgkill(6000, 6001, SIGRTMIN) = 0
[pid  6001] <... recvfrom resumed>0x93e2958, 65536, 0, 0x7f9a7fd6f0, [128]) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[pid  6000] futex(0x7f9a7fe050, FUTEX_WAIT, 6001, NULL <unfinished ...>
[pid  6001] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=6000, si_uid=1000} ---
[pid  6001] getpid()                    = 6000
[pid  6001] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f8f8eb000
[pid  6001] munmap(0x7f8f8eb000, 7426048) = 0
[pid  6001] munmap(0x7f94000000, 59682816) = 0
[pid  6001] mprotect(0x7f90000000, 135168, PROT_READ|PROT_WRITE) = 0
[pid  6001] mprotect(0x7f90021000, 335872, PROT_READ|PROT_WRITE) = 0
[pid  6001] mprotect(0x7f90073000, 462848, PROT_READ|PROT_WRITE) = 0
[pid  6001] futex(0x3a214f4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
[pid  6001] madvise(0x7f99ffe000, 8253440, MADV_DONTNEED) = 0
[pid  6001] exit(0)                     = ?
[pid  6000] <... futex resumed>)        = 0
[pid  6001] +++ exited with 0 +++
munmap(0x7f990ee000, 8392704)           = 0
close(4)                                = 0
munmap(0x7f9a7ff000, 470003712)         = 0
write(2, "Conversion failed!\n", 19)    = 19
ioctl(0, TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(0, TCGETS, {B38400 opost isig icanon echo ...}) = 0
close(3)                                = 0
exit_group(1)                           = ?
+++ exited with 1 +++

multimon-ng's strace logs are:

read(3, "\316\1\314\2a\2\t\1\27\0\f\1\353\4\331\3\322\2\327\4\314\3\336\5\360\10\372\n\35\v\326\10"..., 16384) = 4096
read(3, "\24\376w\376\336\0<\2\250\1<\4\363\4g\5\231\7\250\7\320\0103\10q\6\377\5C\4A\3"..., 16384) = 4096
read(3, "m\376e\374\372\370\t\370\v\367}\371\252\370\27\375\0\377\35\1\32\0`\377\36\377\210\376\335\377=\0"..., 16384) = 4096
read(3, "\325\0\305\0[\0\273\0\261\0\211\0\274\0\233\0\222\0\202\0b\0L\0O\0\27\0\17\0\v\0"..., 16384) = 4096
read(3, "/\367\16\6R\5\0\373W\373\221\0042\1q\375d\375\255\5\277\0h\373\3\4\20\377h\1\341\373"..., 16384) = 4096
read(3, "\263\375\341\3759\376\177\376\232\376\256\376\217\376\300\376\350\376\6\377\f\377\322\376\325\376\332\376\0\377\32\377"..., 16384) = 4096
read(3, "G\375\243\375(\376\262\3768\377\343\377\214\0E\1\330\1B\2s\2\224\2\217\2l\2L\2\27\2"..., 16384) = 4096
read(3, "\344\0*\1\302\1\214\2\250\0[\0\207\377\302\1\202\1l\1\335\0\220\0\22\376\30\376\302\376K\377"..., 16384) = 4096
read(3, "\3\0\n\0\3\0\371\377\362\377\360\377\355\377\362\377\6\0\23\0\n\0\23\0(\0'\0(\0\35\0"..., 16384) = 4096
read(3, "\5\0\331\377\301\377\327\377\210\377\320\377z\377\372\377\225\377\256\3776\0H\377\251\0\177\377h\0P\0"..., 16384) = 4096
read(3, "\243\7O\10\237\374\"\2{\372\4\2\266\375\313\3\244\376Z\377\21\1S\376\266\3\320\373y\1M\3"..., 16384) = 4096
read(3, "\6\0\364\0\266\0\332\377C\0u\0\34\0D\0T\0\321\377\0\377h\0\226\376P\377\272\377\271\377"..., 16384) = 4096
read(3, "t\373$\370\205\365\322\362N\361\31\360\311\356\366\355I\355\313\356\272\357K\362\35\363s\365\n\367\17\371"..., 16384) = 4096
read(3, "\307\374U\372\333\372\215\373N\374\356\372\262\370U\3676\370@\373\215\376C\0\320\376J\373=\370\36\367"..., 16384) = 4096
read(3, "\336\373j\374\306\374\254\374\6\374\326\373\257\3746\376\344\377=\1\250\1\250\0g\377\t\377 \377\347\377"..., 16384) = 4096
read(3, "*\370\351\371C\373\211\373y\373r\373g\374\35\375\357\375+\375\247\373\310\371^\371\337\3728\376\377\1"..., 16384) = 4096
read(3, "\37\10}\6\\\5Q\5\311\5)\10 \t`\t\365\t\\\7E\7\203\10\360\n\344\t{\5o\6"..., 16384) = 4096
read(3, "\330\371\307\373\342\6W\0\350\365\252\5y\377\t\370\336\0007\2\304\2\6\373\207\371\370\n\333\374\6\363"..., 16384) = 4096
read(3, "\300\10\222\7\332\4-\0050\5\205\4\16\3\r\1H\377\202\374\361\374\352\373z\3735\373\235\372`\372"..., 16384) = 4096
read(3, "\244\374\207\376\317\375n\3727\375M\373\242\376\366\374\230\377\370\0&\377c\377\3\0003\0017\0\377\1"..., 16384) = 4096
read(3, "\v\372G\373\365\375I\375\331\376\4\376t\373r\374\222\366x\373z\371\261\3671\372\203\364\270\370\202\363"..., 16384) = 4096
read(3, "\1\373\302\3721\373E\374v\374I\375E\375\377\375|\376\306\376\n\0D\0\206\0:\1\270\1i\1"..., 16384) = 4096
read(3, "\323\373B\375\311\374\273\373|\373\301\373;\373\311\372\336\373\247\374r\374\6\375`\377\310\377%\376\347\375"..., 16384) = 4096
read(3, "\245\377I\377!\377\33\377_\377D\377Q\377P\377_\3773\377\36\377\321\376e\376\334\375\253\375\177\375"..., 16384) = 2456
read(3, "", 16384)                      = 0
close(3)                                = 0
wait4(6003, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 6003
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=6003, si_uid=1000, si_status=0, si_utime=28730, si_stime=12934} ---
exit_group(0)                           = ?
+++ exited with 0 +++

sox's strace logs are:

write(1, "*\370\351\371C\373\211\373y\373r\373g\374\35\375\357\375+\375\247\373\310\371^\371\337\3728\376\377\1"..., 4096) = 4096
write(1, "\37\10}\6\\\5Q\5\311\5)\10 \t`\t\365\t\\\7E\7\203\10\360\n\344\t{\5o\6"..., 4096) = 4096
write(1, "\330\371\307\373\342\6W\0\350\365\252\5y\377\t\370\336\0007\2\304\2\6\373\207\371\370\n\333\374\6\363"..., 4096) = 4096
write(1, "\300\10\222\7\332\4-\0050\5\205\4\16\3\r\1H\377\202\374\361\374\352\373z\3735\373\235\372`\372"..., 4096) = 4096
read(3, "H\10\200\10\201\7b\7v\10\322\6\35\4l\4\313\3\24\1-\1$\2\372\0u\0f\0\306\377"..., 12288) = 4608
read(3, "-\377\304\376h\376,\376\307\375|\375X\375\364\374\227\374\270\374\315\374p\374o\374 \375\242\375k\375"..., 8192) = 2304
read(3, "\252\25'\25\355\24\215\23\262\20\357\r\235\vb\t\262\7\237\6\0\6\335\5\251\5\213\5O\6\36\7"..., 8192) = 2304
read(3, "\325\1V\1\34\18\1\335\0J\0\314\377{\377\260\377\2\0,\0s\0f\0\31\0r\0\2\1"..., 4096) = 2304
read(3, "\336\377\341\377\334\377/\0E\0>\0u\0-\0\247\377^\0\241\1_\1\t\0\224\377-\0\303\0"..., 4096) = 2304
read(3, "a\1G\1\260\0v\0%\0U\377p\376\17\376\325\376\22\0\204\0Q\0\37\0>\0\27\0N\377"..., 4096) = 2304
read(3, "\204\1\315\1\334\1\320\1\201\1\231\2\264\4\304\4(\3(\2\247\0o\376j\376q\377&\376/\375"..., 12288) = 2304
read(3, "\277\0\211\1\325\1\310\1\203\1\21\1o\0\265\377\37\377\340\376\21\377V\377\206\377\7\0\202\0}\0"..., 12288) = 2304
read(3, "'\f\362\vg\v\225\n\371\t\374\tJ\nu\n\310\n\177\v(\f^\f;\f\f\f\343\v[\v"..., 8192) = 2304
read(3, "\5\5\"\5\31\5\v\5\16\5\344\4\225\4K\4\34\4\36\4\23\4\305\3\217\3\265\3\21\4R\4"..., 8192) = 2304
read(3, "\37\374L\374U\374M\374V\374f\374u\374\204\374{\374K\374\27\374\373\373\10\374B\374\210\374\247\374"..., 4096) = 2304
read(3, "}\375\272\3761\377.\376\245\374\274\374~\376\321\376\227\375\220\375\v\376\276\375\335\375+\376\343\375\7\376"..., 4096) = 2304
read(3, "\306\1l\1\377\0s\0\341\377t\3776\377+\377>\377L\377W\377R\3775\377#\377\34\377\24\377"..., 4096) = 2304
read(3, "\332\374\277\374\305\374\271\374\367\374\211\375\241\375'\375\v\375\216\375\363\375\354\375\345\375\26\3761\376\370\375"..., 4096) = 2304
write(1, "\244\374\207\376\317\375n\3727\375M\373\242\376\366\374\230\377\370\0&\377c\377\3\0003\0017\0\377\1"..., 4096) = 4096
write(1, "\v\372G\373\365\375I\375\331\376\4\376t\373r\374\222\366x\373z\371\261\3671\372\203\364\270\370\202\363"..., 4096) = 4096
write(1, "\1\373\302\3721\373E\374v\374I\375E\375\377\375|\376\306\376\n\0D\0\206\0:\1\270\1i\1"..., 4096) = 4096
write(1, "\323\373B\375\311\374\273\373|\373\301\373;\373\311\372\336\373\247\374r\374\6\375`\377\310\377%\376\347\375"..., 4096) = 4096
brk(0x5580c3f000)                       = 0x5580c3f000
write(2, "sox WARN rate: ", 15)         = 15
write(2, "rate clipped 10 samples; decreas"..., 41) = 41
write(2, "\n", 1)                       = 1
write(2, "sox WARN dither: ", 17)       = 17
write(2, "dither clipped 9 samples; decrea"..., 42) = 42
write(2, "\n", 1)                       = 1
brk(0x5580c0b000)                       = 0x5580c0b000
close(3)                                = 0
write(1, "\245\377I\377!\377\33\377_\377D\377Q\377P\377_\3773\377\36\377\321\376e\376\334\375\253\375\177\375"..., 2456) = 2456
exit_group(0)                           = ?
+++ exited with 0 +++
EliasOenal commented 2 years ago

@moving-digital this is an interesting issue for sure. Are you using UDP to read the data into FFMPEG? I would try to swap UDP for TCP and test if that resolves it. I understand that this is local communication, but the kernel still doesn't guarantee delivery and may silently drop any of the packets. And given that your data source is AAC (compressed), FFMPEG likely can't recover gracefully from dropped data, thus aborting. To verify this theory you could also remove multimon-ng from your pipeline (drop the wav output of FFMPEG >/dev/null) and see if just transcoding with FFMPEG is stable.

moving-digital commented 2 years ago

Hi @EliasOenal

My real use-case is actually a live audio stream using multicast (UDP/RTP), so I can't change it to TCP for that reason as it would not reproduce the real scenario...

With ffmpeg, I'm using the "nonfatal" and "fifosize" as input parameters, and if I don't run multimon-ng and that the ffmpeg output is another multicast instead of a wav file going to a named pipe, there is no issue at all for 48h, 72h... etc.

Would you still want me to do the ffmpeg test by dropping the file into > /dev/null? I can do that for you.

PS: My audio input sampling rate is 48Khz, which I just pass-through FFMPEG. No resampling is performed.

EliasOenal commented 2 years ago
With ffmpeg, I'm using the "nonfatal" and "fifosize" as input parameters, and if I don't run multimon-ng and that the ffmpeg output is another multicast instead of a wav file going to a named pipe, there is no issue at all for 48h, 72h... etc.

Would you still want me to do the ffmpeg test by dropping the file into dev/null?

To test if the theory of dropped UDP packets causing FFMPEG to abort is valid, you'd need to make sure that FFMPEG is actually transcoding (decoding) the data. Just forwarding UDP packets via multicast without decoding them would not notice any data corruption in the AAC stream.

Thus I'd still suggest to run this without multimon-ng. It may also not be a bad idea to actually save the wav file to disk. That way you can later also check whether the ~15h wav file generated by FFMPEG fed into multimon-ng causes any issues. And if it does, it would allow isolating the part causing the issue.

Lastly you could further also dump ~15h of your UDP stream to a file and then try this file with your pipe. Doing so would then allow reproducing the issue much faster without waiting in real time.

moving-digital commented 2 years ago

Hi @EliasOenal

Noted, thanks. By the way, something that I forgot to share with you but which is pretty important : right now, the live audio stream is "simulated" with a 4min long file, that is infinitely looping.

So, it makes the issue even more weird, as it doesn't seem related to the input, since this clip is looping around 180 times (!), 12 hours * 60 min / 4 min.

Nevermind, I will find a way to record the FFMPEG output when the issue is happening, and keep you guys posted.

weinholt commented 2 years ago

Some more of FFmpeg's strace logs would be needed to see what FFmpeg does to the pipe, things have already gone bonkers at the start of that trace. But we should explore another part of the system for now.

The suggestion from @EliasOenal is good, it will tell us if FFmpeg on its own would go wrong in this situation.

Now for my analysis of the traces and my suggestion for further debugging.

It's clear that multimon-ng's data from sox suddenly stops and that's why it exits. The result 0 from read(3, "", 16384) = 0 means it received an end-of-file. The last piece of data it receives is the last data that sox wrote (the payloads in the last read() and write() calls match up).

The odd thing is that there's nothing in sox's trace that indicates that the data stream from FFmpeg stopped. Apparently sox just decides it has had enough and shuts down. This suggests to me that sox "knows" that the data has ended there. We'd need to see the WAV file to determine why it did that.

To help us debug this further, please run the pipeline, but replace multimon-ng with socat:

(I'm not entirely sure that the PIPE: specification is right, but we'll see what happens). We want to see the data as it appears when sox reads it. It's therefore important to keep using a named pipe because otherwise FFmpeg might reopen/seek the WAV file to write the final length.

Please then share the file dtmf_fifo_debug.wav with us. I'm most interested in the beginning of the file, so you don't need to pipe several hours of data. Just a few seconds should be enough. I think you can attach the file in a comment here.

moving-digital commented 2 years ago

Hello @weinholt ,

Thank you so much for your guidance once again! Attached ~5-6min capture, which is more than the audio file duration (~4-5min).

Don't hesitate to let me know if you need anything else, I could even provide you with a direct remote SSH access to my machine.

dtmf_fifo_debug.zip

weinholt commented 2 years ago

Ok, we might have found the source of the problem. I've prepared a patch for you to try.

Here's what I see in the wav file:

$ file dtmf_fifo_debug.wav 
dtmf_fifo_debug.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 44100 Hz
$ hexdump -C dtmf_fifo_debug.wav |head -n5
00000000  52 49 46 46 ff ff ff ff  57 41 56 45 66 6d 74 20  |RIFF....WAVEfmt |
00000010  10 00 00 00 01 00 01 00  44 ac 00 00 88 58 01 00  |........D....X..|
00000020  02 00 10 00 4c 49 53 54  1a 00 00 00 49 4e 46 4f  |....LIST....INFO|
00000030  49 53 46 54 0e 00 00 00  4c 61 76 66 35 39 2e 31  |ISFT....Lavf59.1|
00000040  37 2e 31 30 32 00 64 61  74 61 ff ff ff ff 00 00  |7.102.data......|

Note the bytes 64 61 74 61 ff ff ff ff. These bytes mean that there's a data chunk that is 0xFFFFFFFF bytes long. How much time is represented by that many bytes?

$ units
Currency exchange rates from FloatRates (USD base) on 2022-07-20 
3675 units, 109 prefixes, 114 nonlinear units

You have: (0xFFFFFFFF byte)/(16 bit * 44100 Hz)
You want: hours
    * 13.526604
    / 0.073928386

So this means that the WAV file is telling sox that the data is ~13.5 hours long. Does your pipeline break down after that long?

If you are comfortable building multimon-ng with a patch then we can verify that this is the source of the problem. Here is the patch: https://github.com/weinholt/multimon-ng/commit/5e080770c0360892b59325eb80c124d53e481d1a (completely untested by myself, btw)

If this patch makes the problem go away then we have some ways we can proceed:

  1. Apply the patch as-is, i.e. always pass --ignore-length. I don't know when this flag appeared in sox, and maybe it will break someone's workflow if it is always passed? But it could also be fine.
  2. Add a flag to multimon-ng to activate the --ignore-length flag. The drawback is that users need to be aware of this dynamic between FFmpeg and multimon-ng, and they need to pass the flag.
  3. Figure out some way to automatically add the flag. Maybe check if the file is a pipe and then always pass the flag?
  4. Fix FFmpeg so that it uses the magic value for unspecified length instead of 0xFFFFFFFF. See below.

Sox recognizes a special length value that FFmpeg could use for the initial WAV header. This is from sox/src/wav.c:

/* Magic length writen when its not possible to write valid lengths.
 * This can be either because of non-seekable output or because
 * the length can not be represented by the 32-bits used in WAV files.
 * When magic length is detected on inputs, disable any length
 * logic.
 */
#define MS_UNSPEC 0x7ffff000

If FFmpeg had used that length for the data chunk then I believe you wouldn't have run into this problem.

moving-digital commented 2 years ago

Hi @weinholt ,

So this means that the WAV file is telling sox that the data is ~13.5 hours long. Does your pipeline break down after that long?

Yes, always around 10-15 hours, and I noticed mostly around 12-14 hours (I never checked to the minute), so that's pretty much inline !

Let me try the patch now, and run it for the next 14 hours :-) Keep you posted!

moving-digital commented 2 years ago

Hi @weinholt ,

You're a genius: your patch is working like a charm! It's been 14 hours now, and still running 👍

Do you need my help for anything else, or you and @EliasOenal will discuss together to find the best implementation first?

weinholt commented 2 years ago

Do you need my help for anything else, or you and @EliasOenal will discuss together to find the best implementation first?

I think we can figure it out. Thanks for testing the patch! Btw, there's an FFmpeg flag that you should probably use, if you're not already. Place this before the input url/filename: -err_detect ignore_err.

@EliasOenal What do you think should be done? I looked and the --ignore-length flag has been in sox since 2008, so it should be supported everywhere.

I also think FFmpeg should be fixed to use MS_UNSPEC, but that's another story.

moving-digital commented 2 years ago

Hi @weinholt ,

Thanks for testing the patch!

You're most welcome, thank you!

Btw, there's an FFmpeg flag that you should probably use, if you're not already. Place this before the input url/filename: -err_detect ignore_err.

Good guess, I wasn't... Thanks a lot for the suggestion, just added it in my script. I've just launched another burn-in test to double confirm it's still all good after 14 hours, just to reassure myself :)

EliasOenal commented 2 years ago

@weinholt sure, "--ignore-length" sounds like a perfectly fine default. I can't think of a situation where this would cause unwanted behaviour. Please open a pull request. Also good job on identifying the issue. 👍