asterisk / asterisk

The official Asterisk Project repository.
https://www.asterisk.org

[bug]: Enabling FEC for OPUS causes Asterisk to stop responding #450

Open FatimaJaffal opened 12 months ago

FatimaJaffal commented 12 months ago

Severity

Major

Versions

18.17.1, 20.5.0

Components/Modules

chan_sip,bridge,channel,ari

Operating Environment

EC2, Ubuntu 22.04.2 LTS (jammy); Asterisk built from source

Frequency of Occurrence

Occasional

Issue Description

After a few days of running Asterisk on the server, it stops responding to SIP signaling. This started happening occasionally, and we are not facing unprecedented load.

We noticed the following:

  1. Existing calls cannot continue.
  2. Threads are failing to lock the channels object.
  3. Asterisk has not crashed or exited.
  4. From the CLI we can run some commands with no problem; however, core show channel for some channel/bridge IDs causes the CLI to get stuck.
  5. The process cannot be killed without -9 (SIGKILL).
  6. Restarting the Asterisk service fixes the issue.

We generated core dumps on different servers when the issue happened:

core-asterisk-running-2023-11-05T20-11-47Z-brief.txt core-asterisk-running-2023-11-05T20-11-47Z-full.txt

core-asterisk-running-2023-11-07T12-13-34Z-full.txt core-asterisk-running-2023-11-07T12-13-34Z-brief.txt

core-asterisk-running-2023-11-13T13-51-36Z-brief.txt core-asterisk-running-2023-11-13T13-51-36Z-full.txt

We loaded the core dump (core-asterisk-running-2023-11-07T12-13-34Z) in gdb (see output log below) and noticed the following:

  1. All threads after Thread 1333 are waiting for lock 0x558fdf9256d0.
  2. This lock is owned by Thread 1333 (LWP 2410739).
  3. Thread 1333 is waiting for lock 0x7f54b8991740, which is owned by Thread 142.

System details:

We are using Asterisk with the ARI Node.js library to create and handle channels/bridges.

Asterisk config files: modules.txt ari.txt codecs.txt extensions.txt sip.txt

Call Flow:

  1. We use Janus for WebRTC.
  2. Janus sends INVITEs to Asterisk.
  3. We redirect the call to our ARI application.
  4. In the ARI app, we create a new bridge and add the channel to it.
  5. We send a request to the ARI channel#create endpoint to create a new channel and add it to the bridge.
  6. We dial the new channel in the bridge.
  7. Once the second channel is up, we create two snoop channels for it (for recordings).
  8. The call can be put on hold using the ARI hold/unhold endpoints; we have a hold bridge to which we add all on-hold channels.
  9. Another Janus peer can join the bridge or whisper using a snoop channel.
  10. We do transcoding between the channels (opus <--> ulaw,alaw).

Relevant log output

(gdb) p *((pthread_mutex_t *) 0x558fdf9256d0)
$7 = {__data = {__lock = 2, __count = 1, __owner = 2410739, __nusers = 1, __kind = 1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
  __size = "\002\000\000\000\001\000\000\000\363\310$\000\001\000\000\000\001", '\000' <repeats 22 times>, __align = 4294967298}
(gdb) p *((pthread_mutex_t *) 0x7f54b8991740)
$11 = {__data = {__lock = 2, __count = 1, __owner = 2319100, __nusers = 1, __kind = 1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
  __size = "\002\000\000\000\001\000\000\000\374b#\000\001\000\000\000\001", '\000' <repeats 22 times>, __align = 4294967298}
Thread 142 (Thread 0x7f5143e88640 (LWP 2319100)):
#0  __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x558fdf931bdc) at ./nptl/futex-internal.c:103
#1  __GI___futex_abstimed_wait64 (futex_word=futex_word@entry=0x558fdf931bdc, expected=expected@entry=3, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:128
#2  0x00007f5537a9b24f in __pthread_rwlock_wrlock_full64 (abstime=0x0, clockid=0, rwlock=0x558fdf931bd0) at ./nptl/pthread_rwlock_common.c:730
#3  ___pthread_rwlock_wrlock (rwlock=0x558fdf931bd0) at ./nptl/pthread_rwlock_wrlock.c:26
#4  0x0000558fdd9bc141 in __ast_rwlock_wrlock (filename=0x558fddb62e59 "stasis_channels.c", line=1106, func=0x558fddb63b50 <__PRETTY_FUNCTION__.22> "ast_channel_publish_snapshot", t=0x558fdf931bd0, name=0x558fddb62ec1 "channel_cache_by_name") at lock.c:998
#5  0x0000558fdd8eacf0 in __ao2_lock (user_data=0x558fdf931c28, lock_how=AO2_LOCK_REQ_WRLOCK, file=0x558fddb62e59 "stasis_channels.c", func=0x558fddb63b50 <__PRETTY_FUNCTION__.22> "ast_channel_publish_snapshot", line=1106, var=0x558fddb62ec1 "channel_cache_by_name") at astobj2.c:253
#6  0x0000558fdda63f76 in ast_channel_publish_snapshot (chan=0x7f54b8991780) at stasis_channels.c:1106
#7  0x0000558fdda63cb8 in ast_channel_stage_snapshot_done (chan=0x7f54b8991780) at stasis_channels.c:1051
#8  0x0000558fdd8fdbb5 in ast_bridge_vars_set (chan=0x7f54b8991780, name=0x0, pvtid=0x0) at bridge.c:1217
#9  0x0000558fdd9179ca in ast_bridge_channel_leave_bridge_nolock (bridge_channel=0x7f52ec03db70, new_state=BRIDGE_CHANNEL_STATE_END, cause=0) at bridge_channel.c:306
#10 0x0000558fdd917a77 in ast_bridge_channel_leave_bridge (bridge_channel=0x7f52ec03db70, new_state=BRIDGE_CHANNEL_STATE_END, cause=0) at bridge_channel.c:318
#11 0x0000558fdd918caf in ast_bridge_channel_kick (bridge_channel=0x7f52ec03db70, cause=0) at bridge_channel.c:623
#12 0x0000558fdd91e8a1 in bridge_handle_trip (bridge_channel=0x7f52ec03db70) at bridge_channel.c:2562
#13 0x0000558fdd91ef91 in bridge_channel_wait (bridge_channel=0x7f52ec03db70) at bridge_channel.c:2771
#14 0x0000558fdd91fa44 in bridge_channel_internal_join (bridge_channel=0x7f52ec03db70) at bridge_channel.c:2947
#15 0x0000558fdd8ff36b in bridge_channel_depart_thread (data=0x7f52ec03db70) at bridge.c:1720
#16 0x0000558fdda95f06 in dummy_start (data=0x7f52ec1b37c0) at utils.c:1574
#17 0x00007f5537a94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#18 0x00007f5537b26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Asterisk Issue Guidelines

FatimaJaffal commented 12 months ago

Following up: in the core-asterisk-running-2023-11-13T13-51-36Z-brief.txt dump there are a lot of threads whose stack frames appear to be corrupted:

  1. In gdb: ran thread apply all bt to get all the backtraces.
  2. Picked the backtraces of a couple of threads with corruption: #11 0x0000557d57f2fcf0 in __ao2_lock (user_data=0x0, lock_how=AO2_LOCK_REQ_MUTEX, file=0x9408ed9800007fad <error: Cannot access memory at address 0x9408ed9800007fad>, func=0x5805983800000000 <error: Cannot access memory at address 0x5805983800000000>, line=32684, var=0x0) at astobj2.c:253
  3. Found the LWP numbers for all threads with corrupted frames: cat bts | grep -B13 corrupt | grep Thread | awk '{print $6}' | sed "s/)//" | sort -n
  4. Got a list of what each thread is doing according to Asterisk, using core show threads in the CLI, and filtered it down to the threads from the previous step: for x in $(cat lwps); do grep $x threads; done

Here are the results; almost all of these threads are at exactly the same place:

0x7faf6d769640 1581969 do_monitor           started at [30114] chan_sip.c restart_monitor()
0x7fab6a5a1640 4061002 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab55c2b640 4084886 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab5a22f640 4085611 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7facb2fe2640 4085748 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab5be81640 4085887 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab4a812640 4085926 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab5f531640 4085930 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab431d6640 4085969 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7facb0ac6640 4086050 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab5c845640 4086106 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7facb1dd1640 4086369 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab683f0640 4086639 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab5e41a640 4087371 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab2b1d6640 4087659 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab5e708640 4087862 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab54870640 4087945 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7facb148a640 4087953 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab40824640 4087998 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab58c36640 4088003 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab5a2ac640 4088088 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab5a59a640 4088134 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab58e2a640 4088293 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fae5305f640 4088335 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7faf6d4f8640 4088510 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab43e88640 4088525 bridge_channel_depart_thread started at [ 1848] bridge.c bridge_impart_internal()
0x7fab57495640 4088853 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab60855640 4089200 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab6148a640 4089368 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab531bc640 4089549 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab60d37640 4090008 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab62a83640 4090085 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab6075b640 4090240 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab6a812640 4090362 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab4891e640 4090398 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab5f437640 4090470 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab6967e640 4090597 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab4aa06640 4090599 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab4b9a6640 4090708 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7facb0cba640 4090737 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab541e6640 4090833 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab6af65640 4090838 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab5db50640 4090951 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7faf6c170640 4091179 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab42e6b640 4091261 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab4288f640 4091293 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab4b7b2640 4091316 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab5f149640 4091386 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab6a795640 4091498 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab50741640 4091507 bridge_channel_depart_thread started at [ 1848] bridge.c bridge_impart_internal()
0x7fab5c93f640 4091546 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab567e3640 4091601 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab6b541640 4091602 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab6b34d640 4091654 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab55f96640 4091809 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab5b922640 4091820 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab4916b640 4091822 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab2bc17640 4091832 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab6b5be640 4091910 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab5c74b640 4092017 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab63929640 4092092 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab4935f640 4092105 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fae508ed640 4092157 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab6a69b640 4092197 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab58a42640 4092235 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7facb37b2640 4092292 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab6996c640 4092321 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab5b5b7640 4092358 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab5ae64640 4092363 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab59500640 4092445 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab6a236640 4092474 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab2b63b640 4092503 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab6bd11640 4092571 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7fab68085640 4092589 ari_channel_thread   started at [ 1929] ari/resource_channels.c ast_ari_channels_create()
0x7facb1872640 4092652 bridge_channel_depart_thread started at [ 1848] bridge.c bridge_impart_internal()

Thread 1041 (Thread 0x7facb1872640 (LWP 4092652) "asterisk"):
#0  0x00007faf71e91059 in __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x7faf71e91059 <__GI___futex_abstimed_wait_cancelable64+41>, clockid=<optimized out>, expected=0, futex_word=0x557d59e626fc) at ./nptl/futex-internal.c:83
#1  __GI___futex_abstimed_wait_cancelable64 (futex_word=0x557d59e626fc, expected=0, clockid=<optimized out>, abstime=0x7faf71e91059 <__GI___futex_abstimed_wait_cancelable64+41>, private=0) at ./nptl/futex-internal.c:139
#2  0x0000557d581a8b50 in ?? ()
#3  0x0000557d581a7ec1 in ?? ()
#4  0x0000557d59e626f0 in ?? ()
#5  0x0000557d581a8b50 in ?? ()
#6  0x00000452e009cdc8 in ?? ()
#7  0x0000557d581a7e59 in ast_stream_topology_changed_external ()
#8  0x00000002b1871be0 in ?? ()
#9  0x0000000059e5c498 in ?? ()
#10 0x00007facb1871c00 in ?? ()
#11 0x0000557d57f2fcf0 in __ao2_lock (user_data=0x0, lock_how=AO2_LOCK_REQ_MUTEX, file=0x9408ed9800007fad <error: Cannot access memory at address 0x9408ed9800007fad>, func=0x5805983800000000 <error: Cannot access memory at address 0x5805983800000000>, line=32684, var=0x0) at astobj2.c:253
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 1033 (Thread 0x7fab68085640 (LWP 4092589) "asterisk"):
#0  0x00007faf71e91059 in __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x7faf71e91059 <__GI___futex_abstimed_wait_cancelable64+41>, clockid=<optimized out>, expected=0, futex_word=0x557d59e626fc) at ./nptl/futex-internal.c:83
#1  __GI___futex_abstimed_wait_cancelable64 (futex_word=0x557d59e626fc, expected=0, clockid=<optimized out>, abstime=0x7faf71e91059 <__GI___futex_abstimed_wait_cancelable64+41>, private=0) at ./nptl/futex-internal.c:139
#2  0x0000557d581a8b50 in ?? ()
#3  0x0000557d581a7ec1 in ?? ()
#4  0x0000557d59e626f0 in ?? ()
#5  0x0000557d581a8b50 in ?? ()
#6  0x00000452e016d508 in ?? ()
#7  0x0000557d581a7e59 in ast_stream_topology_changed_external ()
#8  0x00000002680845c0 in ?? ()
#9  0x0000000059e5c498 in ?? ()
#10 0x00007fab680845e0 in ?? ()
#11 0x0000557d57f2fcf0 in __ao2_lock (user_data=0x0, lock_how=AO2_LOCK_REQ_MUTEX, file=0x9408ed9800007fad <error: Cannot access memory at address 0x9408ed9800007fad>, func=0x5805983800000000 <error: Cannot access memory at address 0x5805983800000000>, line=32684, var=0x0) at astobj2.c:253
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 1031 (Thread 0x7fab6bd11640 (LWP 4092571) "asterisk"):
#0  0x00007faf71e91059 in __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x7faf71e91059 <__GI___futex_abstimed_wait_cancelable64+41>, clockid=<optimized out>, expected=0, futex_word=0x557d59e626fc) at ./nptl/futex-internal.c:83
#1  __GI___futex_abstimed_wait_cancelable64 (futex_word=0x557d59e626fc, expected=0, clockid=<optimized out>, abstime=0x7faf71e91059 <__GI___futex_abstimed_wait_cancelable64+41>, private=0) at ./nptl/futex-internal.c:139
#2  0x0000557d581a8b50 in ?? ()
#3  0x0000557d581a7ec1 in ?? ()
#4  0x0000557d59e626f0 in ?? ()
#5  0x0000557d581a8b50 in ?? ()
#6  0x000004529800fbb8 in ?? ()
#7  0x0000557d581a7e59 in ast_stream_topology_changed_external ()
#8  0x000000026bd105c0 in ?? ()
#9  0x0000000059e5c498 in ?? ()
#10 0x00007fab6bd105e0 in ?? ()
#11 0x0000557d57f2fcf0 in __ao2_lock (user_data=0x0, lock_how=AO2_LOCK_REQ_MUTEX, file=0x9408ed9800007fad <error: Cannot access memory at address 0x9408ed9800007fad>, func=0x5805983800000000 <error: Cannot access memory at address 0x5805983800000000>, line=32684, var=0x0) at astobj2.c:253
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

FatimaJaffal commented 10 months ago

Following up with our latest findings:

We tried various steps to debug and fix the issue:

  1. Downgraded Asterisk to 18.12.1. Result: the issue persisted. Core dump files: core-asterisk-running-2023-12-26T12-15-26Z-full.txt core-asterisk-running-2023-12-26T12-15-26Z-brief.txt

  2. Reverted the OPUS configuration: removed the OPUS settings we had applied before the issue appeared (see below). Result: the issue was resolved.

OPUS configs:

codecs.conf:

[opus]
type=opus
signal=voice
fec=yes
max_bandwidth=narrow
max_playback_rate=8000
cbr=yes
bitrate=8000
complexity=4
packet_loss=0

modules.conf:

load=res_format_attr_opus
  3. Isolated FEC as a potential culprit (since it is the most complex option): applied the above OPUS configuration, but with FEC disabled. Result: the issue did not happen.

Conclusion: Enabling FEC for OPUS seems to be causing the issue.
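Based on the isolation test above, the workaround we are running with amounts to flipping only the FEC flag in codecs.conf (a sketch; it assumes the rest of the OPUS settings shown earlier stay unchanged):

```
[opus]
type=opus
fec=no            ; disable forward error correction (the isolated trigger)
; signal, max_bandwidth, max_playback_rate, cbr, bitrate, complexity,
; packet_loss unchanged from the configuration above
```

This trades the packet-loss concealment that FEC provides for stability until the root cause is found.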