asterisk / asterisk

The official Asterisk Project repository.
https://www.asterisk.org
Other
1.99k stars 929 forks source link

[bug]: Asterisk lockup when using realtime dialplan or any other non-core pbx engine #668

Open creslin287 opened 2 months ago

creslin287 commented 2 months ago

Severity

Major

Versions

18.22.0, 20.7.0, 21.2.0

Components/Modules

pbx.c

Operating Environment

Requires use of realtime dialplan. Was discovered with a dialplan that used an include that contained a switch to a realtime backend.

Frequency of Occurrence

Occasional

Issue Description

A complete lockup can occur when using a realtime dialplan backend. When a realtime dialplan does a lookup to an included context with an external pbx switch (and possible in general when referencing a any realtime backed context) at the same time as a CNG tone is detected on a channel within an autoservice thread, Asterisk can completely lock up. I just got done going through a core dump with 400 threads ultimately blocked due to not being able to get access to conlock in main/pbx.c.

Here are the two most relevant thread stacks from a core dump of a lockup that demonstrates the issue:

Thread 64 (Thread 0x7f523e011640 (LWP 931102)):

0 futex_wait (private=0, expected=2, futex_word=0x55fc844c5f20 ) at ../sysdeps/nptl/futex-internal.h:146

1 __GI___lll_lock_wait (futex=futex@entry=0x55fc844c5f20 , private=0) at ./nptl/lowlevellock.c:49

2 0x00007f5303cf205d in lll_mutex_lock_optimized (mutex=0x55fc844c5f20 ) at ./nptl/pthread_mutex_lock.c:48

3 ___pthread_mutex_lock (mutex=mutex@entry=0x55fc844c5f20 ) at ./nptl/pthread_mutex_lock.c:128

4 0x000055fc842a26fc in ast_pthread_mutex_lock (filename=filename@entry=0x55fc844147d4 "pbx.c", lineno=lineno@entry=8516, func=func@entry=0x55fc84417e10 <__PRETTY_FUNCTION.22> "ast_rdlock_contexts", mutex_name=mutex_name@entry=0x55fc844148f9 "&conlock", t=t@entry=0x55fc844c5f20 ) at lock.c:347

5 0x000055fc842c796b in ast_rdlock_contexts () at pbx.c:8516

6 pbx_extension_helper (c=c@entry=0x7f525448ae80, context=context@entry=0x7f525448b840 "ivr_8015", exten=exten@entry=0x7f52a0ae04c5 "fax", priority=priority@entry=1, label=label@entry=0x0, callerid=callerid@entry=0x7f525428c7d0 "RT_XXXXX", action=E_MATCH, found=0x0, combined_find_spawn=0, con=0x0) at pbx.c:2887

7 0x000055fc842c95a9 in ast_exists_extension (c=c@entry=0x7f525448ae80, context=context@entry=0x7f525448b840 "ivr_8015", exten=exten@entry=0x7f52a0ae04c5 "fax", priority=priority@entry=1, callerid=callerid@entry=0x7f525428c7d0 "RT_XXXXX") at pbx.c:4178

8 0x00007f52a0ad81c1 in chan_pjsip_cng_tone_detected (f=, session=, ast=0x7f525448ae80) at chan_pjsip.c:799

9 chan_pjsip_read_stream (ast=0x7f525448ae80) at chan_pjsip.c:935

10 0x000055fc84245e17 in __ast_read (chan=chan@entry=0x7f525448ae80, dropaudio=dropaudio@entry=0, dropnondefault=dropnondefault@entry=1) at channel.c:3751

11 0x000055fc84247870 in ast_read (chan=chan@entry=0x7f525448ae80) at channel.c:4294

12 0x000055fc841fbb39 in autoservice_run (ign=ign@entry=0x0) at autoservice.c:135

13 0x000055fc8435e636 in dummy_start (data=) at utils.c:1607

14 0x00007f5303ceeac3 in start_thread (arg=) at ./nptl/pthread_create.c:442

15 0x00007f5303d80850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

and

Thread 368 (Thread 0x7f50ec46d640 (LWP 1228733)):

0 0x00007f5303d3f7f8 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7f50ec469250, rem=rem@entry=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78

1 0x00007f5303d44677 in __GI___nanosleep (req=req@entry=0x7f50ec469250, rem=rem@entry=0x0) at ../sysdeps/unix/sysv/linux/nanosleep.c:25

2 0x00007f5303d75f2f in usleep (useconds=useconds@entry=1000) at ../sysdeps/posix/usleep.c:31

3 0x000055fc841fc36a in ast_autoservice_stop (chan=chan@entry=0x7f525448ae80) at autoservice.c:314

4 0x000055fc842c6076 in pbx_find_extension (chan=chan@entry=0x7f525448ae80, bypass=bypass@entry=0x0, q=q@entry=0x7f50ec469b30, context=0x7f529049e8c0 "dbndirs", exten=exten@entry=0x7f525448b890 "s", priority=priority@entry=37, label=, callerid=, action=) at pbx.c:2752

5 0x000055fc842c6474 in pbx_find_extension (chan=chan@entry=0x7f525448ae80, bypass=bypass@entry=0x0, q=q@entry=0x7f50ec469b30, context=, exten=exten@entry=0x7f525448b890 "s", priority=priority@entry=37, label=, callerid=, action=) at pbx.c:2771

6 0x000055fc842c6474 in pbx_find_extension (chan=chan@entry=0x7f525448ae80, bypass=bypass@entry=0x0, q=q@entry=0x7f50ec469b30, context=context@entry=0x7f525448b840 "ivr_8015", exten=exten@entry=0x7f525448b890 "s", priority=priority@entry=37, label=, callerid=, action=) at pbx.c:2771

7 0x000055fc842c79bc in pbx_extension_helper (c=c@entry=0x7f525448ae80, context=0x7f525448b840 "ivr_8015", exten=exten@entry=0x7f525448b890 "s", priority=priority@entry=37, label=label@entry=0x0, callerid=callerid@entry=0x7f525428c7d0 "RT_XXXXXX", action=E_SPAWN, found=0x7f50ec46cbf4, combined_find_spawn=1, con=0x0) at pbx.c:2896

8 0x000055fc842c9c19 in ast_spawn_extension (combined_find_spawn=1, found=, callerid=0x7f525428c7d0 "RT_16109925000", priority=37, exten=0x7f525448b890 "s", context=, c=0x7f525448ae80) at pbx.c:4203

9 __ast_pbx_run (c=c@entry=0x7f525448ae80, args=args@entry=0x0) at pbx.c:4377

10 0x000055fc842cb675 in pbx_thread (data=data@entry=0x7f525448ae80) at pbx.c:4701

11 0x000055fc8435e636 in dummy_start (data=) at utils.c:1607

12 0x00007f5303ceeac3 in start_thread (arg=) at ./nptl/pthread_create.c:442

13 0x00007f5303d80850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 368 owns the conlock in pbx.c, grabbed by pbx_extension_helper(), which then calls pbx_find_extension() through various includes, eventually hitting one with a switch statement to a realtime backend. Prior to querying the backend, it does an ast_autoservice_start(), and after querying it does an ast_autoservice_stop().

ast_autoservice_stop() is supposed to return once the channel is completely removed from the autoservice thread. Ordinarily, this is fine, but if the channel within the autoservice thread (Thread 64) detects a fax when doing an ast_read(), it attempts to do an ast_exists_extension() to check for a "fax" extension, and this lookup attempts to grab the &conlock as well, which is held by Thread 368. This causes the autoservice thread (Thread 64) to block until Thread 368, which is trying to run an ast_autoservice_stop() and wait for the autoservice thread to release whatever channel it's waiting to hold indefinitely.

So ordering is as follows:

Thread 368 (realtime lookup thread): Grabs conlock -> eventually stops autoservice and waits for autoservice thread loop number to increment (and ast_read() to complete) but never happens because Thread 64 is stuck waiting for conlock

Thread 64 (autoservice thread): Does ast_read() -> receives fax tone in channel pvt read handler and then does "fax" exten lookup which tries to wait on conlock and never can get it from Thread 368 due to busy loop waiting for autoservice thread counter incrementing.

My suggestion is to remove ast_autoservice_start() and ast_autoservice_stop() wrappers when calling to a realtime backend. They were originally added to help when realtime backends took a long time to respond, and sometimes this delay on servicing the channel would cause kernel level DAHDI channel buffers to overrun and drop audio data and miss fax detection tones as reported in:

https://github.com/asterisk/asterisk/commit/21a22fa584436b4417aa44a131979a1d2fbe8e72

The commit is a bit complicated, but the autoservice_start()/_stop() wrapping was added here. In this version, the start happens prior to grabbing conlock (by ast_rdlock_contexts()) and stop() happens after conlock is released, which should be a safe ordering and avoid the problem above. Because this would cause autoservice to start for every type of dialplan lookup, realtime or not, it was then changed to reduce the scope of the autoservice_start() and _stop() to just do it for an external pbx lookup in a later commit, linked below. Unfortunately this tightening of the _start()/_stop() inadvertently changed the ordering so that autoservice_stop() is attempted while conlock is still held, which causes the "deadlock" as mentioned above.

This is the patch that introduced the faulty ordering change: https://github.com/asterisk/asterisk/commit/f52c0686054c0cde4ab92e89150099e79e5e6c5e

I believe the consequence of sometimes missing a couple of kernel DAHDI buffers caused by removing both remaining sets of autoservice_start()/_stop() calls from pbx_findextension() in pbx.c when querying an external pbx backend to be much less severe than potentially causing a complete lockup of Asterisk when using a non-native dialplan pbx switch like realtime, dundi, lua, iax2, or whatever when a fax comes in at just the wrong time.

Relevant log output

No response

Asterisk Issue Guidelines