AllStarLink / app_rpt

Refactoring and upgrade of AllStarLink's app_rpt, etc.
4 stars 4 forks source link

core stop gracefully hangs #166

Open tsawyer opened 1 year ago

tsawyer commented 1 year ago

*CLI> core stop gracefully hangs the cli and continuously prints:

[2023-05-22 08:27:38.780] ERROR[788418] manager.c: Unable to process manager action 'RptStatus'. Asterisk is shutting down.
[2023-05-22 08:27:38.784] ERROR[788418] manager.c: Unable to process manager action 'RptStatus'. Asterisk is shutting down.
[2023-05-22 08:27:38.788] ERROR[788418] manager.c: Unable to process manager action 'RptStatus'. Asterisk is shutting down.
InterLinked1 commented 1 year ago

Typically this indicates the channels container is locked somewhere in some kind of deadlock. Manager requests aren't processed during shutdown so that bit is to be expected.

A full backtrace is definitely needed.

tsawyer commented 1 year ago

I believe this is the full backtrace from that time.

core-asterisk-2023-05-22T14-11-21Z-full.txt

tsawyer commented 1 year ago

This might be a clew why core stop gracefully hangs. Notice the channels "Refusing due to active shutdown" messages. Also note that the channels attempt to restart when shutting down.

FYI... I created this console dump by starting with asterisk -c as I am concerned that the ast-asterisk.service might impact debugging the issue. Then doing core stop now.


      core stop now
[2023-06-08 12:01:24.877] Beginning asterisk shutdown....
[2023-06-08 12:01:24.978]     -- Hungup 'DAHDI/pseudo-161833083'
[2023-06-08 12:01:24.982]     -- Hungup 'DAHDI/pseudo-979897630'
[2023-06-08 12:01:24.984]     -- Hungup 'DAHDI/pseudo-11751437'
[2023-06-08 12:01:24.993]     -- Hungup 'DAHDI/pseudo-635202703'
[2023-06-08 12:01:24.994]     -- Hungup 'DAHDI/pseudo-1007224127'
[2023-06-08 12:01:24.999]     -- Hungup 'DAHDI/pseudo-1903152809'
[2023-06-08 12:01:25.008]     -- Hungup 'DAHDI/pseudo-1968121433'
[2023-06-08 12:01:25.009]     -- Hungup 'DAHDI/pseudo-410232704'
[2023-06-08 12:01:25.015]     -- Hungup 'DAHDI/pseudo-1555853723'
[2023-06-08 12:01:25.022]     -- Hungup 'DAHDI/pseudo-1706859608'
[2023-06-08 12:01:25.023]     -- Hungup 'DAHDI/pseudo-1106395156'
[2023-06-08 12:01:25.030]     -- Hungup 'DAHDI/pseudo-552964170'
[2023-06-08 12:01:25.038]     -- Hungup 'DAHDI/pseudo-1952052834'
[2023-06-08 12:01:25.051]     -- Hungup 'DAHDI/pseudo-414503075'
[2023-06-08 12:01:25.056]     -- Hungup 'DAHDI/pseudo-305369340'
[2023-06-08 12:01:25.061]     -- Hungup 'DAHDI/pseudo-488172553'
[2023-06-08 12:01:25.268] WARNING[1217062]: app_rpt.c:6043 rpt_master: rpt_thread restarted on node 2509
[2023-06-08 12:01:25.268] WARNING[1217062]: app_rpt.c:6043 rpt_master: rpt_thread restarted on node 25334
[2023-06-08 12:01:25.269] WARNING[1217089]: channel.c:751 __ast_channel_alloc_ap: Channel allocation failed: Refusing due to active shutdown
[2023-06-08 12:01:25.270] WARNING[1217089]: chan_simpleusb.c:2356 simpleusb_request: Unable to create new usb channel
[2023-06-08 12:01:25.270] WARNING[1217089]: app_rpt.c:3032 rpt: Sorry unable to obtain Rx channel
[2023-06-08 12:01:25.273] WARNING[1217088]: channel.c:751 __ast_channel_alloc_ap: Channel allocation failed: Refusing due to active shutdown
[2023-06-08 12:01:25.273] ERROR[1217088]: chan_voter.c:2226 voter_request: Cant alloc new asterisk channel
[2023-06-08 12:01:25.273] WARNING[1217088]: app_rpt.c:3032 rpt: Sorry unable to obtain Rx channel
[2023-06-08 12:01:25.276] Asterisk cleanly ending (0).
[2023-06-08 12:01:25.277] Executing last minute cleanups
[2023-06-08 12:01:25.279]   == Manager unregistered action DBGet
[2023-06-08 12:01:25.280]   == Manager unregistered action DBGetTree
[2023-06-08 12:01:25.280]   == Manager unregistered action DBPut
[2023-06-08 12:01:25.280]   == Manager unregistered action DBDel
[2023-06-08 12:01:25.281]   == Manager unregistered action DBDelTree
InterLinked1 commented 12 months ago

Looks like a NULL lock that libcurl tries to grab:

#1  __GI___pthread_rwlock_rdlock (rwlock=0x0) at pthread_rwlock_rdlock.c:27
        result = <optimized out>

Probably an issue with res_rpt_http_registrations though; have you seen this again since?

tsawyer commented 8 months ago

Revisiting this issue, hasn't reproduced the libcurl problem. It is locking up the Asterisk console with various process still running...

ps aux | grep asterisk
root      158819 18.4  0.4  57396  9980 pts/0    R+   09:12   0:24 rasterisk r
ps aux | grep asterisk
asterisk  158997 11.6  2.9 2203928 59552 ?       Ssl  09:17   0:17 /usr/sbin/asterisk -g -f -p -U asterisk
asterisk  158998  0.0  0.0   2328   892 ?        S    09:17   0:00 astcanary /var/run/asterisk/alt.asterisk.canary.tweet.tweet.tweet 158997
root      159103  0.0  0.4  57392  9888 pts/1    S+   09:18   0:00 rasterisk r
InterLinked1 commented 8 months ago

Revisiting this issue, hasn't reproduced the libcurl problem. It is locking up the Asterisk console with various process still running...

If it's happening again, a fresh backtrace is needed from each recurrence to compare. I actually don't think it's libcurl either, probably something else.

tsawyer commented 8 months ago

How do we get a backtrace if there's not a coredump?

InterLinked1 commented 8 months ago

How do we get a backtrace if there's not a coredump?

/var/lib/asterisk/scripts/ast_coredumper --RUNNING

tsawyer commented 8 months ago

The system I was testing (@jxmx debs) does not have gdb. Moved to another server to get this coredump.

Looks like chan_voter did not reload?

core-asterisk-2023-11-23T18-12-04Z-full.txt

InterLinked1 commented 8 months ago

The system I was testing (@jxmx debs) does not have gdb. Moved to another server to get this coredump.

Looks like chan_voter did not reload?

core-asterisk-2023-11-23T18-12-04Z-full.txt

What's the CPU usage look like? Seems like it could be an infinite loop.

tsawyer commented 8 months ago

Looks like chan_voter did not reload?

My mistake. Actually that's not from core stop gracefully. I did a module reload chan_voter when moving the RTCM to this server.

tsawyer commented 8 months ago

Here's a dump from core stop gracefully with --RUNNING

This one mentions asterisk_daemon

core-asterisk-running-2023-11-23T18-32-48Z-full.txt

InterLinked1 commented 8 months ago

Looks like chan_voter did not reload?

My mistake. Actually that's not from core stop gracefully. I did a module reload chan_voter when moving the RTCM to this server.

Still, it might be a separate bug. What was the CPU usage?

jxmx commented 8 months ago

I uploaded the debug symbol packages to https://github.com/AllStarLink/asl3-asterisk/releases/tag/v20.5.0%2Basl3-0.0.10.fadead4.deb12-1

Just install them and gdb will work fine.

InterLinked1 commented 8 months ago

Here's a dump from core stop gracefully with --RUNNING

This one mentions asterisk_daemon

core-asterisk-running-2023-11-23T18-32-48Z-full.txt

Looking at thread 1 isn't appropriate here since this isn't from a crash fault.

But looking at the CLI thread, it's waiting for all channels to exit, which makes sense. Some of the RPT modules weren't designed to exit gracefully so there are still some things to clean up here.