asterisk / asterisk

The official Asterisk Project repository.
https://www.asterisk.org
Other
2.18k stars 971 forks source link

[bug]: Issue with overloaded task processors #788

Closed dovi5988 closed 4 weeks ago

dovi5988 commented 3 months ago

Severity

Minor

Versions

20.4.0, 21.1.0

Components/Modules

MusicOnHold

Operating Environment

Both asterisk versions are running on Debian 12 on VMWare

Frequency of Occurrence

Occasional

Issue Description

Hi,

This issue started with my posts to the Asterisk forums. Please see https://community.asterisk.org/t/overloaded-task-processors/101581/14 and https://community.asterisk.org/t/overloaded-task-processors-follow-up/103096.

In summary the processor for stasis/p:endpoint:PJSIP/endpoint-external would randomly start crawling up at a random time and would never go back down. There were some anecdotal issues with getting sound files but I suspect that's more of a symptom then the cause of the issue. This happens randomly. The boxes can be up for weeks with no issues and it can happen a few days after an Asterisk restart. I wrote a script that should it notice the task processor for "stasis/p:endpoint:PJSIP/endpoint-external" going up to run /var/lib/asterisk/scripts/ast_coredumper --RUNNING --no-default-search --asterisk-bin=/usr/sbin/asterisk. I tried to sanitize the logs as best as possible of private information, at first manually so the CLI's and the called numbers may not match up between the different logs. I have the originals if needed. core-asterisk-running-2024-06-27T00-25-05Z-thread1.txt core-asterisk-running-2024-06-27T00-25-05Z-info.txt core-asterisk-running-2024-06-27T00-25-05Z-full.txt core-asterisk-running-2024-06-27T00-25-05Z-brief.txt back_trace.txt core-asterisk-running-2024-06-27T00-25-05Z-locks.txt

Relevant log output

No response

Asterisk Issue Guidelines

jcolp commented 2 months ago

Unfortunately the backtraces don't show anything abnormal - it shows a normal running Asterisk. I suspect this needs fresh eyes with a fresh perspective, so I'll leave this in triage for someone else to look at.

mbradeen commented 2 months ago

@dovi5988 when in this state, what is the output from pjsip show endpoints?

dovi5988 commented 2 months ago

@mbradeen I wasn't logging it at the time the issue happened. The box is "in that state now" so I was able to get the output as it is now (not from when it happened). I can add it to my bash script for the next time it happens.

a15-c3-njr2*CLI> pjsip show endpoints

 Endpoint:  <Endpoint/CID.....................................>  <State.....>  <Channels.>
    I/OAuth:  <AuthId/UserName...........................................................>
        Aor:  <Aor............................................>  <MaxContact>
      Contact:  <Aor/ContactUri..........................> <Hash....> <Status> <RTT(ms)..>
  Transport:  <TransportId........>  <Type>  <cos>  <tos>  <BindAddress..................>
   Identify:  <Identify/Endpoint.........................................................>
        Match:  <criteria.........................>
    Channel:  <ChannelId......................................>  <State.....>  <Time.....>
        Exten: <DialedExten...........>  CLCID: <ConnectedLineCID.......>
==========================================================================================

 Endpoint:  CDOutbound                                      Not in use    0 of inf
    OutAuth:  CDOutbound-oauth/800
        Aor:  CDOutbound                                    0
      Contact:  CDOutbound/sip:xx.xx.xx.10         bc8145df77 NonQual         nan
   Identify:  CDOutbound-identify/CDOutbound
        Match: xx.xx.xx.10/32

 Endpoint:  CT                                      Not in use    0 of inf
    OutAuth:  CT-oauth/690
        Aor:  CT                                    0
      Contact:  CT/sip:xx.xx.xx.10         bc8145df77 NonQual         nan
   Identify:  CT-identify/CT
        Match: xx.xx.xx.10/32

 Endpoint:  GGOutbound                                    Not in use    0 of inf
    OutAuth:  GGOutbound-oauth/55559998
        Aor:  GGOutbound                                  0
      Contact:  GGOutbound/sip:xx.xx.xx.10       bc8145df77 NonQual         nan
   Identify:  GGOutbound-identify/GGOutbound
        Match: xx.xx.xx.10/32

 Endpoint:  LBOutbound                                      Not in use    0 of inf
    OutAuth:  LBOutbound-oauth/55559999
        Aor:  LBOutbound                                    0
      Contact:  LBOutbound/sip:xx.xx.xx.10         bc8145df77 NonQual         nan
   Identify:  LBOutbound-identify/LBOutbound
        Match: xx.xx.xx.10/32

 Endpoint:  TeleC                                            Not in use    0 of inf
        Aor:  TeleC                                          0
      Contact:  TeleC/sip:yy.yy.yy.166                 33015c6998 Avail         8.222
      Contact:  TeleC/sip:yy.yy.yy.10                3f89775c62 Avail         0.818
      Contact:  TeleC/sip:yy.yy.yy.214                49eec15449 Avail         8.312
      Contact:  TeleC/sip:yy.yy.yy.146                 ab301b8925 Unavail         nan
      Contact:  TeleC/sip:xx.xx.xx.10               bc8145df77 Avail         1.013
   Identify:  TeleC-identify/TeleC
        Match: yy.yy.yy.214/32
        Match: yy.yy.yy.166/32
        Match: yy.yy.yy.146/32
        Match: yy.yy.yy.10/32
        Match: xx.xx.xx.10/32

 Endpoint:  endpoint-external                                    Unavailable   79 of inf
        Aor:  generic-aor                                        0
   Identify:  identity-fpp/endpoint-external
        Match: xx.xx.xx.10/32
        Match: xx.xx.xx.29/32
        Match: yy.yy.yy.10/32
        Match: yy.yy.yy.10/32
        Match: yy.yy.yy.11/32
   Identify:  identity-TeleC/endpoint-external
        Match: yy.yy.yy.166/32
        Match: yy.yy.yy.214/32
        Match: yy.yy.yy.126/32
        Match: yy.yy.yy.125/32
        Match: yy.yy.yy.0/30
        Match: yy.yy.yy.0/30

Objects found: 6
mbradeen commented 2 months ago

@dovi5988 thank you - are there also a lot of contacts for endpoint-external? How many taskprocessors are you seeing?

dovi5988 commented 2 months ago

@mbradeen It has two identities. One with 5 individual IP's and the other has 3 individual IP's and two more entries with a /30 for each. My script runs once a minute and the task-processors are normally 0. When the issue happens they go up and never come down. In this case attached is what it looked like.

2024-07-03.log

mbradeen commented 2 months ago

@dovi5988 I think we need a full log showing Asterisk accumulating the taskprocessors.

dovi5988 commented 2 months ago

@mbradeen Can you advise what you mean by that? Do you want the full asterisk log with any debugging? This is what I am using now, what should be added?

#!/bin/bash

# Define the minimum free space required in GB
MIN_FREE_SPACE_GB=50

# Get the available space on the root filesystem in GB
available_space=$(df --output=avail -BG / | tail -n 1 | tr -dc '0-9')

# Check if the available space is less than the minimum required
if [ "$available_space" -lt "$MIN_FREE_SPACE_GB" ]; then
    echo "Error: Less than $MIN_FREE_SPACE_GB GB free space on root file system. Exiting."
    exit 1
fi

# Define the lock file path
LOCKFILE="/var/run/ast_coredumper.lock"
DUMP_INDICATOR="/var/run/DO_DUMP.lock"
/usr/bin/rm $DUMP_INDICATOR

# Run the command and capture its output
output=$(/usr/sbin/asterisk -rx 'core show taskprocessors')

# Get the current date and time
datetime=$(/usr/bin/date "+%Y-%m-%d %H:%M:%S")

# Initialize the DO_DUMP variable
DO_DUMP=0

# Process each line of the output
echo "$output" | awk -v date="$datetime" '{
    # Check if the line starts with "Processor" or the "In queue" value is not zero
    # Assuming "In queue" is the 3rd column; adjust $3 if necessary
    if ($1 == "Processor" || ($3 != "0" && $3 != "In" && $3 != "")) {
        print "["date"] " $0
    }
}'

echo "$output" | while read -r line; do
    if echo "$line" | awk '{if ($3 ~ /^[0-9]+$/ && $3 > 500) exit 0; else exit 1}'; then
        /usr/bin/touch $DUMP_INDICATOR
#        break
    fi
done

# Check if the lock file exists
if [ -e "$LOCKFILE" ]; then
    print "["date"] " "Lock file exists. Exiting to prevent multiple instances of ast_coredumper."
    exit 1
fi

# If DO_DUMP is set to 1, run the ast_coredumper command
if [ -e "$DUMP_INDICATOR" ]; then
    echo "BOOO"
    # Create the lock file to prevent multiple instances
    touch "$LOCKFILE"

    # Run the ast_coredumper command
    /var/lib/asterisk/scripts/ast_coredumper --RUNNING --no-default-search --asterisk-bin=/usr/sbin/asterisk >> /var/log/ast_coredumper.log 2>&1

    echo "DONE!"
    # Remove the lock file after completion
    rm -f "$LOCKFILE"
    /usr/bin/rm $DUMP_INDICATOR
fi
mbradeen commented 2 months ago

@dovi5988 - My apologies, I am referring to the Asterisk log:

https://docs.asterisk.org/Configuration/Core-Configuration/Logging-Configuration/

full => debug,error,notice,verbose,warning,trace

dovi5988 commented 2 months ago

@mbradeen I went through the full logs for the time that I took the backtrace but I see nothing in the logs (no ongoing calls etc.). I restarted asterisk now and I am going to wait for it to fail. I don't mind adding a 100GB disk and adding extensive logging. core set debug 9 logs every rtp packet and that will add up too fast. any way to debug everything except the rtp packets?

asteriskteam commented 2 months ago

This issue has been marked stale because it has been open 7 days with no feedback. Please provide feedback within 7 days or this issue will be closed.

dovi5988 commented 2 months ago

The issue happened again today. I am currently traveling and hope to upload the logs and fresh dumps in the next day or so.

dovi5988 commented 2 months ago

Hi,

Please see attatched files. As per task_proc.log the issue seems to have started between 2024-07-24 17:26:01 and 2024-07-24 17:26:01. The back trace was taken by cron job at 17:32 once the queue was larger than 500. The time stamps are EDT unless otherwise specified in the logs themselves. I replaced sensative data such as calling and called numbers, peer names etc. The full log was too large so I compressed it. core-asterisk-running-2024-07-24T21-32-02Z-locks.txt core-asterisk-running-2024-07-24T21-32-02Z-info.txt core-asterisk-running-2024-07-24T21-32-02Z-full.txt back_trace.log full.log.gz task_proc.log core-asterisk-running-2024-07-24T21-32-02Z-thread1.txt core-asterisk-running-2024-07-24T21-32-02Z-brief.txt

mbradeen commented 1 month ago

Hi @dovi5988

Thank you for providing the logs and trace. From the full log, it looks like there are waves of instances of Asterisk unable to create a new thread or fork. These start before the taskprocessor jumps but continue sporadically. Is there a limit on the number of threads or processors for this system or user? Otherwise this is a sign of a system resource exhaustion.

Please try monitoring system resources to see what is preventing Asterisk from forking or starting new threads.

dovi5988 commented 1 month ago

@mbradeen No limits that I am aware of

root@a15-c3-njr2:~# cat /proc/1196665/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            unlimited            unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             65535                65535                processes 
Max open files            1048576              1048576              files     
Max locked memory         8388608              8388608              bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       31566                31566                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
root@a15-c3-njr2:~#

The only thing I can think of is that the box is a VM and the host has some sort of issues? What else should I look at, at the OS level?

mbradeen commented 1 month ago

If this is running in a VM, the hypervisor may be limiting the number of running processes.

If not, then monitoring the system via tool like top or htop at the time this happens should show the resource bottleneck. You can also check /var/log/messages to see if there are any further error details.

From there you should be able to adjust the VM's resources accordingly, although removing one bottleneck may uncover another.

asteriskteam commented 1 month ago

This issue has been marked stale because it has been open 7 days with no feedback. Please provide feedback within 7 days or this issue will be closed.

asteriskteam commented 4 weeks ago

This issue has been closed because it has been open 14 days with no feedback.