Closed dovi5988 closed 4 weeks ago
Unfortunately the backtraces don't show anything abnormal - it shows a normal running Asterisk. I suspect this needs fresh eyes with a fresh perspective, so I'll leave this in triage for someone else to look at.
@dovi5988 when in this state, what is the output from pjsip show endpoints
?
@mbradeen I wasn't logging it at the time the issue happened. The box is "in that state now" so I was able to get the output as it is now (not from when it happened). I can add it to my bash script for the next time it happens.
a15-c3-njr2*CLI> pjsip show endpoints
Endpoint: <Endpoint/CID.....................................> <State.....> <Channels.>
I/OAuth: <AuthId/UserName...........................................................>
Aor: <Aor............................................> <MaxContact>
Contact: <Aor/ContactUri..........................> <Hash....> <Status> <RTT(ms)..>
Transport: <TransportId........> <Type> <cos> <tos> <BindAddress..................>
Identify: <Identify/Endpoint.........................................................>
Match: <criteria.........................>
Channel: <ChannelId......................................> <State.....> <Time.....>
Exten: <DialedExten...........> CLCID: <ConnectedLineCID.......>
==========================================================================================
Endpoint: CDOutbound Not in use 0 of inf
OutAuth: CDOutbound-oauth/800
Aor: CDOutbound 0
Contact: CDOutbound/sip:xx.xx.xx.10 bc8145df77 NonQual nan
Identify: CDOutbound-identify/CDOutbound
Match: xx.xx.xx.10/32
Endpoint: CT Not in use 0 of inf
OutAuth: CT-oauth/690
Aor: CT 0
Contact: CT/sip:xx.xx.xx.10 bc8145df77 NonQual nan
Identify: CT-identify/CT
Match: xx.xx.xx.10/32
Endpoint: GGOutbound Not in use 0 of inf
OutAuth: GGOutbound-oauth/55559998
Aor: GGOutbound 0
Contact: GGOutbound/sip:xx.xx.xx.10 bc8145df77 NonQual nan
Identify: GGOutbound-identify/GGOutbound
Match: xx.xx.xx.10/32
Endpoint: LBOutbound Not in use 0 of inf
OutAuth: LBOutbound-oauth/55559999
Aor: LBOutbound 0
Contact: LBOutbound/sip:xx.xx.xx.10 bc8145df77 NonQual nan
Identify: LBOutbound-identify/LBOutbound
Match: xx.xx.xx.10/32
Endpoint: TeleC Not in use 0 of inf
Aor: TeleC 0
Contact: TeleC/sip:yy.yy.yy.166 33015c6998 Avail 8.222
Contact: TeleC/sip:yy.yy.yy.10 3f89775c62 Avail 0.818
Contact: TeleC/sip:yy.yy.yy.214 49eec15449 Avail 8.312
Contact: TeleC/sip:yy.yy.yy.146 ab301b8925 Unavail nan
Contact: TeleC/sip:xx.xx.xx.10 bc8145df77 Avail 1.013
Identify: TeleC-identify/TeleC
Match: yy.yy.yy.214/32
Match: yy.yy.yy.166/32
Match: yy.yy.yy.146/32
Match: yy.yy.yy.10/32
Match: xx.xx.xx.10/32
Endpoint: endpoint-external Unavailable 79 of inf
Aor: generic-aor 0
Identify: identity-fpp/endpoint-external
Match: xx.xx.xx.10/32
Match: xx.xx.xx.29/32
Match: yy.yy.yy.10/32
Match: yy.yy.yy.10/32
Match: yy.yy.yy.11/32
Identify: identity-TeleC/endpoint-external
Match: yy.yy.yy.166/32
Match: yy.yy.yy.214/32
Match: yy.yy.yy.126/32
Match: yy.yy.yy.125/32
Match: yy.yy.yy.0/30
Match: yy.yy.yy.0/30
Objects found: 6
@dovi5988 thank you - are there also a lot of contacts for endpoint-external? How many taskprocessors are you seeing?
@mbradeen It has two identities. One with 5 individual IP's and the other has 3 individual IP's and two more entries with a /30 for each. My script runs once a minute and the task-processors are normally 0. When the issue happens they go up and never come down. In this case attached is what it looked like.
@dovi5988 I think we need a full log showing Asterisk accumulating the taskprocessors.
@mbradeen Can you advise what you mean by that? Do you want the full asterisk log with any debugging? This is what I am using now, what should be added?
#!/bin/bash
# Define the minimum free space required in GB
MIN_FREE_SPACE_GB=50
# Get the available space on the root filesystem in GB
available_space=$(df --output=avail -BG / | tail -n 1 | tr -dc '0-9')
# Check if the available space is less than the minimum required
if [ "$available_space" -lt "$MIN_FREE_SPACE_GB" ]; then
echo "Error: Less than $MIN_FREE_SPACE_GB GB free space on root file system. Exiting."
exit 1
fi
# Define the lock file path
LOCKFILE="/var/run/ast_coredumper.lock"
DUMP_INDICATOR="/var/run/DO_DUMP.lock"
/usr/bin/rm $DUMP_INDICATOR
# Run the command and capture its output
output=$(/usr/sbin/asterisk -rx 'core show taskprocessors')
# Get the current date and time
datetime=$(/usr/bin/date "+%Y-%m-%d %H:%M:%S")
# Initialize the DO_DUMP variable
DO_DUMP=0
# Process each line of the output
echo "$output" | awk -v date="$datetime" '{
# Check if the line starts with "Processor" or the "In queue" value is not zero
# Assuming "In queue" is the 3rd column; adjust $3 if necessary
if ($1 == "Processor" || ($3 != "0" && $3 != "In" && $3 != "")) {
print "["date"] " $0
}
}'
echo "$output" | while read -r line; do
if echo "$line" | awk '{if ($3 ~ /^[0-9]+$/ && $3 > 500) exit 0; else exit 1}'; then
/usr/bin/touch $DUMP_INDICATOR
# break
fi
done
# Check if the lock file exists
if [ -e "$LOCKFILE" ]; then
print "["date"] " "Lock file exists. Exiting to prevent multiple instances of ast_coredumper."
exit 1
fi
# If DO_DUMP is set to 1, run the ast_coredumper command
if [ -e "$DUMP_INDICATOR" ]; then
echo "BOOO"
# Create the lock file to prevent multiple instances
touch "$LOCKFILE"
# Run the ast_coredumper command
/var/lib/asterisk/scripts/ast_coredumper --RUNNING --no-default-search --asterisk-bin=/usr/sbin/asterisk >> /var/log/ast_coredumper.log 2>&1
echo "DONE!"
# Remove the lock file after completion
rm -f "$LOCKFILE"
/usr/bin/rm $DUMP_INDICATOR
fi
@dovi5988 - My apologies, I am referring to the Asterisk log:
https://docs.asterisk.org/Configuration/Core-Configuration/Logging-Configuration/
full => debug,error,notice,verbose,warning,trace
@mbradeen I went through the full logs for the time that I took the backtrace but I see nothing in the logs (no ongoing calls etc.). I restarted asterisk now and I am going to wait for it to fail. I don't mind adding a 100GB disk and adding extensive logging. core set debug 9 logs every rtp packet and that will add up too fast. any way to debug everything except the rtp packets?
This issue has been marked stale because it has been open 7 days with no feedback. Please provide feedback within 7 days or this issue will be closed.
The issue happened again today. I am currently traveling and hope to upload the logs and fresh dumps in the next day or so.
Hi,
Please see attatched files. As per task_proc.log the issue seems to have started between 2024-07-24 17:26:01 and 2024-07-24 17:26:01. The back trace was taken by cron job at 17:32 once the queue was larger than 500. The time stamps are EDT unless otherwise specified in the logs themselves. I replaced sensative data such as calling and called numbers, peer names etc. The full log was too large so I compressed it. core-asterisk-running-2024-07-24T21-32-02Z-locks.txt core-asterisk-running-2024-07-24T21-32-02Z-info.txt core-asterisk-running-2024-07-24T21-32-02Z-full.txt back_trace.log full.log.gz task_proc.log core-asterisk-running-2024-07-24T21-32-02Z-thread1.txt core-asterisk-running-2024-07-24T21-32-02Z-brief.txt
Hi @dovi5988
Thank you for providing the logs and trace. From the full log, it looks like there are waves of instances of Asterisk unable to create a new thread or fork. These start before the taskprocessor jumps but continue sporadically. Is there a limit on the number of threads or processors for this system or user? Otherwise this is a sign of a system resource exhaustion.
Please try monitoring system resources to see what is preventing Asterisk from forking or starting new threads.
@mbradeen No limits that I am aware of
root@a15-c3-njr2:~# cat /proc/1196665/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size unlimited unlimited bytes
Max core file size unlimited unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 65535 65535 processes
Max open files 1048576 1048576 files
Max locked memory 8388608 8388608 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 31566 31566 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
root@a15-c3-njr2:~#
The only thing I can think of is that the box is a VM and the host has some sort of issues? What else should I look at, at the OS level?
If this is running in a VM, the hypervisor may be limiting the number of running processes.
If not, then monitoring the system via tool like top or htop at the time this happens should show the resource bottleneck. You can also check /var/log/messages to see if there are any further error details.
From there you should be able to adjust the VM's resources accordingly, although removing one bottleneck may uncover another.
This issue has been marked stale because it has been open 7 days with no feedback. Please provide feedback within 7 days or this issue will be closed.
This issue has been closed because it has been open 14 days with no feedback.
Severity
Minor
Versions
20.4.0, 21.1.0
Components/Modules
MusicOnHold
Operating Environment
Both asterisk versions are running on Debian 12 on VMWare
Frequency of Occurrence
Occasional
Issue Description
Hi,
This issue started with my posts to the Asterisk forums. Please see https://community.asterisk.org/t/overloaded-task-processors/101581/14 and https://community.asterisk.org/t/overloaded-task-processors-follow-up/103096.
In summary the processor for stasis/p:endpoint:PJSIP/endpoint-external would randomly start crawling up at a random time and would never go back down. There were some anecdotal issues with getting sound files but I suspect that's more of a symptom then the cause of the issue. This happens randomly. The boxes can be up for weeks with no issues and it can happen a few days after an Asterisk restart. I wrote a script that should it notice the task processor for "stasis/p:endpoint:PJSIP/endpoint-external" going up to run
/var/lib/asterisk/scripts/ast_coredumper --RUNNING --no-default-search --asterisk-bin=/usr/sbin/asterisk
. I tried to sanitize the logs as best as possible of private information, at first manually so the CLI's and the called numbers may not match up between the different logs. I have the originals if needed. core-asterisk-running-2024-06-27T00-25-05Z-thread1.txt core-asterisk-running-2024-06-27T00-25-05Z-info.txt core-asterisk-running-2024-06-27T00-25-05Z-full.txt core-asterisk-running-2024-06-27T00-25-05Z-brief.txt back_trace.txt core-asterisk-running-2024-06-27T00-25-05Z-locks.txtRelevant log output
No response
Asterisk Issue Guidelines