Open mikepultz opened 2 weeks ago
It's not a memory issue. It's trying to allocate an absurdly large amount, likely because it read a value from freed memory. Unless it can be reproduced, it will be hard to isolate and identify, but the backtrace shows that a transport is involved, so we'd need information about the transports in use, along with a full backtrace[1].
[1] https://docs.asterisk.org/Development/Debugging/Getting-a-Backtrace/?h=backtrace
Thanks @jcolp - yes - I probably should have mentioned the insanely large allocation size ;)
Here is my current transport configuration - used by all endpoints on the system; I'll run ast_coredumper in a few minutes to get those details.
[transport](!)
type=transport
local_net=x.x.x.x/22
external_media_address=y.y.y.y
external_signaling_address=y.y.y.y
allow_reload=yes
tos=cs5
cos=5
[transport-udp](transport)
type=transport
protocol=udp
bind=0.0.0.0:5060
[transport-udp6](transport)
type=transport
protocol=udp
bind=[::]:5060
[transport-tcp](transport)
type=transport
protocol=tcp
bind=0.0.0.0:5060
[transport-tcp6](transport)
type=transport
protocol=tcp
bind=[::]:5060
[transport-tls](transport)
type=transport
protocol=tls
bind=0.0.0.0:5061
priv_key_file=/etc/fonolo/ssl/hostname.key
cert_file=/etc/fonolo/ssl/hostname.crt
ca_list_file=/etc/pki/tls/certs/ca-bundle.crt
cipher=ECDHE-ECDSA-AES256-GCM-SHA384,DHE-DSS-AES128-GCM-SHA256,ECDHE-ECDSA-AES128-GCM-SHA256,DHE-DSS-AES256-GCM-SHA384,ECDHE-RSA-AES256-GCM-SHA384,ECDHE-RSA-AES128-GCM-SHA256,AES256-SHA256,AES128-SHA256
verify_client=no
verify_server=no
method=tlsv1_2
allow_wildcard_certs=yes
[transport-tls6](transport)
type=transport
protocol=tls
bind=[::]:5061
priv_key_file=/etc/fonolo/ssl/hostname.key
cert_file=/etc/fonolo/ssl/hostname.crt
ca_list_file=/etc/pki/tls/certs/ca-bundle.crt
cipher=ECDHE-ECDSA-AES256-GCM-SHA384,DHE-DSS-AES128-GCM-SHA256,ECDHE-ECDSA-AES128-GCM-SHA256,DHE-DSS-AES256-GCM-SHA384,ECDHE-RSA-AES256-GCM-SHA384,ECDHE-RSA-AES128-GCM-SHA256,AES256-SHA256,AES128-SHA256
verify_client=no
verify_server=no
method=tlsv1_2
allow_wildcard_certs=yes
@mikepultz If you still have that coredump around, I'd be interested in the results of the following...
# gdb /usr/sbin/asterisk <coredump>
(gdb) frame 6
(gdb) p *dst
(gdb) p *src
(gdb) q
Yup
(gdb) frame 6
#6 0x0000147952e39b59 in pj_strdup (pool=pool@entry=0x14792c65baa0, dst=dst@entry=0x147902c0f7f8, src=0x1e58c48) at ../include/pj/string_i.h:42
42 dst->ptr = (char*)pj_pool_alloc(pool, src->slen);
(gdb) p *dst
$1 = {ptr = 0x0, slen = 0}
(gdb) p *src
$2 = {ptr = 0x3066323936316332 <error: Cannot access memory at address 0x3066323936316332>, slen = 7598263421698387232}
(gdb)
@mikepultz Thanks. That just confirms that the "src" pointer parameter is either corrupted or the contents of that location are. It looks like the build is optimized but the full coredump may help anyway.
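One way to read the two values gdb printed for *src: on x86_64 (little-endian), both fields decode to printable ASCII, which would be consistent with the pj_str_t having been overwritten by text rather than holding a real pointer and length. A minimal Python sketch using only the numbers from the gdb output above; the helper name as_ascii is made up for illustration, and this does not identify which code wrote the text:

```python
def as_ascii(value: int, width: int = 8) -> str:
    """Interpret an integer as `width` little-endian bytes and decode them as ASCII."""
    return value.to_bytes(width, "little").decode("ascii", errors="replace")

# Values copied from `p *src` in frame 6 of the backtrace
SRC_PTR = 0x3066323936316332
SRC_SLEN = 7598263421698387232

print(repr(as_ascii(SRC_PTR)))   # '2c1692f0'
print(repr(as_ascii(SRC_SLEN)))  # ' -- Stri'
```

If that reading is right, the 16 bytes that should hold ptr and slen together spell "2c1692f0 -- Stri", which looks like a fragment of a text string.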
sudo /var/lib/asterisk/scripts/ast_coredumper --tarball-coredumps <coredump>
sudo ./get_binaries.sh /tmp
It will attempt to get the debug symbols from /usr/lib/debug if they're installed.
hey @gtjoseph - that --tarball-coredumps is pretty aggressive- it looks like it takes a copy of most of my instance, including contents of root's and users home directories- I can't share that data from a production system.
I can include a tar of the asterisk binaries, all libraries on the system, and the core dump if that gives you what you need?
I definitely don't have the debug symbols on the system- it's a custom build that we package for our environment; I can include the build string as well if that's helpful?
Mike
that --tarball-coredumps is pretty aggressive- it looks like it takes a copy of most of my instance, including contents of root's and users home directories- I can't share that data from a production system.
Eh what??? --tarball-coredumps should only grab the coredump itself, the *.txt files, the asterisk binary, the modules, and /etc/os-release. It should never try root or home directories or anything else. Not even /etc/asterisk. I know it works fine on RHEL but I wonder if Amazon Linux does something goofy with the directory layout.
In any case, what we'd really need is the asterisk binary and modules, the accompanying debug symbols if the binaries are stripped, the coredump itself, and /etc/os-release. From that we can usually spin up a matching docker container, copy in the binaries and symbols, and run gdb. The symbols are really important though. Are you certain they're not available? You wouldn't have been able to run that gdb command snippet I gave you without them. What does the get_binaries.sh script produce?
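If the helper scripts misbehave, the files listed above could also be collected by hand. A rough Python sketch of that, not a replacement for ast_coredumper: the function name build_bundle is hypothetical, and the example paths in the comments are assumptions that would need adjusting for a custom RPM layout (in particular the modules directory, which this thread does not state).

```python
import os
import tarfile

def build_bundle(paths, out_path):
    """Add each existing path to a gzip tarball; return (added, missing)."""
    added, missing = [], []
    with tarfile.open(out_path, "w:gz") as tar:
        for path in paths:
            if os.path.exists(path):
                tar.add(path)  # directories are added recursively
                added.append(path)
            else:
                missing.append(path)
    return added, missing

# Example call (paths are assumptions, shown as comments so nothing runs by default):
# added, missing = build_bundle(
#     [
#         "/usr/sbin/asterisk",            # binary path used in the gdb command above
#         "/usr/lib64/asterisk/modules",   # assumed modules directory
#         "/usr/lib/debug",                # debug symbols, if installed
#         "/etc/os-release",
#         "/path/to/coredump",             # placeholder for the actual core file
#     ],
#     "/tmp/asterisk-bundle.tar.gz",
# )
# print("missing:", missing)
```

Returning the missing paths makes it easy to see up front whether the debug symbols were actually found, and only the paths named in the list end up in the archive.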
Ok- so it's doing something really weird then; it was taking a while to run, so I checked the process list, which showed:
and when I looked in that /tmp directory, I saw:
[mike@sip4.us1 ~]$ cd /tmp/core-asterisk-2024-04-30T15-11-04Z.output/
[mike@sip4.us1 core-asterisk-2024-04-30T15-11-04Z.output]$ ls -la
total 32
drwxr-xr-x 13 root root 169 May 6 13:17 .
drwxrwxrwt 11 root root 4096 May 6 13:22 ..
lrwxrwxrwx 1 root root 7 Mar 3 2021 bin -> usr/bin
dr-xr-xr-x 4 root root 4096 Apr 1 15:28 boot
drwxr-xr-x 14 root root 4096 Apr 8 22:04 dev
drwxr-xr-x 97 root root 8192 Apr 30 14:15 etc
drwxr-xr-x 7 root root 78 Jan 8 13:25 home
lrwxrwxrwx 1 root root 7 Mar 3 2021 lib -> usr/lib
lrwxrwxrwx 1 root root 9 Mar 3 2021 lib64 -> usr/lib64
drwxr-xr-x 2 root root 6 Mar 3 2021 local
drwxr-xr-x 2 root root 6 Apr 9 2019 media
drwxr-xr-x 2 root root 6 Apr 9 2019 mnt
drwxr-xr-x 4 root root 27 Mar 3 2021 opt
drwx------ 8 root root 118 May 6 13:17 proc
drwxr-xr-x 2 root root 4096 May 6 13:17 tmp
drwxr-xr-x 3 root root 19 May 6 13:17 usr
get_binaries.sh only finds the os-release file
The main issue is that we use our own RPM package for our systems- I'll see if I can build a debuginfo RPM for our package and then tar everything up to send over.
Mike
That's weird. I'm sure it doesn't have anything to do with this issue but I'd like to figure out why ast_coredumper isn't working in your environment. What AMI id are you using for Amazon Linux 2? Are you doing anything special with the filesystem layout or asterisk installation directories? Also, the core-asterisk-2024-04-30T15-11-04Z-info.txt file has no good info in it so can you tell me the exact version of asterisk you're running? If you're building from git, the commit-id would work. Are you applying your own patches to the source?
hey @gtjoseph
I just emailed in a dropbox link with the files; hopefully it has everything you need - if there's something missing or would make it easier, just let me know and I'll see if I can provide it.
RE: ast_coredumper
amzn2-ami-hvm-2.0.20210303.0-x86_64-gp2 - but it's regularly updated, and the AMI name doesn't change, so that's not accurate; it's an up-to-date Amazon Linux 2 instance with kernel 5.15.152-100.162.amzn2.x86_64. There's nothing really "strange" about the system setup or layout.
I've included the full asterisk install dir (including ast_coredumper) from my system, as well as a .patch file with all the changes I make, and all the config files, with the layout from my system, in that dropbox file I sent. Hopefully that gives you what you need.
Mike
Severity
Major
Versions
21.2.0
Components/Modules
pjproject
Operating Environment
Amazon Linux 2 (RHEL)
Frequency of Occurrence
One Time
Issue Description
One of our Asterisk instances crashed today when trying to allocate memory to a pj_pool:
[2024-04-30 11:11:04] ERROR[28315]: pjproject:<?>: except.c ..!!!FATAL: unhandled exception PJLIB/No memory!
This is the first time it's happened, and we have 6 identical instances total (load balanced) that have been running for about a month, and so far it's only happened the once.
The system is not experiencing memory issues, and it doesn't appear to be a memory leak (no visible downward trend over time on our Available memory graphs):
The servers are not under a high load - there were only around 120 active calls when this happened. I've included a backtrace from the core dump.
It's a production device, so I'm not able to run it under Valgrind, but I can probably provide some redacted config files if that's helpful.
Relevant log output