asterisk / asterisk

The official Asterisk Project repository.
https://www.asterisk.org
Other
1.97k stars 924 forks source link

[bug]: FATAL: unhandled exception PJLIB/No memory! #727

Open mikepultz opened 2 weeks ago

mikepultz commented 2 weeks ago

Severity

Major

Versions

21.2.0

Components/Modules

pjproject

Operating Environment

Amazon Linux 2 (RHEL)

Frequency of Occurrence

One Time

Issue Description

One of our Asterisk instances crashed today when trying to allocate memory to a pj_pool:

[2024-04-30 11:11:04] ERROR[28315]: pjproject:<?>: except.c ..!!!FATAL: unhandled exception PJLIB/No memory!

This is the first time it's happened, and we have 6 identical instances total (load balanced) that have been running for about a month, and so far it's only happened the once.

The system is not experiencing memory issues, and it doesn't appear to be a memory leak (no visible downward trend over time in on our Available memory graphs):

image

The servers are not under a high load - there was only around 120 active calls when this happened. I've included a backtrace from the core dump.

It's a production device, so I'm not able to run it under Valgrind, but I can probably provide some redacted config files if that's helpful..

Relevant log output

#0  0x00001479505e6cd0 in __longjmp_chk () from /lib64/libc.so.6
#1  0x0000147952e2952e in pj_throw_exception_ (exception_id=1) at ../src/pj/except.c:53
#2  0x0000147952d96f90 in pool_callback (pool=<optimized out>, size=<optimized out>) at ../src/pjsip/sip_endpoint.c:143
#3  0x0000147952e2d9f5 in pj_pool_create_block (size=7598263421698388000, pool=0x14792c65baa0) at ../src/pj/pool.c:62
#4  pj_pool_allocate_find (pool=0x14792c65baa0, size=7598263421698387232) at ../src/pj/pool.c:148
#5  0x0000147952e2da52 in pj_pool_alloc (pool=<optimized out>, size=<optimized out>) at ../include/pj/pool_i.h:61
#6  0x0000147952e39b59 in pj_strdup (pool=pool@entry=0x14792c65baa0, dst=dst@entry=0x147902c0f7f8, src=0x1e58c48) at ../include/pj/string_i.h:42
#7  0x0000147952d9fab9 in pjsip_tpmgr_find_local_addr2 (tpmgr=0x1e3f5c8, pool=pool@entry=0x14792c65baa0, prm=prm@entry=0x147902c0f7d0) at ../src/pjsip/sip_transport.c:1770
#8  0x000014790b63ada5 in filter_on_tx_message (tdata=0x14792c65bb48) at res_pjsip/pjsip_message_filter.c:253
#9  0x0000147952d96f3c in endpt_on_tx_msg (endpt=<optimized out>, tdata=0x14792c65bb48) at ../src/pjsip/sip_endpoint.c:1125
#10 0x0000147952d9e9a4 in pjsip_transport_send (tr=0x14793d6f9838, tdata=tdata@entry=0x14792c65bb48, addr=addr@entry=0x1479441374f4, addr_len=28, token=token@entry=0x147944137408, 
    cb=cb@entry=0x147952db02b0 <transport_callback>) at ../src/pjsip/sip_transport.c:944
#11 0x0000147952db0715 in tsx_send_msg (tsx=0x147944137408, tdata=0x14792c65bb48) at ../src/pjsip/sip_transaction.c:2351
#12 0x0000147952db296e in tsx_on_state_proceeding_uas (event=0x147902c0fa50, tsx=0x147944137408) at ../src/pjsip/sip_transaction.c:2948
#13 tsx_on_state_trying (tsx=0x147944137408, event=0x147902c0fa50) at ../src/pjsip/sip_transaction.c:2878
#14 0x0000147952db4707 in pjsip_tsx_send_msg (tsx=0x147944137408, tdata=tdata@entry=0x14792c65bb48) at ../src/pjsip/sip_transaction.c:1969
#15 0x000014790b62bf61 in ast_sip_send_stateful_response (rdata=rdata@entry=0x14793ed75598, tdata=0x14792c65bb48, sip_endpoint=sip_endpoint@entry=0x14792cf94368) at res_pjsip.c:2460
#16 0x000014790b63178f in send_options_response (rdata=0x14793ed75598, code=code@entry=200) at res_pjsip/pjsip_options.c:256
#17 0x000014790b63190f in options_on_rx_request (rdata=<optimized out>) at res_pjsip/pjsip_options.c:305
#18 0x0000147952d97ca1 in pjsip_endpt_process_rx_data (endpt=0x1bdb5a8, rdata=rdata@entry=0x14793ed75598, p=p@entry=0x14790b87a400 <param>, p_handled=p_handled@entry=0x147902c0fc1c)
    at ../src/pjsip/sip_endpoint.c:930
#19 0x000014790b6466ef in distribute (data=0x14793ed75598) at res_pjsip/pjsip_distributor.c:952
#20 0x00000000005a0bee in ast_taskprocessor_execute ()
#21 0x00000000005a732b in execute_tasks ()
#22 0x00000000005a0bee in ast_taskprocessor_execute ()
#23 0x00000000005a7c81 in worker_start ()
#24 0x00000000005af939 in dummy_start ()
#25 0x00001479524e944b in start_thread () from /lib64/libpthread.so.0
#26 0x00001479505d652f in clone () from /lib64/libc.so.6

Asterisk Issue Guidelines

jcolp commented 2 weeks ago

It's not a memory issue. It's trying to allocate an absurdly large amount, due to likely reading a value from freed memory. Unless it can be reproduced it would likely be hard to isolate and identify, but the backtrace shows it having to do with a transport so information about transports in use would be needed, along with attaching a full backtrace[1].

[1] https://docs.asterisk.org/Development/Debugging/Getting-a-Backtrace/?h=backtrace

mikepultz commented 2 weeks ago

Thanks @jcolp - yes - I probably should have mentioned the insanely large allocation size ;)

Here is my current transport configuration - used by all endpoints on the system; I'll run ast_coredumper in a few minutes to get those details.

[transport](!)
type=transport
local_net=x.x.x.x/22
external_media_address=y.y.y.y
external_signaling_address=y.y.y.y
allow_reload=yes
tos=cs5            
cos=5

[transport-udp](transport)
type=transport
protocol=udp
bind=0.0.0.0:5060

[transport-udp6](transport)
type=transport
protocol=udp
bind=[::]:5060

[transport-tcp](transport)
type=transport
protocol=tcp
bind=0.0.0.0:5060

[transport-tcp6](transport)
type=transport
protocol=tcp
bind=[::]:5060

[transport-tls](transport)
type=transport
protocol=tls
bind=0.0.0.0:5061
priv_key_file=/etc/fonolo/ssl/hostname.key
cert_file=/etc/fonolo/ssl/hostname.crt
ca_list_file=/etc/pki/tls/certs/ca-bundle.crt
cipher=ECDHE-ECDSA-AES256-GCM-SHA384,DHE-DSS-AES128-GCM-SHA256,ECDHE-ECDSA-AES128-GCM-SHA256,DHE-DSS-AES256-GCM-SHA384,ECDHE-RSA-AES256-GCM-SHA384,ECDHE-RSA-AES128-GCM-SHA256,AES256-SHA256,AES128-SHA256
verify_client=no
verify_server=no
method=tlsv1_2
allow_wildcard_certs=yes

[transport-tls6](transport)
type=transport
protocol=tls
bind=[::]:5061
priv_key_file=/etc/fonolo/ssl/hostname.key
cert_file=/etc/fonolo/ssl/hostname.crt
ca_list_file=/etc/pki/tls/certs/ca-bundle.crt
cipher=ECDHE-ECDSA-AES256-GCM-SHA384,DHE-DSS-AES128-GCM-SHA256,ECDHE-ECDSA-AES128-GCM-SHA256,DHE-DSS-AES256-GCM-SHA384,ECDHE-RSA-AES256-GCM-SHA384,ECDHE-RSA-AES128-GCM-SHA256,AES256-SHA256,AES128-SHA256
verify_client=no
verify_server=no
method=tlsv1_2
allow_wildcard_certs=yes
mikepultz commented 2 weeks ago

Results from ast_coredumper:

core-asterisk-2024-04-30T15-11-04Z-brief.txt core-asterisk-2024-04-30T15-11-04Z-full.txt core-asterisk-2024-04-30T15-11-04Z-info.txt core-asterisk-2024-04-30T15-11-04Z-locks.txt core-asterisk-2024-04-30T15-11-04Z-thread1.txt

gtjoseph commented 2 weeks ago

@mikepultz If you still have that coredump around, I'd be interested in the results of the following...

# gdb /usr/sbin/asterisk <coredump>
(gdb) frame 6
(gdb) p *dst
(gdb) p *src
(gdb) q
mikepultz commented 2 weeks ago

Yup

(gdb) frame 6
#6  0x0000147952e39b59 in pj_strdup (pool=pool@entry=0x14792c65baa0, dst=dst@entry=0x147902c0f7f8, src=0x1e58c48) at ../include/pj/string_i.h:42
42              dst->ptr = (char*)pj_pool_alloc(pool, src->slen);
(gdb) p *dst
$1 = {ptr = 0x0, slen = 0}
(gdb) p *src
$2 = {ptr = 0x3066323936316332 <error: Cannot access memory at address 0x3066323936316332>, slen = 7598263421698387232}
(gdb) 
gtjoseph commented 1 week ago

@mikepultz Thanks. That just confirms that the "src" pointer parameter is either corrupted or the contents of that location are. It looks like the build is optimized but the full coredump may help anyway.

mikepultz commented 1 week ago

hey @gtjoseph - that --tarball-coredumps is pretty aggressive- it looks like it takes a copy of most of my instance, including contents of root's and users home directories- I can't share that data from a production system.

I can include a tar of the asterisk binaries, all libraries on the system, and the core dump if that gives you what you need?

I definitely don't have the debug symbols on the system- it's a custom build that we package for our environment; I can include the build string as well if that's helpful?

Mike

gtjoseph commented 1 week ago

that --tarball-coredumps is pretty aggressive- it looks like it takes a copy of most of my instance, including contents of root's and users home directories- I can't share that data from a production system.

Eh what??? --tarball-coredumps should only grab the coredump itself, the *.txt files, the asterisk binary, the modules, and /etc/os-release. It should never try root or home directories or anything else. Not even /etc/asterisk. I know it works fine on RHEL but I wonder if Amazon Linux does something goofy with the directory layout.

In any case, what we'd really need is the asterisk binary and modules, the accompanying debug symbols if the binaries are stripped, the coredump itself, and /etc/os-release. From that we can usually spin up a matching docker container, copy in the binaries and symbols, and run gdb. The symbols are really important though. Are you certain they're not available? You wouldn't have been able to run that gdb command snippet I gave you without them. What does the get_binaries.sh script produce?

mikepultz commented 1 week ago

Ok- so it's doing something really weird then; it was taking a while to run, so I checked the process list, which showed:

image

and when I looked in that /tmp directory, I saw:

[mike@sip4.us1 ~]$ cd /tmp/core-asterisk-2024-04-30T15-11-04Z.output/
[mike@sip4.us1 core-asterisk-2024-04-30T15-11-04Z.output]$ ls -la
total 32
drwxr-xr-x 13 root root  169 May  6 13:17 .
drwxrwxrwt 11 root root 4096 May  6 13:22 ..
lrwxrwxrwx  1 root root    7 Mar  3  2021 bin -> usr/bin
dr-xr-xr-x  4 root root 4096 Apr  1 15:28 boot
drwxr-xr-x 14 root root 4096 Apr  8 22:04 dev
drwxr-xr-x 97 root root 8192 Apr 30 14:15 etc
drwxr-xr-x  7 root root   78 Jan  8 13:25 home
lrwxrwxrwx  1 root root    7 Mar  3  2021 lib -> usr/lib
lrwxrwxrwx  1 root root    9 Mar  3  2021 lib64 -> usr/lib64
drwxr-xr-x  2 root root    6 Mar  3  2021 local
drwxr-xr-x  2 root root    6 Apr  9  2019 media
drwxr-xr-x  2 root root    6 Apr  9  2019 mnt
drwxr-xr-x  4 root root   27 Mar  3  2021 opt
drwx------  8 root root  118 May  6 13:17 proc
drwxr-xr-x  2 root root 4096 May  6 13:17 tmp
drwxr-xr-x  3 root root   19 May  6 13:17 usr

get_binaries.sh only finds the os-release file

The main issue is that we use our own RPM package for our systems- I'll see if I can build a debuginfo RPM for our package and then tar everything up to send over.

Mike

gtjoseph commented 1 week ago

That's weird. I'm sure it doesn't have anything to do with this issue but I'd like to figure out why ast_coredumper isn't working in your environment. What AMI id are you using for Amazon Linux 2? Are you doing anything special with the filesystem layout or asterisk installation directories? Also, the core-asterisk-2024-04-30T15-11-04Z-info.txt file has no good info in it so can you tell me the exact version of asterisk you're running? If you're building from git, the commit-id would work. Are you applying your own patches to the source?

mikepultz commented 1 week ago

hey @gtjoseph

I just emailed in a dropbox link with the files; hopefully it has everything you need - if there's something missing or would make it easier, just let me know and I'll see if I can provide it.

RE: ast_coredumper

I've included the full asterisk install dir (including ast_coredumper) from my system, as well as a .patch file with all the changes I make, and all the config files, with the layout from my system in that dropbox file I sent- hopefully that gives you what you need.

Mike