OpenSIPS / opensips

OpenSIPS is a GPL implementation of a multi-functionality SIP server that aims to deliver a high-level technical solution (performance, security and quality) for use in professional SIP server platforms.
https://opensips.org

[CRASH] rest_client TLS crash #1905

Open · bcnewlin opened this issue 4 years ago

bcnewlin commented 4 years ago

OpenSIPS version you are running

version: opensips 2.4.6 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, QM_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: aa89b8764
main.c compiled on 00:37:59 Nov 23 2019 with gcc 7

Crash Core Dump

https://pastebin.com/LKt2KRkg

Describe the traffic that generated the bug

The crash occurs when running async REST calls over HTTPS on an OpenSIPS instance that also has tls_mgm loaded. Crashes were not observed on instances making async REST HTTPS calls without tls_mgm. We plan to test with synchronous REST calls to determine whether the asynchronous mode is a factor.

To Reproduce

  1. Start OpenSIPS with rest_client and tls_mgm loaded.
  2. Start traffic that triggers an async REST call during processing (see the sketch below).
  3. OpenSIPS crashes.
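
A minimal configuration sketch of this scenario (listen addresses, target URL and route names are placeholders, and the tls_mgm certificate parameters are omitted; this is not the actual deployment config):

loadmodule "proto_udp.so"
loadmodule "proto_tls.so"
loadmodule "tls_mgm.so"        # SIP TLS support; certificate modparams omitted here
loadmodule "tm.so"
loadmodule "rest_client.so"

listen = udp:10.0.0.10:5060
listen = tls:10.0.0.10:5061

route {
    # async HTTPS request issued while processing SIP traffic
    async(rest_get("https://example.com/callData", $var(body)), rest_reply);
}

route [rest_reply] {
    xlog("REST reply body size [$(var(body){s.len})]\n");
}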

Relevant System Logs

Nov 25 14:19:00 [2751] WARNING:core:tcpconn_add_alias: possible port hijack attempt
Nov 25 14:19:00 [2751] WARNING:core:tcpconn_add_alias: alias already present and points to another connection (1488095225 : 5060 and 1488095257 : 5060)
Nov 25 14:19:00 [2751] WARNING:core:do_action: tcp alias failed
Nov 25 14:19:00 [2749] ERROR:rest_client:resume_async_http_req: connected, but transfer timed out (5s)
...
Nov 25 14:19:03 [2736] CRITICAL:core:qm_free: freeing already freed pointer, first free: tls.h: os_free(119) - aborting

stale[bot] commented 4 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

bcnewlin commented 4 years ago

After some testing we have found that the crashes only occur when the rest_client module is used in async mode.
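
For reference, the two call forms being compared look roughly like this (URL and variable names are placeholders):

# synchronous: the SIP worker blocks until the HTTPS transfer completes (or times out)
rest_post("https://example.com/callData", $var(req), , $var(resp));

# asynchronous: the transfer is suspended and the script resumes later in route [rest_reply]
async(rest_post("https://example.com/callData", $var(req), , $var(resp)), rest_reply);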

stale[bot] commented 4 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

bogdan-iancu commented 4 years ago

Hi Ben,

Trying to sort out some things here: have you noticed a correlation between TLS and async usage and the crash? I mean, is the crash directly correlated with TLS usage, with async usage, or with both? Does running TLS + sync mode still result in a crash?

bogdan-iancu commented 4 years ago

PS: the pastebin for the backtrace expired :(

bcnewlin commented 4 years ago

@bogdan-iancu I will try to get a new backtrace. The crashes seem to require both SIP TLS (tls_mgm) to be loaded and async rest calls to be used.

bcnewlin commented 4 years ago

Here is a new backtrace and version info.

version: opensips 2.4.6 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, F_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: aa89b8764
main.c compiled on 23:51:06 Jan  7 2020 with gcc 7

core.15464.11.1578509044.txt

bogdan-iancu commented 4 years ago

@bcnewlin, the backtrace shows a crash in the memory manager, indicating, most probably, some memory corruption. First, from the current core file, could you print:

f 4
p rb
p *rb
p rb->buffer

Also, could you recompile with the QM_MALLOC + DBG_MALLOC flags (instead of F_MALLOC)? This will help tremendously in confirming/detecting the memory corruption.

bcnewlin commented 4 years ago

Here is the output of those commands from this core dump:

(gdb) f 4
#4  0x00007fb40b14dd3e in _reply_light (trans=trans@entry=0x7fb41631d450,
    buf=0x7fb45075c378 "SIP/2.0 500 Server Internal Error\r\nTo:  <sip:12287281025@testacdload25.byoc.inintca.com:5061>;tag=8098-42b6033d2d1536267262c7e217474bf8\r\nFrom:  <sip:+994984065678@testacdload25.byoc.inintca.com:5061>;"..., len=431, code=code@entry=500, lock=lock@entry=1, bm=bm@entry=0x7ffde8ca2dc0,
    to_tag_len=37, to_tag=<optimized out>) at t_reply.c:367
367     rb->buffer.s = (char*)shm_resize( rb->buffer.s, buf_len );
(gdb) p rb
$1 = (struct retr_buf *) 0x7fb41631d550
(gdb) p *rb
$2 = {activ_type = 500, buffer = {s = 0x0, len = 0}, dst = {proto = 3, proto_reserved1 = 443263502, to = {s = {sa_family = 2,
        sa_data = "\023\305\022\350\035z\000\000\000\000\000\000\000"}, sin = {sin_family = 2, sin_port = 50451, sin_addr = {s_addr = 2048780306},
        sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 2, sin6_port = 50451, sin6_flowinfo = 2048780306, sin6_addr = {__in6_u = {
            __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}},
    send_sock = 0x7fb450715c80}, retr_timer = {next_tl = 0x0, prev_tl = 0x0, ld_tl = 0x0, time_out = 0, timer_list = 0x0, deleted = 0, set = 2}, fr_timer = {
    next_tl = 0x0, prev_tl = 0x0, ld_tl = 0x0, time_out = 0, timer_list = 0x0, deleted = 0, set = 2}, retr_list = FR_TIMER_LIST, my_T = 0x7fb41631d450,
  branch = 0}
(gdb) p rb->buffer
$3 = {s = 0x0, len = 0}

I will work on reproducing with the debug flags set.

bcnewlin commented 4 years ago

Here is a new core dump with the debug flags compiled in. core.7659.6.1579105943.txt

bogdan-iancu commented 4 years ago

@bcnewlin, thanks for this new backtrace - it indicates a double free triggered from libssl. I suspect it is related to the problems induced by the latest libssl versions. @razvancrainea has a solid fix, still devel/trunk-only, but it will be backported to 3.0 and 2.4 in the next few days, so I would suggest waiting for it and re-running the tests once the fix is applied on 2.4.

bcnewlin commented 4 years ago

Sounds good, I will test again with the backport from #1799 once it is ready.

bcnewlin commented 4 years ago

There is good news and bad news. The good news is that with the commits from #1799 we no longer see crashes when using the rest client in async mode. The bad news is that this is most likely because all of our REST HTTPS connections are failing, both sync and async.

Sync mode:

Jan 16 17:17:30 [7541] ERROR:rest_client:rest_easy_perform: connected, but transfer timed out for https://test-service.com/callData (5s)
Jan 16 17:17:34 [7540] ERROR:rest_client:rest_easy_perform: connected, but transfer timed out for https://test-service.com/callData (5s)

Async mode:

Jan 16 19:20:38 [7471] ERROR:rest_client:start_async_http_req: connect timeout on https://test-service.com/callData (2s)
Jan 16 19:20:39 [7486] ERROR:rest_client:start_async_http_req: connect timeout on https://test-service.com/callData (2s)
Jan 16 19:20:39 [7474] ERROR:rest_client:resume_async_http_req: connected, but transfer timed out (5s)
Jan 16 19:20:41 [7474] ERROR:rest_client:resume_async_http_req: connected, but transfer timed out (5s)

HTTPS connections were working in sync mode prior to the fixes from #1799.

Here is our rest_client config:

#// REST CLIENT module
loadmodule "rest_client.so"
modparam("rest_client", "connection_timeout", 2)
modparam("rest_client", "connect_poll_interval", 20)
modparam("rest_client", "curl_timeout", 5)
modparam("rest_client", "max_async_transfers", 100)
modparam("rest_client", "ssl_verifypeer", 0)
modparam("rest_client", "ssl_verifyhost", 0)

Library versions:

# openssl version
OpenSSL 1.0.2k-fips  26 Jan 2017
# curl --version
curl 7.61.1 (x86_64-koji-linux-gnu) libcurl/7.61.1 OpenSSL/1.0.2k zlib/1.2.7 libidn2/2.3.0 libssh2/1.4.3 nghttp2/1.39.2
Release-Date: 2018-09-05
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp
Features: AsynchDNS IDN IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz HTTP2 UnixSockets HTTPS-proxy Metalink

SIP TLS connections appear to be working correctly.

razvancrainea commented 4 years ago

I've just run a quick test and it seems to work fine. Are you sure the server at https://test-service.com/callData is up and accepting connections? Can you try a different HTTPS server? Finally, could you capture a tcpdump?

bcnewlin commented 4 years ago

Yes, the service is up and accepting connections. The connections were working right up until the moment we switched to the newest commits on 2.4, and they began working again as soon as we reverted back to a previous commit (8179e1f8d9d580ca3c4d1644cc225bfd0739bd03), with no changes to the remote server during that time.

I will try to collect a tcpdump.

bcnewlin commented 4 years ago

We are still investigating, but for now it looks like these errors might be related to deploying any new version rather than this specific version. We will continue to investigate; apologies if this turns out to be a false alarm.

bogdan-iancu commented 4 years ago

@bcnewlin , please keep me updated here.

bcnewlin commented 4 years ago

@bogdan-iancu I've found a few things since my last update.

The timing of the errors is interesting. The nodes came up configured to use synchronous rest_post calls, and failures lasted from startup for about a minute:

Jan 16 18:22:01 [7430] INFO:rest_client:mod_init: Module initialized!
Jan 16 18:23:18 [7525] ERROR:rest_client:rest_easy_perform: connected, but transfer timed out for https://test-service.com/callData (5s)
Jan 16 18:23:23 [7524] ERROR:rest_client:rest_easy_perform: connected, but transfer timed out for https://test-service.com/callData (5s)
...
Jan 16 18:24:25 [7522] ERROR:rest_client:rest_easy_perform: connected, but transfer timed out for https://test-service.com/callData (5s)
Jan 16 18:24:26 [7526] ERROR:rest_client:rest_easy_perform: connected, but transfer timed out for https://test-service.com/callData (5s)

Then about an hour later we switched on a toggle to use async rest_post and got another burst of errors. This one lasted about 3 minutes.

Jan 16 19:18:44 [7525] ERROR:rest_client:resume_async_http_req: connected, but transfer timed out (5s)
Jan 16 19:18:47 [7524] ERROR:rest_client:start_async_http_req: connect timeout on https://test-service.com/callData (2s)
Jan 16 19:18:47 [7568] ERROR:rest_client:start_async_http_req: connect timeout on https://test-service.com/callData (2s)
...
Jan 16 19:21:43 [7572] ERROR:rest_client:start_async_http_req: connect timeout on https://test-service.com/callData (2s)
Jan 16 19:21:44 [7572] ERROR:rest_client:resume_async_http_req: connected, but transfer timed out (5s)

Based on this, it would seem there is some startup issue or conflict with tls_mgm that causes connections to fail for a while when the module is first used in either mode.

I am still attempting to get a tcpdump but the timing makes it difficult. I will let you know when I have one.

bcnewlin commented 4 years ago

So it looks like, on startup or on switching to async mode, OpenSIPS opens a large number of connections all at once. Perhaps the per-process connection reuse does not kick in until a connection is established, or until it has been used successfully once? Either way, we see far more connections being opened than one per OpenSIPS process. This is overwhelming our downstream server's ALB. Once that all dies down, everything works fine. It's not clear to me what in the TLS-related commits would have changed this behavior.

Here is the output of a command running from just before startup and querying the number of TCP connections OpenSIPS has open to our downstream service every 5 seconds:

# while true; do netstat -tnp | grep 10.30.1 | grep opensips | wc -l; sleep 5s; done
0
0
8
20
26
41
46
66
85
97
104
116
126
127
135
147
150
161
161
166
170
180
181
180
179
175
163
154
135
120
111
91
75
59
52
45
42
40
35
35
34
34
34
34

I've also captured a pcap, though it does not show much except a large number of connections timing out. There are very few protocol related errors. I can provide the PCAP privately if you wish.

Also, looking at the number of incoming connections on the downstream service's ALB confirms the large spike in connections. Each spike here is the startup of a new OpenSIPS instance: [screenshot "Screen Shot 2020-01-22 at 3 56 30 PM": ALB incoming connection count over time]

$ opensips -V
version: opensips 2.4.6 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, QM_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: a83750d3c
main.c compiled on 17:15:47 Jan 22 2020 with gcc 7

Please let me know if more information is needed.

bcnewlin commented 4 years ago

I'm trying to determine whether there is a bottleneck in the ALB, but I don't see one. It is possible the connections are timing out because OpenSIPS itself doesn't have the resources to properly manage so many connections at once.

bcnewlin commented 4 years ago

Also, since the crashes are no longer occurring we can move this investigation to another ticket if you'd like.

razvancrainea commented 4 years ago

@bcnewlin we can continue on this ticket, that's fine. Do you think the large number of connections is bringing down the ALB server?

Are those connections generated by traffic, or are they created just out of the blue? If they are generated by traffic, do they more-or-less correlate with the number of requests?

Note that the rest_client module does not do connection sharing for parallel queries, because the HTTP protocol does not allow it - once an HTTP request is sent, we have to wait for the reply on that connection. Therefore, if you're running simultaneous async queries, each of them will create a new connection until a reply is received. After that point, the connection can be reused for further queries.

Even in sync mode, each OpenSIPS process will open a different connection to the server - this way we can ensure parallelism.

bcnewlin commented 4 years ago

"Do you think that the large number of connections are putting down the ALB server?" I don't see any evidence that the problem is on the ALB or on the terminating server behind it. It seems right now that OpenSIPS is not able to handle so many open connections along with SIP traffic and so they are "timing out" because of that. I have seen similar behavior with SIP UDP when the network buffer gets long. Because the timer threads are independent they will pop and trigger a "timeout" even when the response was received but has not yet been processed by OpenSIPS due to backlog. I believe something similar is happening here. I am not sure how to prove that though.

"Are those connections generated by traffic, or are they created just out of the blue? If they are generated by traffic, do they more-or-less correlate with the number of requests?" They are generated by traffic, but do not correlate directly with the traffic as the traffic level is constant. After OpenSIPS starts traffic is directed to it at a relatively constant rate. The large number of connections being opened and then "recovering" occurs during this steady traffic.

"Even in sync mode, each OpenSIPS process will open a different connection to the server - this way we can ensure parallelism." For the last few updates I've been looking at the problem only in sync mode to reduce the number of variables. As you say, the expectation is that each OpenSIPS process will open a connection to the server - we have our OpenSIPS configured to run 32 processes (16 UDP, 16 TCP), so the question is how do we end up with over 180 connections to the remote server?

bogdan-iancu commented 4 years ago

@bcnewlin, the rest_client will try to reuse any available (open and free) existing connection. If there is no such connection (none exists, or all existing ones are already busy with an async query), a new connection is created - this logic is per process (each process has its own pool of connections). And how many connections can be created? See https://opensips.org/html/docs/modules/2.4.x/rest_client.html#param_max_async_transfers . It is fair to expect that burst of connects when there are no available connections: each new call will (most probably) generate an HTTP request, so there is a high rate of HTTP requests as well, and setting up an HTTP connection is probably slower than both the calling rate and reusing an existing connection. So, at a moment when there are no available connections, each SIP call will translate into a new HTTP connect, as there is no opportunity for connection reuse (the previous connect attempts have not yet completed). You can throttle this with the above parameter (to prevent so many connects), but that will slow down OpenSIPS, as the extra HTTP transfers will be handled in sync mode.
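
As an illustration of the throttling described above (the value is arbitrary, not a recommendation):

# lower the per-process cap on parallel async transfers; once the cap is reached,
# additional rest_* calls are handled in blocking (sync) mode
modparam("rest_client", "max_async_transfers", 30)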

bcnewlin commented 4 years ago

@bogdan-iancu The connection handling you describe makes sense and is what I expect for async mode. We are aware of the parameter you provided and have it set currently to the default of 100.

However, as I stated, we are seeing this problem when making the REST calls in sync mode. In this mode, the number of open connections should be limited to the number of OpenSIPS processes. We are running with 32 processes (16 each for UDP/TCP), yet OpenSIPS opened over 180 connections to the remote server. How is that possible?

bogdan-iancu commented 4 years ago

Wow, the same happens in sync mode too? Let me ask @liviuchircu, the author of the module, if he can drop some lines here.

liviuchircu commented 4 years ago

@bcnewlin can you confirm netstat shows each of those 180 connections as ESTAB (established)? One theory is that the remote server can only handle 1-2 established connections at a time (maybe you're doing HTTPS, so the TLS handshake has its costs), while the remaining connection attempts are timing out and shifting to CLOSE-WAIT state, temporarily upping your OS's "total connections" count to 180, until everything stabilizes.

bcnewlin commented 4 years ago

The remote server can certainly handle more connections than should be opened. There is no evidence that the connections are being rejected on the server end, but I have continued to investigate that to be sure.

However, that is a great point and I will reproduce the issue and use netstat instead of lsof to see the state of the connections.

bcnewlin commented 4 years ago

Just an update, I attempted reproducing this and was unsuccessful. Going to try a few more times.

bcnewlin commented 4 years ago

OK, I was just able to reproduce it, and I verified that the majority of the connections are in ESTABLISHED state - far more than should be open.

# netstat -t | grep :https | grep ip-10-30
tcp        0      0 ip-10-32-18-155.e:57934 ip-10-30-187-185.:https ESTABLISHED
tcp        1      0 ip-10-32-18-155.e:57726 ip-10-30-187-185.:https CLOSE_WAIT
tcp        0      0 ip-10-32-18-155.e:57750 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57936 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33486 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33506 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33550 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57912 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33714 ip-10-30-199-69.e:https ESTABLISHED
tcp        1      0 ip-10-32-18-155.e:33470 ip-10-30-199-69.e:https CLOSE_WAIT
tcp        0      0 ip-10-32-18-155.e:57958 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57888 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57938 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33704 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33660 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57792 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33766 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33532 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33568 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57996 ip-10-30-187-185.:https ESTABLISHED
tcp        1      0 ip-10-32-18-155.e:57704 ip-10-30-187-185.:https CLOSE_WAIT
tcp        0      0 ip-10-32-18-155.e:57980 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57952 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33584 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33518 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:58004 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57844 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33674 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33616 ip-10-30-199-69.e:https ESTABLISHED
tcp        1      0 ip-10-32-18-155.e:57694 ip-10-30-187-185.:https CLOSE_WAIT
tcp        0      0 ip-10-32-18-155.e:33638 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57756 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:58010 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57438 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57850 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57926 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33174 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33672 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33574 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33690 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57772 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33508 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57738 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57838 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57742 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33540 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33482 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33702 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33502 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57846 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33650 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:58000 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33748 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57988 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57796 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33474 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33598 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57864 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57802 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57946 ip-10-30-187-185.:https ESTABLISHED
tcp        1      0 ip-10-32-18-155.e:57720 ip-10-30-187-185.:https CLOSE_WAIT
tcp        0      0 ip-10-32-18-155.e:33510 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57866 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57822 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57748 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33546 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33566 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33666 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57810 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57734 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33720 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57916 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33590 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57870 ip-10-30-187-185.:https ESTABLISHED
tcp        1      0 ip-10-32-18-155.e:33466 ip-10-30-199-69.e:https CLOSE_WAIT
tcp        0      0 ip-10-32-18-155.e:33600 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33594 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57872 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33576 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57426 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33744 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57950 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:58014 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33618 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33740 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57782 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57754 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57746 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33652 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33504 ip-10-30-199-69.e:https ESTABLISHED
tcp        1      0 ip-10-32-18-155.e:33454 ip-10-30-199-69.e:https CLOSE_WAIT
tcp        0      0 ip-10-32-18-155.e:57430 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:58018 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33170 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57868 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57992 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33478 ip-10-30-199-69.e:https ESTABLISHED
tcp        1      0 ip-10-32-18-155.e:33472 ip-10-30-199-69.e:https CLOSE_WAIT
tcp        0      0 ip-10-32-18-155.e:33636 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57902 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33716 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33728 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57778 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57970 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:58008 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57940 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33562 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57862 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33582 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57806 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57878 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57818 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57928 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57976 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33632 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33724 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33762 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33664 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33758 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33626 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57758 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57752 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57830 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33554 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33602 ip-10-30-199-69.e:https ESTABLISHED
tcp        1      0 ip-10-32-18-155.e:57718 ip-10-30-187-185.:https CLOSE_WAIT
tcp        0      0 ip-10-32-18-155.e:33736 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57984 ip-10-30-187-185.:https ESTABLISHED
tcp        1      0 ip-10-32-18-155.e:33456 ip-10-30-199-69.e:https CLOSE_WAIT
tcp        0      0 ip-10-32-18-155.e:33596 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57794 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57904 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33558 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33684 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57954 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33732 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33648 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33556 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33542 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33512 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33754 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33634 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57828 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33640 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57836 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57900 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57920 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33516 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:57780 ip-10-30-187-185.:https ESTABLISHED
tcp        0      0 ip-10-32-18-155.e:33686 ip-10-30-199-69.e:https ESTABLISHED
tcp        0      0 ip-10-32-18-:snip-slave ip-10-30-199-69.e:https ESTABLISHED

# netstat -t | grep :https | grep ip-10-30 | grep ESTABLISHED | wc -l
141

# netstat -t | grep :https | grep ip-10-30 | grep -v ESTABLISHED | wc -l
8

I'm not sure whether the rest_client is able to reuse connections based on the logical (DNS) target or only on the IP connection info. If it reuses by DNS name, the maximum number of open connections in sync mode should be the number of SIP processes, which is 32 in our case. If it reuses by IP, there are 2 IPs for this FQDN, so I would expect 64 to be the maximum. In this reproduction there were 141 established connections when I ran the command.

liviuchircu commented 4 years ago

If we process the output a bit:

$ cat /tmp/o | grep ESTAB | awk '{print $5}' | sort | grep 199 | wc
     71      71    1704
$ cat /tmp/o | grep ESTAB | awk '{print $5}' | sort | grep 187 | wc
     70      70    1680

you can see that there is some sort of DNS SRV balancing happening, where each of the .199 and .187 IP addresses receives an equal share of the resolves (roughly 70 / 140, 50%). So this boils down to how libcurl's connection reuse mechanism works, since rest_client includes no TCP connection reuse code of its own. As a very rough assumption, it seems to look up an existing connection by its target IP address, not by its hostname.

Also, could you paste the output of opensipsctl fifo ps? Those processes may sometimes further multiply due to various conditions (extra interface, timers, default TCP settings, etc).

bcnewlin commented 4 years ago

@liviuchircu Yes, the record is balancing between the two IP addresses as I said. I am a bit confused now as @bogdan-iancu has confirmed in a previous comment that the module itself was performing connection sharing on a per-process basis. You are saying that is not the case? That it is entirely up to libcurl to manage connection re-use? Is that still per-process or is it for all processes?

If so, since the calls are being performed in sync mode, OpenSIPS still should not be able to call libcurl more than 32 times concurrently, so it doesn't make sense that libcurl could open so many connections. Even if it opened two per call, in case of DNS failover between the two IP records, that would still be significantly fewer than the number actually being opened. So it seems that, if libcurl is responsible for the connection reuse, it is not working properly during the initial startup scenario.

I can get the opensipsctl fifo ps, however we are only ever using the rest_client during call processing (from request_route), so only the SIP listener processes should be able to open these connections.

bcnewlin commented 4 years ago

I should add that after this initial burst the number of connections always eventually reverts to 32, the same as the number of processes, so it would appear that whatever is doing the connection reuse treats the DNS name as a single target, despite the IP balancing.

liviuchircu commented 4 years ago

@bcnewlin I dug a bit for any similar ongoing issues (here is an identical one, which was fixed), but without much luck. Your 7.61 libcurl should include the fix from my linked thread.

As a next step, I would try to isolate the problem. Some ideas:

Thanks!

liviuchircu commented 4 years ago

Thank you for the pcap. The timeline is as follows:

So it seems like this issue is strongly tied to TLS. I need to understand more about the issue (maybe attempt a local reproduce) before I can give more advice.

stale[bot] commented 4 years ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

stale[bot] commented 4 years ago

Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.

bcnewlin commented 4 years ago

@liviuchircu Have you had any opportunity to investigate? Do you want to move this to a separate ticket?

liviuchircu commented 4 years ago

Sorry for the dead time, @bcnewlin, I lost track of this thread.

Do you want me to invest some time in attempting to reproduce this "slow TLS handshake during the first 60s after an OpenSIPS restart" issue, or would you like to do more localized tests (perhaps a basic setup: rest_client + HTTPS server) in an attempt to gain more information on the issue at hand? Useful questions to answer for now:

Since the code seems to work after some time, I'm having a hard time blaming a certain module, library or piece of code off the top of my head. Let me know how you would like to proceed. Thank you!

bcnewlin commented 4 years ago

I certainly would like to do some localized testing with a more basic client, however I don't know for sure when I will be able to do so. Things have gotten very busy these days. Any time you have to try to reproduce would be extremely helpful.

My biggest concern is that this appears to be TLS related, so it may not affect only the rest_client. Having huge numbers of HTTPS connections fail on startup is an annoyance; if the same thing happens with SIP TLS connections, that would be very bad.

liviuchircu commented 4 years ago

@bcnewlin I invested some time into this yesterday, but I have not been able to reproduce it yet. I enabled tls_mgm, then ran up to 500 CPS through the proxy in an attempt to get the TLS connections to set up slowly towards an SSL-enabled Apache server following an OpenSIPS restart. Everything went well, despite sudden restarts, with all async POST requests flowing well. Here is the opensips.cfg I was using:

log_level = 3
log_stderror = yes
log_facility = LOG_LOCAL0

memdump = 2
udp_workers = 10

auto_aliases = no
enable_asserts = true
abort_on_assert = true
max_while_loops = 10000
xlog_buf_size = 1048576

listen = udp:localhost:5060
listen = tls:localhost:5061

####### Modules Section ########

mpath = "modules/"

loadmodule "mi_fifo.so"
modparam("mi_fifo", "fifo_mode", 0666)

loadmodule "httpd.so"
loadmodule "mi_http.so"

loadmodule "proto_udp.so"
loadmodule "proto_tls.so"

loadmodule "rest_client.so"
modparam("rest_client", "ssl_verifypeer", 0)
modparam("rest_client", "ssl_verifyhost", 0)

loadmodule "tm.so"
loadmodule "tls_mgm.so"

route {
    $var(send) = "";
    async(rest_post("https://192.168.0.203", $var(send), , $var(out)), resume_apache);
}

route [resume_apache]
{
    xlog("XXX received page of size [$(var(out){s.len})]\n");
}

balajeesv commented 4 years ago

@liviuchircu Could you please share how much memory the worker processes consumed in your last test with 500 CPS, and whether the consumed memory came back down at some point?

bcnewlin commented 4 years ago

@liviuchircu Thank you for the info. I will try to reproduce again on my end to see if it is still occurring.