OpenSIPS / opensips

OpenSIPS is a GPL implementation of a multi-functionality SIP Server that targets to deliver a high-level technical solution (performance, security and quality) to be used in professional SIP server platforms.
https://opensips.org
Other
1.29k stars 582 forks source link

[CRASH] Frequent crashes in rest_client (openssl error functions) - fix available #3388

Open meengu opened 6 months ago

meengu commented 6 months ago

OpenSIPS version you are running

version: opensips 3.4.5 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: ffdb1b473
main.c compiled on  with cc 10

openssl        1.1.1w-0+deb11u1 amd64
libcurl4:amd64 7.74.0-1.3+deb11u11

Crash Core Dump

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fcf98a58537 in __GI_abort () at abort.c:79
#2  0x00005652d01e1cf7 in fm_free (fm=<optimized out>, p=<optimized out>, file=<optimized out>, func=<optimized out>, line=<optimized out>)
    at mem/f_malloc_dyn.h:240
#3  0x00007fcf93a17f75 in _shm_free (line=935, function=<synthetic pointer>, file=0x7fcf938e2a44 "../crypto/err/err.c", ptr=0x7fcf951c7828)
    at ../../mem/shm_mem.h:549
#4  os_free (ptr=0x7fcf951c7828, file=0x7fcf938e2a44 "../crypto/err/err.c", line=935) at openssl_helpers.h:119
#5  0x00007fcf937fd3b4 in ERR_pop_to_mark () at ../crypto/err/err.c:935
#6  0x00007fcf938a33e5 in pubkey_cb (pval=<optimized out>, pval=<optimized out>, operation=5, it=0x7fcf93967ca0 <X509_PUBKEY_it>, exarg=0x0)
    at ../crypto/x509/x_pubkey.c:48
#7  pubkey_cb (operation=operation@entry=5, pval=pval@entry=0x7fcf954871b0, it=it@entry=0x7fcf93967ca0 <X509_PUBKEY_it>, exarg=exarg@entry=0x0)
    at ../crypto/x509/x_pubkey.c:29
#8  0x00007fcf9373d689 in asn1_item_embed_d2i (pval=pval@entry=0x7fcf954871b0, in=0x7ffd2cd11478, len=<optimized out>, it=<optimized out>,
    tag=<optimized out>, tag@entry=-1, aclass=<optimized out>, aclass@entry=0, opt=0 '\000', ctx=0x7ffd2cd11840, depth=3) at ../crypto/asn1/tasn_dec.c:432
#9  0x00007fcf9373dbb6 in asn1_template_noexp_d2i (val=0x7fcf954871b0, in=0x7ffd2cd115b0, len=<optimized out>,
    tt=tt@entry=0x7fcf93971790 <X509_CINF_seq_tt+240>, opt=<optimized out>, ctx=0x7ffd2cd11840, depth=2) at ../crypto/asn1/tasn_dec.c:643
#10 0x00007fcf9373de21 in asn1_template_ex_d2i (val=val@entry=0x7fcf954871b0, in=in@entry=0x7ffd2cd115b0, inlen=<optimized out>,
    tt=tt@entry=0x7fcf93971790 <X509_CINF_seq_tt+240>, opt=<optimized out>, ctx=ctx@entry=0x7ffd2cd11840, depth=2) at ../crypto/asn1/tasn_dec.c:518
#11 0x00007fcf9373d4c8 in asn1_item_embed_d2i (pval=pval@entry=0x7ffd2cd11640, in=0x7ffd2cd11648, len=<optimized out>, it=0x7fcf93967e60 <X509_CINF_it>,
    tag=<optimized out>, tag@entry=-1, aclass=<optimized out>, aclass@entry=0, opt=0 '\000', ctx=0x7ffd2cd11840, depth=2) at ../crypto/asn1/tasn_dec.c:382
#12 0x00007fcf9373dbb6 in asn1_template_noexp_d2i (val=0x7ffd2cd11640, in=0x7ffd2cd11780, len=<optimized out>, tt=tt@entry=0x7fcf93971620 <X509_seq_tt>,
    opt=<optimized out>, ctx=0x7ffd2cd11840, depth=1) at ../crypto/asn1/tasn_dec.c:643
#13 0x00007fcf9373de21 in asn1_template_ex_d2i (val=val@entry=0x7fcf95487160, in=in@entry=0x7ffd2cd11780, inlen=<optimized out>,
    tt=tt@entry=0x7fcf93971620 <X509_seq_tt>, opt=<optimized out>, ctx=ctx@entry=0x7ffd2cd11840, depth=1) at ../crypto/asn1/tasn_dec.c:518
#14 0x00007fcf9373d4c8 in asn1_item_embed_d2i (pval=pval@entry=0x7fcf95b8c868, in=0x7ffd2cd11900, len=<optimized out>, it=it@entry=0x7fcf93967de0 <X509_it>,
    tag=<optimized out>, tag@entry=-1, aclass=<optimized out>, aclass@entry=0, opt=0 '\000', ctx=0x7ffd2cd11840, depth=1) at ../crypto/asn1/tasn_dec.c:382
#15 0x00007fcf9373df4e in ASN1_item_ex_d2i (pval=pval@entry=0x7fcf95b8c868, in=<optimized out>, len=<optimized out>, it=0x7fcf93967de0 <X509_it>,
    tag=tag@entry=-1, aclass=aclass@entry=0, opt=0 '\000', ctx=0x7ffd2cd11840) at ../crypto/asn1/tasn_dec.c:124
#16 0x00007fcf9373dfcb in ASN1_item_d2i (pval=0x7fcf95b8c868, in=<optimized out>, len=<optimized out>, it=<optimized out>) at ../crypto/asn1/tasn_dec.c:114
#17 0x00007fcf93831bac in PEM_X509_INFO_read_bio (bp=bp@entry=0x7fcf952d8540, sk=sk@entry=0x0, cb=cb@entry=0x0, u=u@entry=0x7fcf938f2a79)
    at ../crypto/pem/pem_info.c:195
#18 0x00007fcf93893093 in X509_load_cert_crl_file (type=1, file=<optimized out>, ctx=0x7fcf94f35ea0) at ../crypto/x509/by_file.c:202
#19 X509_load_cert_crl_file (ctx=0x7fcf94f35ea0, file=<optimized out>, type=<optimized out>) at ../crypto/x509/by_file.c:188
#20 0x00007fcf938931e2 in by_file_ctrl (ctx=0x7fcf94f35ea0, cmd=<optimized out>, argp=<optimized out>, argl=<optimized out>, ret=<optimized out>)
    at ../crypto/x509/by_file.c:64
#21 0x00007fcf93895faf in X509_STORE_load_locations (ctx=0x7fcf954b03f8, file=file@entry=0x5652d1112d90 "/etc/ssl/certs/ca-certificates.crt",
    path=path@entry=0x5652d10e4bb0 "/etc/ssl/certs") at ../crypto/x509/x509_d2.c:44
#22 0x00007fcf939ba3e9 in SSL_CTX_load_verify_locations (ctx=<optimized out>, CAfile=CAfile@entry=0x5652d1112d90 "/etc/ssl/certs/ca-certificates.crt",
    CApath=CApath@entry=0x5652d10e4bb0 "/etc/ssl/certs") at ../ssl/ssl_lib.c:4197
#23 0x00007fcf92b4fa03 in ossl_connect_step1 (conn=conn@entry=0x5652d11255d0, sockindex=sockindex@entry=0) at vtls/openssl.c:3065
#24 0x00007fcf92b50f4f in ossl_connect_common (conn=0x5652d11255d0, sockindex=0, nonblocking=true, done=0x7ffd2cd11d68) at vtls/openssl.c:4041
#25 0x00007fcf92b52011 in Curl_ssl_connect_nonblocking (conn=0x5652d11255d0, sockindex=0, done=0x7ffd2cd11d68) at vtls/vtls.c:362
#26 0x00007fcf92b11815 in https_connecting (done=<optimized out>, conn=<optimized out>) at http.c:1506
#27 Curl_http_connect (done=<optimized out>, conn=<optimized out>) at http.c:1433
#28 Curl_http_connect (conn=0x5652d11255d0, done=0x7ffd24 0x00007fcf92b50f4f in ossl_connect_common (conn=0x5652d11255d0, sockindex=0, nonblocking=true, done=0
#25 0x00007fcf92b52011 in Curl_ssl_connect_nonblocking (conn=0x5652d11255d0, sockindex=0, done=0x7ffd2cd11d68) at vtls/vtls.c:362
#26 0x00007fcf92b11815 in https_connecting (done=<optimized out>, conn=<optimized out>) at http.c:1506
#27 Curl_http_connect (done=<optimized out>, conn=<optimized out>) at http.c:1433
#28 Curl_http_connect (conn=0x5652d11255d0, done=0x7ffd0b960) at easy.c:683
#34 curl_easy_perform (data=data@entry=0x5652d110b960) at easy.c:702
#35 0x00007fcf92b8f86c in rest_easy_perform (handle=0x5652d110b960, url=url@entry=0x7fcf962e4230 "https://pnproxy.bc4.cz/v1/push-message",
    out_http_rc=out_http_rc@entry=0x7ffd2cd12088) at rest_methods.c:649
#36 0x00007fcf92b92e78 in rest_sync_transfer (method=method@entry=REST_CLIENT_POST, msg=msg@entry=0x7fcf962e27b0,
    url=url@entry=0x7fcf962e4230 "https://pnproxy.bc4.cz/v1/push-message", body=body@entry=0x7ffd2cd12fd0, ctype=ctype@entry=0x7ffd2cd12e00,
    body_pv=body_pv@entry=0x7fcf9626c1d0, ctype_pv=0x7fcf9626c3d0, code_pv=0x7fcf9626c5d0) at rest_methods.c:740
#37 0x00007fcf92b8c2f9 in w_rest_post (msg=0x7fcf962e27b0, url=<optimized out>, body=0x7ffd2cd12fd0, _ctype=0x7ffd2cd12fe8, body_pv=0x7fcf9626c1d0,
    ctype_pv=0x7fcf9626c3d0, code_pv=0x7fcf9626c5d0) at rest_client.c:559
#38 0x00005652d01e8525 in do_action (a=0x7fcf9626c718, msg=0x7fcf962e27b0) at action.c:1048
#39 0x00005652d01ea100 in run_action_list (a=a@entry=0x7fcf962676f0, msg=msg@entry=0x7fcf962e27b0) at action.c:190
#40 0x00005652d01ea613 in run_actions (msg=0x7fcf962e27b0, a=0x7fcf962676f0) at action.c:136
#41 run_top_route (sr=..., msg=msg@entry=0x7fcf962e27b0) at action.c:250
#42 0x00007fcf92be5db3 in route_run (route=..., msg=0x7fcf962e27b0, params=<optimized out>, event=<optimized out>) at event_route.c:346
#43 0x00007fcf92be62a3 in route_received (sender=2, param=0x7fcf956ce8e0) at route_send.c:144
#44 0x00005652d02178e9 in ipc_handle_job (fd=<optimized out>) at ipc.c:304
#45 0x00005652d03705a8 in handle_io (fm=0x7fcf962a2c38, idx=1, event_type=1) at net/net_tcp_proc.c:217
#46 0x00005652d0371d11 in io_wait_loop_epoll (h=<optimized out>, t=<optimized out>, repeat=<optimized out>) at net/../io_wait_loop.h:305
#47 tcp_worker_proc_loop () at net/net_tcp_proc.c:442
#48 0x00005652d036a406 in tcp_start_processes (chd_rank=chd_rank@entry=0x5652d04d9a58 <chd_rank>, startup_done=startup_done@entry=0x0) at net/net_tcp.c:2119
#49 0x00005652d01e4986 in main_loop () at main.c:243
#50 main (argc=<optimized out>, argv=<optimized out>) at main.c:966

What is interesting is that the crash is in err_clear():

925 int ERR_pop_to_mark(void)
926 {
927     ERR_STATE *es;
928
929     es = ERR_get_state();
930     if (es == NULL)
931         return 0;
932
933     while (es->bottom != es->top
934            && (es->err_flags[es->top] & ERR_FLAG_MARK) == 0) {
935         err_clear(es, es->top);
936         es->top = es->top > 0 ? es->top - 1 : ERR_NUM_ERRORS - 1;
937     }
938
939     if (es->bottom == es->top)
940         return 0;
941     es->err_flags[es->top] &= ~ERR_FLAG_MARK;
942     return 1;
943 }

But the code execution should not get there because printing the content of *es returns that top==bottom:

(gdb) frame 5
#5  0x00007fcf937fd3b4 in ERR_pop_to_mark () at ../crypto/err/err.c:935

(gdb) set print pretty on
(gdb) p *es
$3 = {
  err_flags = {0 <repeats 16 times>},
  err_buffer = {0 <repeats 16 times>},
  err_data = {0x0 <repeats 16 times>},
  err_data_flags = {0 <repeats 16 times>},
  err_file = {0x0 <repeats 16 times>},
  err_line = {-1 <repeats 16 times>},
  top = 0,
  bottom = 0
}

Which leads me to an idea that this is a race condition.

Looks like that opensips first initializes modules (including tls_openssl/OpenSSL in general) and then forks TCP worker processes. Initialization of modules will make openssl create ERR_STATE struct in shared memory, and this will get then shared by all workers, sometimes leading to race conditions due to absent locking when accessing the now shared struct. I don't think OPENSSL expects this data to be shared among threads/processes. Manpage of ERR_pop_to_mark indicates that it's normally a per-thread data structure.

I found a similar issue here: https://github.com/kamailio/kamailio/issues/3695

Due to this being a race condition it only happens in specific usage scenarios (either high load, or in my case due to issuing several HTTP requests via rest_client in parallel from multiple workers - push notification being sent to multiple mobile devices at once)

meengu commented 6 months ago

I've attached gdb to a few worker processes and did:

p (unsigned long long)ERR_get_state()

And it returns the same error struct pointer value for each process. This time I tried it with OpenSSL 3.3.0 9 Apr 2024 (Library: OpenSSL 3.3.0 9 Apr 2024)

meengu commented 6 months ago

Also related https://github.com/OpenSIPS/opensips/issues/3284

hamid-elaosta commented 6 months ago

Which leads me to an idea that this is a race condition.

My observations tend to agree with this. In my testing, I've found that increasing logging to 4 or 5 makes it less likely to crash than if logging is set to -1 (my default).

My related issue: https://github.com/OpenSIPS/opensips/issues/3280

meengu commented 6 months ago

I solved it with the following patch for now: https://megous.com/dl/tmp/0001-Fix-openssl-TLS-data-corruption-in-shared-memory-by-.patch

It's a bit less hacky and conceptually simpler than the solution in kamailio [1] because it uses explicit re-initialization of the shared data based on pointer values. And I did not put much thought into openssl version check, because we use just two openssl versions, but so far no issues.

[1] https://github.com/kamailio/kamailio/commit/77de86909ac8c156335e1e789bb3067932f9eff5

hamid-elaosta commented 6 months ago

Thanks for the patch and the link; I wonder how this would affect my issue since I'm using OpenSSL 3.0.2 and the patch is written for lower versions.

Unfortunately we're not currently compiling from source but use the OpenSIPs apt, it'll take several days to retool for this and test the patch but given the lack of response on this issue (I raised my ticket in January) we may be forced to go that route imminently.

meengu commented 6 months ago

Thanks for the patch and the link; I wonder how this would affect my issue since I'm using OpenSSL 3.0.2 and the patch is written for lower versions.

What makes you think so? I tested it with 3.3.0 and 1.1.1w. I have no idea about 3.0.2.

hamid-elaosta commented 6 months ago

My mistake. I glanced at the patch and saw the version number check, but didn't scroll to see the else statement.

As for the reason I'm using OpenSSL 3.0.x, that's just the current version in our base OS which is Ubuntu 22.04.

Have you tested the 3.5.x beta at all? I wonder if any progress on this issue has been made there?

meengu commented 6 months ago

I didn't notice anything that would help this in 3.5

hamid-elaosta commented 6 months ago

That's a shame. I'm going to start tooling up to build from source and take a look at applying your patch. Thanks for the info!

github-actions[bot] commented 6 months ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

meengu commented 6 months ago

Yes, the analysis and a proposed patch is available above. This issue is not stale.

github-actions[bot] commented 5 months ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

meengu commented 5 months ago

Yes, the analysis and a proposed patch is available above. This issue is not stale.

github-actions[bot] commented 4 months ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

meengu commented 4 months ago

Yes, the analysis and a proposed patch is available above. This issue is not stale.

meengu commented 4 months ago

I will not be keeping the stale bot at bay, so that it doesn't kill the github issue until someone applies the proposed patch. I'm unsubscribing from this issue.

github-actions[bot] commented 4 months ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

github-actions[bot] commented 3 months ago

Marking as closed due to lack of progress for more than 30 days. If this issue is still relevant, please re-open it with additional details.