OpenSIPS / opensips

OpenSIPS is a GPL implementation of a multi-functionality SIP Server that targets to deliver a high-level technical solution (performance, security and quality) to be used in professional SIP server platforms.
https://opensips.org
Other
1.22k stars 565 forks source link

[BUG] Race condition in mid_registrar when processing re-REGISTER from Push Notification - fix available #3394

Open ohnoov opened 1 month ago

ohnoov commented 1 month ago

OpenSIPS version you are running

version: opensips 3.4.5 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: ffdb1b473
main.c compiled on  with cc 10

Describe the bug

To Reproduce

I don't think so. It's a complex setup that requires a mobile application, some push notification setup, etc. Even I can't reproduce it simply and reliably, because it requires re-REGISTER to come at the same time from multiple devices.

I'll try to eye-ball the code, but I guess it will be some lack of or incorrect locking over access to shared data (maybe usrloc).

Expected behavior

All INVITEs being sent to proper contacts.

OS/environment information

Probably irrelevant, but Debian 12 + manual opensips build.

ohnoov commented 1 month ago

Looks like the absent locking (or more like unbalanced and broken) is in this function for the case where reg_mode is not MID_REG_THROTTLE_AOR.

https://github.com/OpenSIPS/opensips/blob/master/modules/mid_registrar/lookup.c#L50

There's ul.unlock_udomain call at the end, but no corresponding call to ul.lock_udomain before the function starts accessing usrloc records for the domain.

https://github.com/OpenSIPS/opensips/blob/master/modules/mid_registrar/lookup.c#L148C2-L148C29

I'll try adding ul.lock_udomain somewhere in there to see if it will help.

ohnoov commented 1 month ago

Hm, get_ucontact_from_id keeps the AoR locked if it finds something. So it's not this.

ohnoov commented 1 month ago

A few more observations:

Eg. in mirror mode, I perform simultaneous calls to 6 contacts with usernames of test0 - test5 and I get in the above mentioned functions this output from a particular worker:

tm:t_inject_branch: t_inject_branch: 
     callid=Call-ID: e0bb98dcf3fdea880a59aad80
     to=To: <sip:test4@127.0.0.1:5067;ctid=3518437208893134>

tm:ul_contact_event_to_msg: injecting new branch:
    uri=<sip:test5@127.0.0.1:7805;+sip.pnsreg>, received=<sip:127.0.0.1:7805>,path=<>, qval=-1, socket=<udp:127.0.0.1:7676>, bflags...

In t_inject_branch I added a log statement for t->to. I'd expect the username in t->to to match the username in R-URI selected for the new branch.

So somehow, sometimes when using multiple workers, wrong list of AVPs are selected when t_inject_branch is called.

ohnoov commented 1 month ago

Allright, this patch fixes the issue. The cause is described in the patch.

https://megous.com/dl/tmp/0001-Prevent-overlapping-modifications-of-pn_ebr_filters-.patch

ohnoov commented 1 month ago

There are other options, like allocating the template per worker, etc.

ohnoov commented 1 month ago

I'll not make pull request because github would force me into 2FA, which I dislike, so please either use the patch above, or the above suggestion.

github-actions[bot] commented 3 weeks ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

ohnoov commented 3 weeks ago

.

github-actions[bot] commented 6 days ago

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

ohnoov commented 6 days ago

Yes, the analysis and a proposed patch is available above. This issue is not stale.