[bug]: Taskprocessor overload with 25k+ PJSIP endpoints with realtime and constant adding/modifying/deleting

smtcbn commented 6 months ago

Severity

Minor

Versions

GIT-18-f5c9f7d3c8

Components/Modules

chan_pjsip.so, res_stasis.so

Operating Environment

Ubuntu 20.04.6 LTS

Frequency of Occurrence

Occasional

Issue Description

I migrated my chan_sip Asterisk server to PJSIP (Asterisk 18) one week ago. After about 50k calls have been processed, the PJSIP Asterisk stops responding to REGISTER and INVITE requests. When it stopped responding, I restarted it with "core restart now", but after Asterisk started again the thread pool size had not gone down. "core show taskprocessors" on the PJSIP Asterisk reports "53212 taskprocessors". I also have another Asterisk 18 server that still uses chan_sip. Both servers have processed about 50k calls, yet "core show taskprocessors" on the chan_sip server reports only "4441 taskprocessors".

At the time of the crash, the taskprocessors on the PJSIP Asterisk included 25k named pjsip/options/XXXX and 25k named stasis/p:endpoint:PJSIP/XXXX. How can I reduce the number of these taskprocessors? And why can't the PJSIP Asterisk respond to REGISTER and INVITE requests during this time?

Relevant log output

No response

Asterisk Issue Guidelines

[X] Yes, I have read the Asterisk Issue Guidelines

jcolp commented 6 months ago

There is insufficient information here. We would need to know the configuration (how many calls per second, how many endpoints) as well as the specific usage patterns (what is being done in the dialplan), ideally with it being able to be reproduced using given configuration and something like SIPp. As for why it does not respond to REGISTER or INVITE requests, this is to allow the backlog of work to be processed without adding to it if an overload occurs. This behavior is configurable in pjsip.conf[1]. Additionally there is no way to decrease the number of taskprocessors explicitly, the number may decrease as a result of other configuration (such as removing endpoints). This is because they are created based on the needs of various code within Asterisk. You also need to attach the actual complete output of "core show taskprocessors".

[1] https://github.com/asterisk/asterisk/blob/master/configs/samples/pjsip.conf.sample#L1329
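
For readers following along, the configurable behaviour mentioned above could look roughly like the fragment below. This is a sketch, not a recommendation: it assumes the option in question is `taskprocessor_overload_trigger` in the global section of pjsip.conf, and the section name `global` is simply the conventional one from the sample file. Check the linked sample for the exact wording in your version.

```ini
; pjsip.conf (illustrative fragment)
[global]
type=global
; Which overload condition makes the PJSIP distributor stop accepting
; new requests while the backlog is worked off:
;   global     - an overload of any taskprocessor in Asterisk (the default)
;   pjsip_only - only overloads of PJSIP taskprocessors
;   none       - ignore overload conditions entirely
taskprocessor_overload_trigger=global
```

Changing the trigger only changes when new requests are refused. It does not remove the underlying backlog.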

smtcbn commented 6 months ago

Attachment: taskprocessors.txt. I censored all my endpoints as XXXXXXXX. Our chan_pjsip Asterisk handles 5 calls per second and has 25k endpoints. I also wonder: why does my chan_sip Asterisk have 4k taskprocessors at the same CPS while the chan_pjsip Asterisk has 50k?

jcolp commented 6 months ago

Because they are completely different implementations, and chan_sip doesn't use taskprocessor as much as PJSIP as the chan_sip code predates the very existence of taskprocessors in the first place. You can not compare them 1 to 1.

So you have 25,000 endpoints on chan_pjsip. Based on the taskprocessors output there is a lot of OPTIONS/qualify management happening. How exactly are they configured? Is qualify enabled on AORs? Are you reloading AORs/endpoints a lot? Was there a substantial number that lost connectivity at the same time?

jcolp commented 6 months ago

Oh, and if qualify is enabled - at what frequency and timeout?

smtcbn commented 6 months ago

Attachment: pjsip_ps_aors. I uploaded the ps_aors table design. Qualify options are not configured for either endpoints or extensions, so there is no frequency or timeout configured. To answer your questions:

Is qualify enabled on AORs? No.

Are you reloading AORs/endpoints a lot? We are not reloading using "module reload chan_pjsip.so" but sorcery caching is enabled like this. ==> sorcery.conf

[res_pjsip] ; Realtime PJSIP configuration wizard
endpoint/cache = memory_cache,object_lifetime_stale=20,object_lifetime_maximum=25,expire_on_reload=yes
endpoint=realtime,ps_endpoints
auth/cache=memory_cache,object_lifetime_stale=20,object_lifetime_maximum=25,expire_on_reload=yes
auth=realtime,ps_auths
aor/cache = memory_cache,object_lifetime_stale=20,object_lifetime_maximum=25,expire_on_reload=yes
aor=realtime,ps_aors

Was there a substantial number that lost connectivity at the same time? Yes, all endpoints were disconnected at the same time.

jcolp commented 6 months ago

Okay, so now realtime is involved, and due to staleness and lifetime maximum you've likely got tons of churn and updates going on constantly. I would suggest not having those set so low or at all on a system with so many endpoints and AORs. There is a cost to those updates. If you increase those values, or eliminate them does the issue go away?
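
As a rough illustration of that suggestion, the same cache mappings could be written with longer lifetimes. The values 300 and 600 seconds are placeholders picked for this sketch, not tested recommendations, and it is not known whether this alone resolves the overload.

```ini
; sorcery.conf (illustrative fragment, lifetime values are assumptions)
[res_pjsip]
; Cached objects count as stale after 5 minutes and are dropped after
; 10 minutes, instead of after 20 and 25 seconds.
endpoint/cache=memory_cache,object_lifetime_stale=300,object_lifetime_maximum=600,expire_on_reload=yes
endpoint=realtime,ps_endpoints
auth/cache=memory_cache,object_lifetime_stale=300,object_lifetime_maximum=600,expire_on_reload=yes
auth=realtime,ps_auths
aor/cache=memory_cache,object_lifetime_stale=300,object_lifetime_maximum=600,expire_on_reload=yes
aor=realtime,ps_aors
```

The trade-off is that a change made in the database may take up to the stale lifetime to become visible. The memory cache module also offers CLI commands under "sorcery memory cache" (for example, expire and stale) that may help push individual changes through sooner.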

smtcbn commented 6 months ago

I can increase the lifetimes, but my endpoints change frequently: new endpoints are added, updated, or deleted all the time. How can I keep these endpoints fresh without "module reload chan_pjsip.so" or without using sorcery? And is this our actual problem? If I increase these values, will Asterisk stop getting stuck? And will the number of taskprocessors decrease?

jcolp commented 6 months ago

The number of taskprocessors is not your problem. The problem is a single taskprocessor that is getting overloaded, specifically based on "core show taskprocessors" it is the one that manages the qualify/OPTIONS support. This functionality is also used to manage endpoint status if OPTIONS isn't being done.

This scenario has now become even more complex as you've stated that endpoints are frequently changing with new ones being added/updated/deleted.

Your scenario is an uncommon, off-nominal, not really tested one that I haven't seen or heard of before, so this may be your problem, or there may be other problems that haven't yet been uncovered. I don't know if increasing these values will change things.

I will accept this issue after I update the description and such. There is no time frame on when this would get looked into, or even if. You're at the point of needing optimizations/tuning/tweaking most likely.

smtcbn commented 6 months ago

Thanks for your valuable support @jcolp
I can give extra information about how we use the chan_pjsip Asterisk. We use MySQL as the database, with Asterisk realtime. There are 25k endpoints: 17k are extensions and 8k are trunks. We used Asterisk with chan_sip for many years and had to move to chan_pjsip because chan_sip is end-of-life. We converted our sip_users into auths, AORs, and endpoints for chan_pjsip. We restart it every morning; however, after 5 hours the chan_pjsip Asterisk becomes unable to take incoming calls and registrations. Additionally, any endpoint can be deleted, updated, or inserted at any time, so we use sorcery caching to pick up newly updated endpoints in chan_pjsip. Also, when I restart the chan_pjsip Asterisk, the restart takes 1 minute. I enabled debug mode: first chan_pjsip selects everything ('%') from the ps_endpoints table, and then it selects the endpoints one by one from the ps_endpoints table again. These steps take time, and the chan_pjsip Asterisk only starts after they finish. How can I make this faster? Is there a clue on this?

jcolp commented 6 months ago

I'm not sure what you mean by clue - at that scale it requires code optimization/changing things around most likely.

smtcbn commented 6 months ago

I mean, what can I change or optimize?

jcolp commented 6 months ago

I don't have an answer for that, because you're in uncharted scaling territory that requires investigation, trial, error, etc.

smtcbn commented 6 months ago

Okay, thanks for your support.

alex2grad commented 6 months ago

@smtcbn, you should add full_backend_cache=yes for all memory_cache sorcery objects. https://www.asterisk.org/asterisk-13-8-0-highlights/
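
A minimal sketch of how that option might be added to the endpoint mapping posted earlier in this thread, assuming the option is spelled `full_backend_cache` as in the memory cache module. The auth and aor cache lines would take the same extra option.

```ini
; sorcery.conf (illustrative fragment)
[res_pjsip]
; full_backend_cache=yes asks the cache to hold every object from the
; backend, so lookups can be answered from memory.
endpoint/cache=memory_cache,object_lifetime_stale=20,object_lifetime_maximum=25,expire_on_reload=yes,full_backend_cache=yes
endpoint=realtime,ps_endpoints
```

Whether this helps at 25k endpoints has not been confirmed in this thread.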
