drachtio / drachtio-server

A SIP call processing server that can be controlled via nodejs applications
https://drachtio.org
MIT License
240 stars 92 forks source link

Discrepancy between Drachtio and Sofia dialogs in Prometheus metrics #166

Open danieludy opened 3 years ago

danieludy commented 3 years ago

We have something that we can't explain. First here is the output from prometheus:

# HELP exposer_transferred_bytes_total Transferred bytes to metrics services
# TYPE exposer_transferred_bytes_total counter
exposer_transferred_bytes_total 2319113719.000000
# HELP exposer_scrapes_total Number of times metrics were scraped
# TYPE exposer_scrapes_total counter
exposer_scrapes_total 185711.000000
# HELP exposer_request_latencies Latencies of serving scrape requests, in microseconds
# TYPE exposer_request_latencies summary
exposer_request_latencies_count 185711
exposer_request_latencies_sum 79003577.000000
exposer_request_latencies{quantile="0.500000"} 419.000000
exposer_request_latencies{quantile="0.900000"} 444.000000
exposer_request_latencies{quantile="0.990000"} 444.000000
# HELP drachtio_sip_requests_in_total count of sip requests received
# TYPE drachtio_sip_requests_in_total counter
drachtio_sip_requests_in_total{method="UPDATE"} 2.000000
drachtio_sip_requests_in_total{method="OPTIONS"} 33325.000000
drachtio_sip_requests_in_total{method="CANCEL"} 741155.000000
drachtio_sip_requests_in_total{method="BYE"} 611447.000000
drachtio_sip_requests_in_total{method="INFO"} 16.000000
drachtio_sip_requests_in_total{method="ACK"} 1085426.000000
drachtio_sip_requests_in_total{method="INVITE"} 1338138.000000
# HELP drachtio_sip_requests_out_total count of sip requests sent
# TYPE drachtio_sip_requests_out_total counter
drachtio_sip_requests_out_total{method="OPTIONS"} 5001.000000
drachtio_sip_requests_out_total{method="BYE"} 607849.000000
drachtio_sip_requests_out_total{method="ACK"} 727923.000000
drachtio_sip_requests_out_total{method="INVITE"} 1510068.000000
# HELP drachtio_sip_responses_in_total count of sip responses received
# TYPE drachtio_sip_responses_in_total counter
drachtio_sip_responses_in_total{code="500",method="BYE"} 1.000000
drachtio_sip_responses_in_total{code="503",method="BYE"} 22.000000
drachtio_sip_responses_in_total{code="478",method="INVITE"} 38.000000
drachtio_sip_responses_in_total{code="404",method="BYE"} 2.000000
drachtio_sip_responses_in_total{code="401",method="INVITE"} 27.000000
drachtio_sip_responses_in_total{code="502",method="INVITE"} 408.000000
drachtio_sip_responses_in_total{code="482",method="INVITE"} 1.000000
drachtio_sip_responses_in_total{code="500",method="INVITE"} 42704.000000
drachtio_sip_responses_in_total{code="504",method="INVITE"} 2197.000000
drachtio_sip_responses_in_total{code="481",method="INVITE"} 202.000000
drachtio_sip_responses_in_total{code="480",method="INVITE"} 34431.000000
drachtio_sip_responses_in_total{code="406",method="INVITE"} 6.000000
drachtio_sip_responses_in_total{code="404",method="OPTIONS"} 4070.000000
drachtio_sip_responses_in_total{code="183",method="INVITE"} 748895.000000
drachtio_sip_responses_in_total{code="181",method="INVITE"} 8386.000000
drachtio_sip_responses_in_total{code="603",method="INVITE"} 63.000000
drachtio_sip_responses_in_total{code="604",method="INVITE"} 17.000000
drachtio_sip_responses_in_total{code="200",method="BYE"} 604281.000000
drachtio_sip_responses_in_total{code="403",method="INVITE"} 118439.000000
drachtio_sip_responses_in_total{code="481",method="BYE"} 269.000000
drachtio_sip_responses_in_total{code="487",method="INVITE"} 272225.000000
drachtio_sip_responses_in_total{code="486",method="INVITE"} 171978.000000
drachtio_sip_responses_in_total{code="404",method="INVITE"} 48726.000000
drachtio_sip_responses_in_total{code="491",method="INVITE"} 875.000000
drachtio_sip_responses_in_total{code="408",method="INVITE"} 21444.000000
drachtio_sip_responses_in_total{code="180",method="INVITE"} 604337.000000
drachtio_sip_responses_in_total{code="415",method="INVITE"} 30.000000
drachtio_sip_responses_in_total{code="488",method="INVITE"} 76732.000000
drachtio_sip_responses_in_total{code="407",method="INVITE"} 7.000000
drachtio_sip_responses_in_total{code="484",method="INVITE"} 293.000000
drachtio_sip_responses_in_total{code="200",method="OPTIONS"} 931.000000
drachtio_sip_responses_in_total{code="200",method="INVITE"} 718359.000000
drachtio_sip_responses_in_total{code="410",method="INVITE"} 169.000000
drachtio_sip_responses_in_total{code="408",method="BYE"} 455.000000
drachtio_sip_responses_in_total{code="501",method="INVITE"} 830.000000
# HELP drachtio_sip_responses_out_total count of sip responses sent
# TYPE drachtio_sip_responses_out_total counter
drachtio_sip_responses_out_total{code="500",method="INFO"} 12.000000
drachtio_sip_responses_out_total{code="200",method="INFO"} 4.000000
drachtio_sip_responses_out_total{code="481",method="CANCEL"} 157.000000
drachtio_sip_responses_out_total{code="500",method="INVITE"} 79.000000
drachtio_sip_responses_out_total{code="200",method="OPTIONS"} 33325.000000
drachtio_sip_responses_out_total{code="200",method="UPDATE"} 2.000000
drachtio_sip_responses_out_total{code="400",method="INVITE"} 17687.000000
drachtio_sip_responses_out_total{code="200",method="CANCEL"} 33117.000000
drachtio_sip_responses_out_total{code="406",method="INVITE"} 6.000000
drachtio_sip_responses_out_total{code="183",method="INVITE"} 745291.000000
drachtio_sip_responses_out_total{code="181",method="INVITE"} 8379.000000
drachtio_sip_responses_out_total{code="487",method="INVITE"} 33123.000000
drachtio_sip_responses_out_total{code="200",method="INVITE"} 728929.000000
drachtio_sip_responses_out_total{code="486",method="INVITE"} 171730.000000
drachtio_sip_responses_out_total{code="180",method="INVITE"} 603709.000000
drachtio_sip_responses_out_total{code="415",method="INVITE"} 46.000000
drachtio_sip_responses_out_total{code="604",method="INVITE"} 359.000000
drachtio_sip_responses_out_total{code="200",method="BYE"} 606780.000000
drachtio_sip_responses_out_total{code="603",method="INVITE"} 1929.000000
drachtio_sip_responses_out_total{code="403",method="INVITE"} 7637.000000
drachtio_sip_responses_out_total{code="481",method="BYE"} 2555.000000
drachtio_sip_responses_out_total{code="404",method="INVITE"} 48645.000000
drachtio_sip_responses_out_total{code="408",method="INVITE"} 206.000000
drachtio_sip_responses_out_total{code="491",method="INVITE"} 875.000000
drachtio_sip_responses_out_total{code="488",method="INVITE"} 10337.000000
drachtio_sip_responses_out_total{code="401",method="INVITE"} 37422.000000
drachtio_sip_responses_out_total{code="404",method="BYE"} 45.000000
drachtio_sip_responses_out_total{code="480",method="INVITE"} 35381.000000
drachtio_sip_responses_out_total{code="481",method="INVITE"} 143.000000
# HELP drachtio_build_info drachtio version running
# TYPE drachtio_build_info counter
drachtio_build_info{version="v0.8.9"} 1.000000
# HELP drachtio_time_started drachtio start time
# TYPE drachtio_time_started gauge
drachtio_time_started 1624746260.000000
# HELP drachtio_stable_dialogs count of SIP dialogs in progress
# TYPE drachtio_stable_dialogs gauge
drachtio_stable_dialogs{type="outbound"} 40.000000
drachtio_stable_dialogs{type="inbound"} 40.000000
# HELP drachtio_proxy_cores count of proxied call setups in progress
# TYPE drachtio_proxy_cores gauge
drachtio_proxy_cores 0.000000
# HELP drachtio_registered_endpoints count of registered endpoints
# TYPE drachtio_registered_endpoints gauge
drachtio_registered_endpoints 0.000000
# HELP drachtio_app_connections count of connections to drachtio applications
# TYPE drachtio_app_connections gauge
drachtio_app_connections 394.000000
# HELP drachtio_sofia_client_txn_hash_size current size of sofia hash table for client transactions
# TYPE drachtio_sofia_client_txn_hash_size gauge
drachtio_sofia_client_txn_hash_size 32767.000000
# HELP drachtio_sofia_server_txn_hash_size current size of sofia hash table for server transactions
# TYPE drachtio_sofia_server_txn_hash_size gauge
drachtio_sofia_server_txn_hash_size 8191.000000
# HELP drachtio_sofia_dialog_hash_size current size of sofia hash table for dialogs
# TYPE drachtio_sofia_dialog_hash_size gauge
drachtio_sofia_dialog_hash_size 2047.000000
# HELP drachtio_sofia_server_txns count of sofia server-side transactions
# TYPE drachtio_sofia_server_txns gauge
drachtio_sofia_server_txns 3063.000000
# HELP drachtio_sofia_client_txns count of sofia client-side transactions
# TYPE drachtio_sofia_client_txns gauge
drachtio_sofia_client_txns 10926.000000
# HELP drachtio_sofia_dialogs count of sofia dialogs
# TYPE drachtio_sofia_dialogs gauge
drachtio_sofia_dialogs 1337.000000
# HELP drachtio_sofia_msgs_recv count of sip messages received by sofia sip stack
# TYPE drachtio_sofia_msgs_recv gauge
drachtio_sofia_msgs_recv 8866630.000000
# HELP drachtio_sofia_msgs_sent count of sip messages sent by sofia sip stack
# TYPE drachtio_sofia_msgs_sent gauge
drachtio_sofia_msgs_sent 8901691.000000
# HELP drachtio_sofia_requests_recv_total count of sip requests received by sofia sip stack
# TYPE drachtio_sofia_requests_recv_total gauge
drachtio_sofia_requests_recv_total 3600581.000000
# HELP drachtio_sofia_requests_sent count of sip requests sent by sofia sip stack
# TYPE drachtio_sofia_requests_sent gauge
drachtio_sofia_requests_sent 3943593.000000
# HELP drachtio_sofia_bad_msgs_recv count of invalid sip messages received by sofia sip stack
# TYPE drachtio_sofia_bad_msgs_recv gauge
drachtio_sofia_bad_msgs_recv 0.000000
# HELP drachtio_sofia_bad_reqs_recv count of invalid sip requests received by sofia sip stack
# TYPE drachtio_sofia_bad_reqs_recv gauge
drachtio_sofia_bad_reqs_recv 0.000000
# HELP drachtio_sofia_retransmitted_requests count of sip requests retransmitted by sofia sip stack
# TYPE drachtio_sofia_retransmitted_requests gauge
drachtio_sofia_retransmitted_requests 26461.000000
# HELP drachtio_sofia_retransmitted_responses count of sip responses retransmitted by sofia sip stack
# TYPE drachtio_sofia_retransmitted_responses gauge
drachtio_sofia_retransmitted_responses 0.000000
# HELP drachtio_call_answer_seconds_in call answer time in seconds for calls received
# TYPE drachtio_call_answer_seconds_in histogram
drachtio_call_answer_seconds_in_count 606872
drachtio_call_answer_seconds_in_sum 5552415.357048
drachtio_call_answer_seconds_in_bucket{le="1.000000"} 205061
drachtio_call_answer_seconds_in_bucket{le="3.000000"} 265671
drachtio_call_answer_seconds_in_bucket{le="6.000000"} 326108
drachtio_call_answer_seconds_in_bucket{le="10.000000"} 391802
drachtio_call_answer_seconds_in_bucket{le="15.000000"} 453937
drachtio_call_answer_seconds_in_bucket{le="20.000000"} 484259
drachtio_call_answer_seconds_in_bucket{le="30.000000"} 549221
drachtio_call_answer_seconds_in_bucket{le="60.000000"} 606798
drachtio_call_answer_seconds_in_bucket{le="+Inf"} 606872
# HELP drachtio_call_answer_seconds_out call answer time in seconds for calls sent
# TYPE drachtio_call_answer_seconds_out histogram
drachtio_call_answer_seconds_out_count 607281
drachtio_call_answer_seconds_out_sum 6734035.634711
drachtio_call_answer_seconds_out_bucket{le="1.000000"} 55176
drachtio_call_answer_seconds_out_bucket{le="3.000000"} 199729
drachtio_call_answer_seconds_out_bucket{le="6.000000"} 286976
drachtio_call_answer_seconds_out_bucket{le="10.000000"} 348815
drachtio_call_answer_seconds_out_bucket{le="15.000000"} 433469
drachtio_call_answer_seconds_out_bucket{le="20.000000"} 472137
drachtio_call_answer_seconds_out_bucket{le="30.000000"} 543824
drachtio_call_answer_seconds_out_bucket{le="60.000000"} 607176
drachtio_call_answer_seconds_out_bucket{le="+Inf"} 607281
# HELP drachtio_call_pdd_seconds_in call post-dial delay seconds for calls received
# TYPE drachtio_call_pdd_seconds_in histogram
drachtio_call_pdd_seconds_in_count 887438
drachtio_call_pdd_seconds_in_sum 395.756321
drachtio_call_pdd_seconds_in_bucket{le="1.000000"} 887438
drachtio_call_pdd_seconds_in_bucket{le="2.000000"} 887438
drachtio_call_pdd_seconds_in_bucket{le="3.000000"} 887438
drachtio_call_pdd_seconds_in_bucket{le="5.000000"} 887438
drachtio_call_pdd_seconds_in_bucket{le="7.000000"} 887438
drachtio_call_pdd_seconds_in_bucket{le="10.000000"} 887438
drachtio_call_pdd_seconds_in_bucket{le="15.000000"} 887438
drachtio_call_pdd_seconds_in_bucket{le="20.000000"} 887438
drachtio_call_pdd_seconds_in_bucket{le="+Inf"} 887438
# HELP drachtio_call_pdd_seconds_out call post-dial delay seconds for calls received
# TYPE drachtio_call_pdd_seconds_out histogram
drachtio_call_pdd_seconds_out_count 959172
drachtio_call_pdd_seconds_out_sum 1780379.981887
drachtio_call_pdd_seconds_out_bucket{le="1.000000"} 239481
drachtio_call_pdd_seconds_out_bucket{le="2.000000"} 635713
drachtio_call_pdd_seconds_out_bucket{le="3.000000"} 850382
drachtio_call_pdd_seconds_out_bucket{le="5.000000"} 931063
drachtio_call_pdd_seconds_out_bucket{le="7.000000"} 945958
drachtio_call_pdd_seconds_out_bucket{le="10.000000"} 952120
drachtio_call_pdd_seconds_out_bucket{le="15.000000"} 956088
drachtio_call_pdd_seconds_out_bucket{le="20.000000"} 958570
drachtio_call_pdd_seconds_out_bucket{le="+Inf"} 959172

Our question is how/why are the drachtio_sofia_dialogs and drachtio_stable_dialogs counts so different? I would have assumed that a sofia dialog should correlate to a drachtio dialog. Is there something I am missing here? Side note, we noticed this because the drachtio_sofia_dialogs seemed to be slowly leaking over time. It would go up and down with load but the baseline would steadily increase.

davehorton commented 3 years ago

On the latter point, there may be a slow memory leak of some kind where a sip dialog is not being freed by drachtio in certain situations. I am trying to track these down and fix them as they are identified. If you are able to identify specific call scenarios where the drachtio_stable_dialogs value does not decrement after the call is complete, please do open an issue on that.

On the first question, drachtio_stable_dialogs represents dialogs in progress (active), I am actually not sure if drachtio_sofia_dialogs (which is reported by the sofia stack) is meant to be cumulative or active. I will have a look.

danieludy commented 3 years ago

Is there any way to dump the active dialogs within sofia to be able to see which ones are leaking? Finding which of the 1338138 calls resulted in the 1300 "leaked" dialogs would be a little hard unless there was an obvious error message to look for.

davehorton commented 3 years ago

yes if you send a SIGHUP signal to the drachtio process the next time it writes stats to the log file (every 30 secs) it will include voluminous detail on the dialogs that are active

danieludy commented 3 years ago

Thanks Dave the above worked... Now trying to go through a dialog that was stuck vs one that wasn't to find the difference... To try and speed things along can you tell me where the sofia dialog is added/removed? This is a B2BUA call if it makes a difference and appears to be the B leg.

davehorton commented 3 years ago

In most cases, the sofia "leg" (analagous to the SipDialog at the drachtio application level), should get destroyed in the SipDialog destructor:

https://github.com/drachtio/drachtio-server/blob/1ed34a600d25bf6c8629f0c62428d5fd4e1c0b50/src/sip-dialog.cpp#L195

when the call hangs up