friends-of-freeswitch / mod_prometheus

FreeSWITCH Prometheus Instrumentation Module in Rust

freeswitch_sessions_active - Negative Value #10

Open JonHVU opened 7 years ago

JonHVU commented 7 years ago

Hi @moises-silva,

Sorry for the noise but we are testing the module on a quiet box, and the sessions active appear to have a negative value;

# HELP freeswitch_sessions_active FreeSWITCH Active Sessions
freeswitch_sessions_active -197 1495446539381

What could cause this?

Thanks

Jon

moises-silva commented 7 years ago

@JonHVU That's a mismatch between CHANNEL_CREATE/CHANNEL_DESTROY events. It seems the module received more CHANNEL_DESTROY events than CHANNEL_CREATE events. I need to fix that logic to at least validate the count so it does not go below zero. Did you reload the module by any chance? I think this could happen if you do a module reload while there are active calls, because it won't remember the calls that are active already.
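The "do not go below zero" validation could look roughly like the sketch below. It is not the module's actual code: `ClampedGauge` and its methods are hypothetical names, and the sketch assumes `inc` is called from the CHANNEL_CREATE handler and `dec` from the CHANNEL_DESTROY handler.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical active-sessions gauge that refuses to go below zero.
pub struct ClampedGauge {
    value: AtomicU64,
}

impl ClampedGauge {
    pub const fn new() -> Self {
        Self { value: AtomicU64::new(0) }
    }

    /// Assumed to be called once per CHANNEL_CREATE event.
    pub fn inc(&self) {
        self.value.fetch_add(1, Ordering::Relaxed);
    }

    /// Assumed to be called once per CHANNEL_DESTROY event.
    /// Returns false (and leaves the gauge at 0) when there is no
    /// matching create, e.g. for calls that predate a module reload.
    pub fn dec(&self) -> bool {
        self.value
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |v| v.checked_sub(1))
            .is_ok()
    }

    pub fn get(&self) -> u64 {
        self.value.load(Ordering::Relaxed)
    }
}
```

Clamping only hides the symptom: after a reload with calls in progress the gauge would still under-count until those calls end, so the `false` return from `dec` is worth logging.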

socomsystems commented 7 years ago

@moises-silva I'm getting the following negative value as well with active registrations:

# HELP freeswitch_registrations_active FreeSWITCH Active Registrations
freeswitch_registrations_active -26236 1499776145514

Total actual registrations on this tested switch is 12.

Any ideas?

Thanks

Troy

moises-silva commented 7 years ago

Sadly, it's buggy. I need to rewrite that to query the core explicitly for the registration counts as opposed to relying on events. The problem is starting FreeSWITCH when there's previous state (e.g. registrations already in the db). I hope it's not much of a problem, I'll try to get it done over the weekend.
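A minimal sketch of the "query instead of count events" idea, under stated assumptions: `RegistrationSource` is a hypothetical trait standing in for whatever call returns the real registration count from the core (no actual FreeSWITCH API is shown here), and `FixedCount` is a stand-in used only for illustration.

```rust
/// Hypothetical source of truth for the registration count.
/// A real implementation would ask the FreeSWITCH core; that call
/// is deliberately not shown because it is an assumption here.
pub trait RegistrationSource {
    fn count(&self) -> u64;
}

/// Renders the gauge in the Prometheus text format at scrape time,
/// reading the current count rather than an event-driven tally.
pub fn render_registrations_active(source: &dyn RegistrationSource) -> String {
    format!(
        "# HELP freeswitch_registrations_active FreeSWITCH Active Registrations\n\
         # TYPE freeswitch_registrations_active gauge\n\
         freeswitch_registrations_active {}\n",
        source.count()
    )
}

/// Stand-in source that always reports a fixed count.
pub struct FixedCount(pub u64);

impl RegistrationSource for FixedCount {
    fn count(&self) -> u64 {
        self.0
    }
}
```

Because the value is read on every scrape, registrations that were already in the db before startup are included and the gauge cannot drift or go negative.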

socomsystems commented 7 years ago

@moises-silva thanks for the feedback! I've updated to the most recent commit and am still experiencing the same behavior, as well as other oddities at times, specifically on FS servers in a PostgreSQL BDR multi-master setup that are essentially in standby for fail-over. The primary's registration data does replicate to all nodes in the cluster. I've also noted that on the primary, active registrations continue to climb exponentially. Hope this helps.

Video of primary in the cluster: https://www.screencast.com/t/9aF7vw76fKe

Video of a standby: https://www.screencast.com/t/2m54fwOKsj7

I may have the queries done improperly. They are as follows:

  1. Active sessions: freeswitch_sessions_active{instance=~"$node:.*"}

  2. ASR: freeswitch_sessions_asr{instance=~"$node:.*"}

  3. Active Calls (last 12 hours): ((freeswitch_sessions_answered_total{instance=~"$node:.*"} - freeswitch_sessions_failed_total{instance=~"$node:.*"}) / (freeswitch_sessions_answered_total{instance=~"$node:.*"} )) * 100

  4. Active Registrations: freeswitch_registrations_active{instance=~"$node:.*"}

  5. Heartbeats: ((freeswitch_heartbeats_total{instance=~"$node:.*"}) / (freeswitch_heartbeats_total{instance=~"$node:.*"} )) * 100

  6. FreeSWITCH Registrations Total: freeswitch_registrations_total{instance=~"$node:.*"}

Please excuse my ignorance, I'm green with regard to Prometheus, Grafana and Rust for that matter. Loving every minute of this though.

I'm interested in seeing if I can correct this via ESL. As you mentioned, querying the core via a sofia request would be more accurate than log parsing, but I'm vague on exactly how to go about doing it. Your README makes note of its ability; any chance of a nudge in the right direction? I very much appreciate your quality work and other efforts regarding your project!

Once I get a handle on this, what I'd like to focus on next is being able to see registrations on a per domain basis e.g. hard sets for expected registrations for each domain as FS is a great multi-tenant platform. Then alarming on e.g. 10% or more registrations loss on a per domain basis.
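To make the per-domain idea concrete, here is a rough sketch of the check, assuming the expected and actual registration counts per domain are already available as maps. The function name, the map-based inputs, and the example domains in the note below are all hypothetical.

```rust
use std::collections::BTreeMap;

/// Returns (domain, loss percentage) for every domain whose registration
/// loss is at or above `threshold_pct`. A domain missing from `actual`
/// is treated as having zero registrations.
pub fn domains_over_loss_threshold(
    expected: &BTreeMap<String, u64>,
    actual: &BTreeMap<String, u64>,
    threshold_pct: f64,
) -> Vec<(String, f64)> {
    let mut alarms = Vec::new();
    for (domain, &want) in expected {
        if want == 0 {
            continue; // nothing expected, so no loss can be computed
        }
        let have = actual.get(domain).copied().unwrap_or(0);
        // Extra registrations beyond the expected count are not a loss.
        let lost = want.saturating_sub(have);
        let loss_pct = lost as f64 * 100.0 / want as f64;
        if loss_pct >= threshold_pct {
            alarms.push((domain.clone(), loss_pct));
        }
    }
    alarms
}
```

With a 10% threshold, a domain expecting 20 registrations and seeing 19 (5% loss) would not alarm, while one expecting 10 and seeing 8 (20% loss) would.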

Cheers, Troy

sfrique commented 7 years ago

Hello,

I also get negative for freeswitch_sessions_active. It seemed it started for a while and then it stabilized. I just started testing this. The freeswitch_sessions_active is the most important value for us now.

Great work, hopefully it will be fixed!


socomsystems commented 6 years ago

@moises-silva just checking in on your mod_prometheus as it's been a while. Reinstalled and recompiled, and I'm still seeing the same possible issues: perpetual climbing of active registrations, failures, attempts, reg totals, and heartbeats. Is this behavior as intended? I haven't viewed new commits but am itching to apply your mod. So much potential! Wish I had time to take a gander at Rust. Is it now functional as presented? Am I misunderstanding the mentioned metrics? Thank you for your contribution. Cheers, Troy

moises-silva commented 6 years ago

Yeah, I wish I had the free time to spend on this but I don't. This module was an experiment to get a module written in Rust interfacing with FreeSWITCH. I'll put up a disclaimer in the README indicating it's broken and is only useful as an example of how to get a Rust module built for FreeSWITCH, but the bugs that were found have not been fixed and I can't really commit to when I'll be able to fix them (even more since I have no use for this module myself at the moment).

socomsystems commented 6 years ago

Thanks for the feedback @moises-silva

kvishnivetsky commented 5 years ago

This issue may be solved by setting the gauges/counters to the values from FreeSWITCH's internal counters.