3scale-ops / saas-operator

3scale SaaS Operator - www.3scale.net
Apache License 2.0

Bugs sentinel metrics 2/3: standard metrics being reported ad infinitum where a role changes #199

Closed slopezz closed 2 years ago

slopezz commented 2 years ago

While working on the Sentinel Grafana dashboard (https://github.com/3scale-ops/saas-operator/pull/197), I discovered a couple of bugs:

Bug 2: standard metrics being reported ad infinitum where a role changes

All standard metrics are retrieved from sentinel commands.

For every redis-server on a given shard with a given role, there is a separate time series, identified by a 3-element tuple (shard, role, redis-server).

However, when the role of a given redis-server changes, a new time series is created for the new shard--role2--redis-server tuple, while the old shard--role1--redis-server time series keeps being reported with its latest value, although that tuple no longer exists.
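The behaviour described above can be reproduced with a minimal sketch. It uses a plain Go map as a hypothetical stand-in for a labelled gauge (it is not the operator's actual code, and the names `seriesKey`, `gaugeVec` and `linkPending` are made up for illustration). The assumption is that a label combination, once set, stays exported until something removes it explicitly.

```go
package main

import "fmt"

// seriesKey models the label tuple shard--role--redis-server from the issue.
type seriesKey struct {
	Shard, Role, Server string
}

// gaugeVec is a simplified, hypothetical stand-in for a labelled gauge:
// once a label combination is set, it is kept until explicitly removed.
type gaugeVec map[seriesKey]float64

func (g gaugeVec) set(k seriesKey, v float64) { g[k] = v }

func main() {
	linkPending := gaugeVec{} // hypothetical metric name

	// First collection: 10.65.6.5 is the master of shard01.
	linkPending.set(seriesKey{"shard01", "master", "10.65.6.5"}, 0)

	// After a failover the same server is reported as a slave. A new series
	// is created, but nothing removes the old master series.
	linkPending.set(seriesKey{"shard01", "slave", "10.65.6.5"}, 0)

	fmt.Println(len(linkPending)) // prints 2: two series for one redis-server
}
```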

Example

This redis server, 10.65.6.5 from shard01, was a master a long time ago (in yellow), but it is now a slave (in blue).

image image

Another example is the role reported time:

image

This applies to any metric where the role label is added in https://github.com/3scale-ops/saas-operator/blob/8aafa688780f86dcb013bf8bf7fe884e1bf44d43/pkg/redis/metrics/sentinel_metrics.go

Workaround

A workaround would be to remove the role label from every metric, so each time series would be identified by the tuple shard--redis-server (instead of shard--role--redis-server). However, since these are sentinel metrics, I think it makes sense to always have the role label.

Ideal solution

IMO, the ideal solution would be to stop reporting metrics whose shard--role--redis-server tuple is no longer active.
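One way this could look is sketched below: after each sentinel poll, delete every series whose tuple is missing from the set of currently active tuples. The types and the `pruneStale` helper are hypothetical and use a plain map in place of a real metric vector. With the Prometheus Go client, the per-series removal would presumably map to something like `DeleteLabelValues` on the vector, but that wiring is an assumption and is not shown here.

```go
package main

import "fmt"

// seriesKey models the label tuple shard--role--redis-server.
type seriesKey struct {
	Shard, Role, Server string
}

// gaugeVec is a hypothetical stand-in for a labelled gauge.
type gaugeVec map[seriesKey]float64

// pruneStale deletes every series whose tuple is not in the active set
// and returns how many series were removed.
func pruneStale(g gaugeVec, active map[seriesKey]bool) int {
	removed := 0
	for k := range g {
		if !active[k] {
			delete(g, k) // deleting while ranging over a map is safe in Go
			removed++
		}
	}
	return removed
}

func main() {
	g := gaugeVec{
		{"shard01", "master", "10.65.6.5"}: 0, // stale: the server was demoted
		{"shard01", "slave", "10.65.6.5"}:  0,
	}
	// Tuples reported by the most recent sentinel poll (illustrative data).
	active := map[seriesKey]bool{
		{"shard01", "slave", "10.65.6.5"}: true,
	}
	fmt.Println(pruneStale(g, active), len(g)) // prints "1 1"
}
```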

slopezz commented 2 years ago

I tried the workaround of removing the role label directly from every metric, so that the role of every instance would instead be exposed through a specific metric carrying that information.

Worked OK

However, it did not fully work. Upon a failover with a role change, it only worked OK for metrics that are always available for all instances, independently of the role (so there are always 6 time series in total, because there are 6 instances), like:

Worked wrong

While metrics:

So:

So we need to reset metrics upon a failover with a role change, so that only current metrics are reported, not obsolete ones.
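A rough sketch of the reset approach, under the assumption that each collection cycle has a full snapshot of what sentinel currently reports: clear the whole vector, then repopulate it from that snapshot only. All names here (`observation`, `refresh`, the map-based `gaugeVec`) are hypothetical. In the Prometheus Go client, the clearing step would presumably correspond to calling `Reset()` on the metric vector.

```go
package main

import "fmt"

// seriesKey models the label tuple; Role may be empty if the label is dropped.
type seriesKey struct {
	Shard, Role, Server string
}

// gaugeVec is a hypothetical stand-in for a labelled gauge.
type gaugeVec map[seriesKey]float64

// observation is one value taken from the latest sentinel snapshot.
type observation struct {
	Key   seriesKey
	Value float64
}

// reset drops every series held by the vector.
func (g gaugeVec) reset() {
	for k := range g {
		delete(g, k)
	}
}

// refresh rebuilds the vector so it contains only what the snapshot reports.
func refresh(g gaugeVec, snapshot []observation) {
	g.reset()
	for _, o := range snapshot {
		g[o.Key] = o.Value
	}
}

func main() {
	g := gaugeVec{{"shard01", "master", "10.65.6.5"}: 0}

	// Snapshot taken after the failover (illustrative data).
	refresh(g, []observation{
		{Key: seriesKey{"shard01", "slave", "10.65.6.5"}, Value: 0},
	})
	fmt.Println(len(g)) // prints 1: only the current series is left
}
```

One trade-off to keep in mind: if a scrape lands between the reset and the repopulation, it could briefly see no series at all, so the two steps may need to happen under a lock or be applied only when a role change is detected.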