Sentinel metrics bugs 2/3: standard metrics keep being reported ad infinitum when a role changes

While working on the Sentinel Grafana dashboard (https://github.com/3scale-ops/saas-operator/pull/197), I discovered a couple of bugs:

Bug 2: standard metrics keep being reported ad infinitum when a role changes

All standard metrics are retrieved from sentinel commands.

For every redis-server on a given shard with a given role, there is one time series, identified by a tuple of 3 elements (shard, role, redis-server).

However, when the role of a redis-server changes, a new time series is created for the new shard--role2--redis-server tuple, while the old shard--role1--redis-server time series keeps being reported with its latest value, even though that tuple no longer exists.

Example

The redis-server 10.65.6.5 from shard01 was a master a long time ago (yellow line), but it is now a slave (blue line).

So the yellow line is fake: it is a flat line holding the last value retrieved when this server was still a master.
The correct one is the blue line, which shows the server as a slave, with the characteristic sawtooth shape of the last ping reply metric.

Another example is the role reported time:

In theory, we should see 6 redis-server instances reporting their role here (2 shards, 3 instances per shard).
However, there have been a few failovers, with redis-servers changing their role within the same shard, and now there are 9 active time series.
For example, the role reported time should grow every millisecond until the role changes (which normally does not happen often).
A couple of time series constantly report a few seconds (44s and 14s). This means there were 2 consecutive failovers in which a couple of redis-servers held a given role for only 44s and 14s, yet these 2 series are still being reported at their last value.

This applies to any metric where the role label is added in https://github.com/3scale-ops/saas-operator/blob/8aafa688780f86dcb013bf8bf7fe884e1bf44d43/pkg/redis/metrics/sentinel_metrics.go

Workaround

A workaround would be to remove the role label from every metric, so that each time series is identified by the shard--redis-server tuple (instead of shard--role--redis-server). However, since these are sentinel metrics, I think it makes sense to always have the role label.

Ideal solution

IMO, the ideal solution would be to stop reporting metrics whose shard--role--redis-server tuple is no longer active.

So, in the case of a redis-server x.x.x.x from shard1 going from the master role to the slave role, what we should see is:
- First, the reported metric is shard1--master--x.x.x.x
- A failover occurs: shard1--master--x.x.x.x stops being reported (it is no longer active), and shard1--slave--x.x.x.x is now the one reported (active)
- On a graph, we would easily see that redis-server x.x.x.x was a master and is now a slave
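
The sequence above can be sketched in Go. This is only an illustration under stated assumptions, not the operator's code: `seriesKey` and `gauge` are hypothetical names, and the map is a stdlib stand-in for a labelled gauge vector. With the Prometheus Go client, the equivalent removal would likely go through the vector's `DeleteLabelValues` method.

```go
package main

// seriesKey is the hypothetical label tuple shard--role--redis-server.
type seriesKey struct {
	Shard, Role, Server string
}

// gauge is a stand-in for a labelled gauge vector: one value per label tuple.
type gauge map[seriesKey]float64

// Set records a value for the server under its current role and drops any
// series the same server still has under another role in the same shard.
func (g gauge) Set(shard, role, server string, v float64) {
	for k := range g {
		if k.Shard == shard && k.Server == server && k.Role != role {
			delete(g, k) // stale tuple: the server no longer has this role
		}
	}
	g[seriesKey{shard, role, server}] = v
}
```

For example, calling `Set("shard1", "master", "x.x.x.x", 1)` and later `Set("shard1", "slave", "x.x.x.x", 2)` leaves a single series for that server, the slave one.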

I tried the workaround of removing the role label from every metric directly; to know the role of every instance, we could then have a specific metric with that information.

Worked OK

However, it did not fully work. Upon a failover with a role change, it only worked for metrics that are always available for all instances, independently of the role (so there is always a total of 6 time series, because there are 6 instances), such as:

Last OK Ping Reply or Link Pending commands
Role Reported Time

Worked wrong

It did not work for metrics:

That are only retrieved for a master, like Num Slaves / Num Sentinels (there should be only 2 time series because there are 2 masters, not 3)
That are only retrieved for a slave, like Master Link Down / Slave Replication Offset (there should be only 4 time series because there are 4 slaves, not 5)

So:

For slave metrics, the slave that became a master keeps reporting its last metric value, which should be cleared from /metrics
For master metrics, the master that became a slave keeps reporting its last metric value, which should be cleared from /metrics

So we need to reset metrics upon a failover with a role change, so that only true metrics are reported, not obsolete ones.
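
A minimal sketch of such a reset, assuming the role label has already been removed: the exporter remembers the role seen on the previous poll and, when it changes, clears the series of every metric that the new role does not report. All names here (`exporter`, `observeRole`, the metric names used as map keys) are hypothetical and only illustrate the idea.

```go
package main

// instance identifies a redis-server inside a shard; there is no role label.
type instance struct {
	Shard, Server string
}

// exporter is a hypothetical stand-in for the sentinel metrics exporter.
type exporter struct {
	lastRole map[instance]string             // role seen on the previous poll
	owner    map[string]string               // metric name -> only role that reports it ("" = any)
	series   map[string]map[instance]float64 // metric name -> value per instance
}

func newExporter(owner map[string]string) *exporter {
	return &exporter{
		lastRole: map[instance]string{},
		owner:    owner,
		series:   map[string]map[instance]float64{},
	}
}

// observeRole is meant to run on every poll, before values are recorded.
// When the role differs from the previous poll, it clears the series of
// every metric that the new role does not report.
func (e *exporter) observeRole(in instance, role string) {
	if prev, seen := e.lastRole[in]; seen && prev != role {
		for name, byInstance := range e.series {
			if o := e.owner[name]; o != "" && o != role {
				delete(byInstance, in)
			}
		}
	}
	e.lastRole[in] = role
}

// set records the latest value of a metric for an instance.
func (e *exporter) set(name string, in instance, v float64) {
	if e.series[name] == nil {
		e.series[name] = map[instance]float64{}
	}
	e.series[name][in] = v
}
```

Metrics that every role reports (owner `""`), such as the role reported time, are left untouched, so only the master-only or slave-only series of the instance that changed role disappear from /metrics.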
