Connecting to a sentinel should not take into account s_down slaves

rtenagom commented 6 years ago

We are using a Redis HA deployment that has 3 sentinels, 1 master, and 2 slaves. When using rediscala client for scala, we use SentinelMonitoredRedisClientMasterSlaves.

There is a bug in the code that, when this client gets the list of slaves from the sentinel, it gets all the slaves that the sentinel has recorded, regardless of the flags. As you may know, a slave recorded in a sentinel has a property called flag, that usually contains the value slave. If a slave goes down, sentinels record it and also adds the flag s_down to the list of flags of that slave.

So basically a sentinel could have a list of 4 slaves, but only 2 of them would be active, and the rest of them would have flags="slave,s_down". The result of this is that we see in the logs that the client successfully connects to the active slaves, but also tries indefinitely to connect to the non-active slaves. This does not block anything, but it pollutes the logs with Trying to connect to XYZ and then fail with Error trying to connect to XYZ because that slave is already not active. I'm guessing that one actor is created per slave to connect, so these actors are left hanging around trying to connect to nothing.

Thanks.

Ma27 commented 6 years ago

Thanks for your bugreport. I'm not entirelysure (yet) where the original cause comes from, but I'll try to have a look at this in the next days :-)

kardapoltsev commented 6 years ago

Should it be like

Sentinel.scala

  def withSlavesAddr[T](initFunction: Seq[(String, Int)] => T): T = {
    import scala.concurrent.duration._

    val fslaves = Future.sequence(sentinelClients.values.map(_.slaves(master)))
      .map { lm =>
        val ipPortBuilder = Set.newBuilder[(String, Int)]
        for {
          slaves <- lm
          slave <- slaves
          ip <- slave.get("ip")
          port <- slave.get("port")
          isDown = slave.get("flags").exists(_.split(",").contains("s_down")) // check s_down flag
          if !isDown // filter out inactive slaves
        } yield {
          ipPortBuilder += ip -> port.toInt
        }
        initFunction(ipPortBuilder.result().toSeq)
      }

    Await.result(fslaves, 15 seconds)
  }

Ma27 commented 6 years ago

sorry for the delay, your change seems fairly reasonable to me :) I'll give it a try and add another test case for this, then I'd push that change to a feature branch for testing purposes...

Ma27 commented 6 years ago

ok, didn't have sufficient time to get a working testcase (I don't think that weshould merge this without a working testcase).

@kardapoltsev is this just a proposal or did you test your patch? If yes could you write a simple testcase? Unless I'll give it a try in the next days :)

Ma27 commented 6 years ago

@rtenagom could you please test the change? I'm currently not sure how to properly simulate this in a more-or-less reproducible test case, so it would be cool if at least someone has confirmed this fix in his environment.

kardapoltsev commented 6 years ago

I will not have a time at least next two weeks to test it :(

rtenagom commented 6 years ago

We found a workaround some weeks ago (basically refreshing the list of slaves in sentinels periodically). Sorry @Ma27 but I don't have much time to test it these days.

Implementation looks OK to me, at a first sight.

Ma27 / rediscala

Connecting to a sentinel should not take into account s_down slaves #10