Open rtenagom opened 6 years ago
Thanks for your bugreport. I'm not entirelysure (yet) where the original cause comes from, but I'll try to have a look at this in the next days :-)
Should it be like
Sentinel.scala
def withSlavesAddr[T](initFunction: Seq[(String, Int)] => T): T = {
import scala.concurrent.duration._
val fslaves = Future.sequence(sentinelClients.values.map(_.slaves(master)))
.map { lm =>
val ipPortBuilder = Set.newBuilder[(String, Int)]
for {
slaves <- lm
slave <- slaves
ip <- slave.get("ip")
port <- slave.get("port")
isDown = slave.get("flags").exists(_.split(",").contains("s_down")) // check s_down flag
if !isDown // filter out inactive slaves
} yield {
ipPortBuilder += ip -> port.toInt
}
initFunction(ipPortBuilder.result().toSeq)
}
Await.result(fslaves, 15 seconds)
}
sorry for the delay, your change seems fairly reasonable to me :) I'll give it a try and add another test case for this, then I'd push that change to a feature branch for testing purposes...
ok, didn't have sufficient time to get a working testcase (I don't think that weshould merge this without a working testcase).
@kardapoltsev is this just a proposal or did you test your patch? If yes could you write a simple testcase? Unless I'll give it a try in the next days :)
@rtenagom could you please test the change? I'm currently not sure how to properly simulate this in a more-or-less reproducible test case, so it would be cool if at least someone has confirmed this fix in his environment.
I will not have a time at least next two weeks to test it :(
We found a workaround some weeks ago (basically refreshing the list of slaves in sentinels periodically). Sorry @Ma27 but I don't have much time to test it these days.
Implementation looks OK to me, at a first sight.
We are using a Redis HA deployment that has 3 sentinels, 1 master, and 2 slaves. When using rediscala client for scala, we use SentinelMonitoredRedisClientMasterSlaves.
There is a bug in the code that, when this client gets the list of slaves from the sentinel, it gets all the slaves that the sentinel has recorded, regardless of the flags. As you may know, a slave recorded in a sentinel has a property called
flag
, that usually contains the valueslave
. If a slave goes down, sentinels record it and also adds the flags_down
to the list of flags of that slave.So basically a sentinel could have a list of 4 slaves, but only 2 of them would be active, and the rest of them would have
flags="slave,s_down"
. The result of this is that we see in the logs that the client successfully connects to the active slaves, but also tries indefinitely to connect to the non-active slaves. This does not block anything, but it pollutes the logs withTrying to connect to XYZ
and then fail withError trying to connect to XYZ
because that slave is already not active. I'm guessing that one actor is created per slave to connect, so these actors are left hanging around trying to connect to nothing.Thanks.